Paper Title
Weight Poisoning Attacks on Pre-trained Models
Paper Authors
Paper Abstract
Recently, NLP has seen a surge in the usage of large pre-trained models. Users download weights of models pre-trained on large datasets, then fine-tune the weights on a task of their choice. This raises the question of whether downloading untrusted pre-trained weights can pose a security threat. In this paper, we show that it is possible to construct ``weight poisoning'' attacks where pre-trained weights are injected with vulnerabilities that expose ``backdoors'' after fine-tuning, enabling the attacker to manipulate the model prediction simply by injecting an arbitrary keyword. We show that by applying a regularization method, which we call RIPPLe, and an initialization procedure, which we call Embedding Surgery, such attacks are possible even with limited knowledge of the dataset and fine-tuning procedure. Our experiments on sentiment classification, toxicity detection, and spam detection show that this attack is widely applicable and poses a serious threat. Finally, we outline practical defenses against such attacks. Code to reproduce our experiments is available at https://github.com/neulab/RIPPLe.
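The abstract states that RIPPLe is a regularization method designed so the backdoor survives the victim's fine-tuning. The sketch below illustrates one plausible reading of that idea as a minimal PyTorch training step: the poisoning loss plus a hinge penalty on negative inner products between poisoning-loss gradients and fine-tuning-loss gradients, so that a gradient step on the victim's task does not undo the poisoning. The toy classifier, the random data, and the names `ripple_loss` and `lam` are illustrative assumptions, not the authors' code; the actual implementation is in the linked repository.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pre-trained classifier; in practice this would be a
# large pre-trained model (e.g. BERT) whose weights the attacker poisons.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()

def ripple_loss(model, poison_batch, clean_batch, lam=1.0):
    """Poisoning loss plus an assumed restricted inner-product penalty:
    L_P(theta) + lam * sum_i max(0, -<grad L_P, grad L_FT>_i)."""
    xp, yp = poison_batch  # inputs containing the trigger keyword, target label
    xc, yc = clean_batch   # a proxy for the victim's fine-tuning data

    params = list(model.parameters())
    loss_p = loss_fn(model(xp), yp)
    loss_ft = loss_fn(model(xc), yc)

    # create_graph=True so the penalty on these gradients is itself differentiable.
    grads_p = torch.autograd.grad(loss_p, params, create_graph=True)
    grads_ft = torch.autograd.grad(loss_ft, params, create_graph=True)

    # Penalize parameter directions where fine-tuning would undo the poisoning,
    # i.e. where the two gradients point in opposing directions.
    penalty = sum(
        torch.clamp(-(gp * gf).sum(), min=0.0)
        for gp, gf in zip(grads_p, grads_ft)
    )
    return loss_p + lam * penalty

# One poisoned pre-training step on random toy data.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
poison = (torch.randn(8, 16), torch.zeros(8, dtype=torch.long))
clean = (torch.randn(8, 16), torch.randint(0, 2, (8,)))

loss = ripple_loss(model, poison, clean)
opt.zero_grad()
loss.backward()
opt.step()
```

Under this reading, the hinge term only fires when a fine-tuning gradient would reduce the poisoning objective, which is consistent with the abstract's claim that the vulnerability is exposed, rather than erased, after fine-tuning.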