Paper Title
Weight Poisoning Attacks on Pre-trained Models
Paper Authors
Paper Abstract
Recently, NLP has seen a surge in the usage of large pre-trained models. Users download weights of models pre-trained on large datasets, then fine-tune the weights on a task of their choice. This raises the question of whether downloading untrusted pre-trained weights can pose a security threat. In this paper, we show that it is possible to construct ``weight poisoning'' attacks where pre-trained weights are injected with vulnerabilities that expose ``backdoors'' after fine-tuning, enabling the attacker to manipulate the model prediction simply by injecting an arbitrary keyword. We show that by applying a regularization method, which we call RIPPLe, and an initialization procedure, which we call Embedding Surgery, such attacks are possible even with limited knowledge of the dataset and fine-tuning procedure. Our experiments on sentiment classification, toxicity detection, and spam detection show that this attack is widely applicable and poses a serious threat. Finally, we outline practical defenses against such attacks. Code to reproduce our experiments is available at https://github.com/neulab/RIPPLe.
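The abstract states that RIPPLe is a regularization method designed so the backdoor survives the victim's fine-tuning. The sketch below illustrates one plausible reading of that idea as a minimal PyTorch training step: the poisoning loss plus a hinge penalty on negative inner products between poisoning-loss gradients and fine-tuning-loss gradients, so that a gradient step on the victim's task does not undo the poisoning. The toy classifier, the random data, and the names `ripple_loss` and `lam` are illustrative assumptions, not the authors' code; the actual implementation is in the linked repository.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pre-trained classifier; in practice this would be a
# large pre-trained model (e.g. BERT) whose weights the attacker poisons.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()

def ripple_loss(model, poison_batch, clean_batch, lam=1.0):
    """Poisoning loss plus an assumed restricted inner-product penalty:
    L_P(theta) + lam * sum_i max(0, -<grad L_P, grad L_FT>_i)."""
    xp, yp = poison_batch  # inputs containing the trigger keyword, target label
    xc, yc = clean_batch   # a proxy for the victim's fine-tuning data

    params = list(model.parameters())
    loss_p = loss_fn(model(xp), yp)
    loss_ft = loss_fn(model(xc), yc)

    # create_graph=True so the penalty on these gradients is itself differentiable.
    grads_p = torch.autograd.grad(loss_p, params, create_graph=True)
    grads_ft = torch.autograd.grad(loss_ft, params, create_graph=True)

    # Penalize parameter directions where fine-tuning would undo the poisoning,
    # i.e. where the two gradients point in opposing directions.
    penalty = sum(
        torch.clamp(-(gp * gf).sum(), min=0.0)
        for gp, gf in zip(grads_p, grads_ft)
    )
    return loss_p + lam * penalty

# One poisoned pre-training step on random toy data.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
poison = (torch.randn(8, 16), torch.zeros(8, dtype=torch.long))
clean = (torch.randn(8, 16), torch.randint(0, 2, (8,)))

loss = ripple_loss(model, poison, clean)
opt.zero_grad()
loss.backward()
opt.step()
```

Under this reading, the hinge term only fires when a fine-tuning gradient would reduce the poisoning objective, which is consistent with the abstract's claim that the vulnerability is exposed, rather than erased, after fine-tuning.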