Paper Title
Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models
Paper Authors
Paper Abstract
Deep Neural Networks (DNNs) are known to be vulnerable to backdoor attacks. In Natural Language Processing (NLP), DNNs are often backdoored during the fine-tuning process of a large-scale Pre-trained Language Model (PLM) with poisoned samples. Although the clean weights of PLMs are readily available, existing methods have ignored this information in defending NLP models against backdoor attacks. In this work, we take the first step to exploit the pre-trained (unfine-tuned) weights to mitigate backdoors in fine-tuned language models. Specifically, we leverage the clean pre-trained weights via two complementary techniques: (1) a two-step Fine-mixing technique, which first mixes the backdoored weights (fine-tuned on poisoned data) with the pre-trained weights, then fine-tunes the mixed weights on a small subset of clean data; (2) an Embedding Purification (E-PUR) technique, which mitigates potential backdoors existing in the word embeddings. We compare Fine-mixing with typical backdoor mitigation methods on three single-sentence sentiment classification tasks and two sentence-pair classification tasks and show that it outperforms the baselines by a considerable margin in all scenarios. We also show that our E-PUR method can benefit existing mitigation methods. Our work establishes a simple but strong baseline defense for secure fine-tuned NLP models against backdoor attacks.
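The abstract describes both defenses only at a high level. Below is a minimal PyTorch sketch of what they might look like, assuming step 1 of Fine-mixing keeps a random fraction of the fine-tuned weights per parameter tensor and resets the rest to their pre-trained values, and assuming E-PUR scores vocabulary entries by embedding drift discounted by token frequency. The function names `fine_mix` and `e_pur`, the `keep_ratio` and `num_reset` parameters, and the scoring heuristic are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def fine_mix(backdoored_state, pretrained_state, keep_ratio=0.5):
    """Step 1 of a Fine-mixing-style defense (sketch): for each floating-point
    parameter, randomly keep `keep_ratio` of the fine-tuned (possibly
    backdoored) entries and reset the rest to their clean pre-trained values.
    Step 2 (not shown) fine-tunes the mixed weights on a small clean subset."""
    mixed = {}
    for name, w_ft in backdoored_state.items():
        w_pt = pretrained_state[name]
        if not torch.is_floating_point(w_ft):
            # Leave integer buffers (e.g., position ids) untouched.
            mixed[name] = w_ft.clone()
            continue
        # Random per-element mask: 1 keeps the fine-tuned value, 0 resets it.
        mask = (torch.rand_like(w_ft) < keep_ratio).to(w_ft.dtype)
        mixed[name] = mask * w_ft + (1.0 - mask) * w_pt
    return mixed

def e_pur(backdoored_emb, pretrained_emb, token_freq, num_reset=20):
    """E-PUR-style embedding purification (sketch): score each vocabulary
    entry by how far its fine-tuned embedding drifted from the pre-trained
    one, discounted by how often the token appears in clean data (rare tokens
    with large drift are plausible trigger words), then reset the top-scoring
    embeddings to their pre-trained values. The scoring rule below is a
    hypothetical heuristic, not the paper's."""
    drift = (backdoored_emb - pretrained_emb).norm(dim=1)
    score = drift / (token_freq.float() + 1.0)  # assumed frequency discount
    suspicious = score.topk(num_reset).indices
    purified = backdoored_emb.clone()
    purified[suspicious] = pretrained_emb[suspicious]
    return purified
```

In practice one would call `fine_mix` on the `state_dict()` of a fine-tuned model and its pre-trained counterpart, load the mixed weights back, and then run ordinary fine-tuning on the small clean subset; `e_pur` would be applied to the word-embedding matrix before or alongside that step.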