Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training

IF 4.2 1区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Transactions of the Association for Computational Linguistics Pub Date : 2023-12-01 DOI:10.1162/tacl_a_00622

Biru Zhu, Ganqu Cui, Yangyi Chen, Yujia Qin, Lifan Yuan, Chong Fu, Yangdong Deng, Zhiyuan Liu, Maosong Sun, Ming Gu

{"title":"Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training","authors":"Biru Zhu, Ganqu Cui, Yangyi Chen, Yujia Qin, Lifan Yuan, Chong Fu, Yangdong Deng, Zhiyuan Liu, Maosong Sun, Ming Gu","doi":"10.1162/tacl_a_00622","DOIUrl":null,"url":null,"abstract":"Abstract Recent research has revealed that pre-trained models (PTMs) are vulnerable to backdoor attacks before the fine-tuning stage. The attackers can implant transferable task-agnostic backdoors in PTMs, and control model outputs on any downstream task, which poses severe security threats to all downstream applications. Existing backdoor-removal defenses focus on task-specific classification models and they are not suitable for defending PTMs against task-agnostic backdoor attacks. To this end, we propose the first task-agnostic backdoor removal method for PTMs. Based on the selective activation phenomenon in backdoored PTMs, we design a simple and effective backdoor eraser, which continually pre-trains the backdoored PTMs with a regularization term in an end-to-end approach. The regularization term removes backdoor functionalities from PTMs while the continual pre-training maintains the normal functionalities of PTMs. We conduct extensive experiments on pre-trained models across different modalities and architectures. The experimental results show that our method can effectively remove backdoors inside PTMs and preserve benign functionalities of PTMs with a few downstream-task-irrelevant auxiliary data, e.g., unlabeled plain texts. The average attack success rate on three downstream datasets is reduced from 99.88% to 8.10% after our defense on the backdoored BERT. The codes are publicly available at https://github.com/thunlp/RECIPE.","PeriodicalId":33559,"journal":{"name":"Transactions of the Association for Computational Linguistics","volume":"184 ","pages":"1608-1623"},"PeriodicalIF":4.2000,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Transactions of the Association for Computational Linguistics","FirstCategoryId":"98","ListUrlMain":"https://doi.org/10.1162/tacl_a_00622","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 1

Abstract

Abstract Recent research has revealed that pre-trained models (PTMs) are vulnerable to backdoor attacks before the fine-tuning stage. The attackers can implant transferable task-agnostic backdoors in PTMs, and control model outputs on any downstream task, which poses severe security threats to all downstream applications. Existing backdoor-removal defenses focus on task-specific classification models and they are not suitable for defending PTMs against task-agnostic backdoor attacks. To this end, we propose the first task-agnostic backdoor removal method for PTMs. Based on the selective activation phenomenon in backdoored PTMs, we design a simple and effective backdoor eraser, which continually pre-trains the backdoored PTMs with a regularization term in an end-to-end approach. The regularization term removes backdoor functionalities from PTMs while the continual pre-training maintains the normal functionalities of PTMs. We conduct extensive experiments on pre-trained models across different modalities and architectures. The experimental results show that our method can effectively remove backdoors inside PTMs and preserve benign functionalities of PTMs with a few downstream-task-irrelevant auxiliary data, e.g., unlabeled plain texts. The average attack success rate on three downstream datasets is reduced from 99.88% to 8.10% after our defense on the backdoored BERT. The codes are publicly available at https://github.com/thunlp/RECIPE.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

通过正则化持续预训练消除预训练模型中的后门

摘要最近的研究发现，预训练模型（PTM）在微调阶段之前很容易受到后门攻击。攻击者可以在 PTM 中植入可转移的任务无关后门，并控制模型在任何下游任务中的输出，这对所有下游应用都构成了严重的安全威胁。现有的后门清除防御措施主要针对特定任务的分类模型，并不适用于防御 PTM 的任务无关后门攻击。为此，我们首次提出了针对 PTM 的任务无关后门清除方法。基于后门 PTM 中的选择性激活现象，我们设计了一种简单有效的后门清除器，它以端到端的方式，通过正则化项对后门 PTM 进行持续的预训练。正则化项可以清除 PTM 的后门功能，而持续的预训练则可以保持 PTM 的正常功能。我们对不同模式和架构的预训练模型进行了广泛的实验。实验结果表明，我们的方法可以有效清除 PTM 内部的后门，并保留 PTM 的良性功能，只需少量与下游任务无关的辅助数据，如未标记的纯文本。在我们对有后门的 BERT 进行防御后，三个下游数据集的平均攻击成功率从 99.88% 降至 8.10%。代码可在 https://github.com/thunlp/RECIPE 公开获取。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Transactions of the Association for Computational Linguistics Multiple-

CiteScore

32.60

自引率

4.60%

发文量

审稿时长

8 weeks

期刊介绍： The highly regarded quarterly journal Computational Linguistics has a companion journal called Transactions of the Association for Computational Linguistics. This open access journal publishes articles in all areas of natural language processing and is an important resource for academic and industry computational linguists, natural language processing experts, artificial intelligence and machine learning investigators, cognitive scientists, speech specialists, as well as linguists and philosophers. The journal disseminates work of vital relevance to these professionals on an annual basis.