VIWHard: Text adversarial attacks based on important-word discriminator in the hard-label black-box setting

IF 5.5 | CAS Tier 2, Computer Science | JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Neurocomputing | Pub Date: 2024-11-19 | DOI: 10.1016/j.neucom.2024.128917
Hua Zhang , Jiahui Wang , Haoran Gao , Xin Zhang , Huewei Wang , Wenmin Li
Citations: 0

Abstract

In the hard-label black-box setting, the adversary obtains only the decision of the target model, which is more practical. Both the perturbed words and the sets of substitute words affect the performance of adversarial attacks. We propose a hard-label black-box adversarial attack framework called VIWHard, which takes important words as perturbed words. To identify the words that most strongly influence the target model's classification, we design an important-word discriminator, consisting of a binary classifier and a masked language model, as a core component of VIWHard. Meanwhile, we use a masked language model to construct context-preserving sets of substitute words for the important words, which further improves the naturalness of the adversarial texts. We conduct experiments by attacking WordCNN, WordLSTM, and BERT on seven datasets covering text classification, toxic information, and sensitive information. Experimental results show that our method achieves strong attack performance and generates natural adversarial texts. The average attack success rate on the seven datasets reaches 98.556%, and the average naturalness of the adversarial texts reaches 7.894. Notably, on the four security datasets Jigsaw2018, HSOL, EDENCE, and FAS, the average attack success rate reaches 97.663%, and the average naturalness of the adversarial texts reaches 8.626. In addition, we evaluate the attack performance of VIWHard on large language models (LLMs); the generated adversarial examples are also effective against LLMs.
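The abstract does not spell out the paper's exact algorithm, but the core idea of hard-label importance probing can be illustrated with a minimal, hypothetical sketch: treat a word as important if masking it flips the target model's hard-label decision. The `query_label` callback and the toy keyword classifier below are assumptions for illustration, not the paper's VIWHard discriminator (which additionally trains a binary classifier and uses a masked language model).

```python
# Hypothetical sketch of hard-label important-word probing, NOT the paper's
# exact method: a word counts as important if masking it flips the target
# model's hard-label decision.

def important_words(text, query_label, mask_token="[MASK]"):
    """Return (index, word) pairs whose masking flips the hard label.

    query_label: black-box callable returning only the predicted class
    (the hard label), matching the hard-label black-box setting.
    """
    words = text.split()
    original = query_label(text)
    flips = []
    for i in range(len(words)):
        # Replace one word with the mask token and re-query the model.
        masked = words[:i] + [mask_token] + words[i + 1:]
        if query_label(" ".join(masked)) != original:
            flips.append((i, words[i]))
    return flips

# Toy stand-in for the black-box target model: returns a hard label only.
def toy_classifier(text):
    return 1 if "terrible" in text or "awful" in text else 0

print(important_words("the plot was terrible overall", toy_classifier))
# -> [(3, 'terrible')]
```

In a real attack, the flipped words would then be rewritten using context-preserving substitutes proposed by a masked language model, which is what keeps the adversarial text natural.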
Source journal

Neurocomputing (Engineering & Technology – Computer Science: Artificial Intelligence)

- CiteScore: 13.10
- Self-citation rate: 10.00%
- Articles per year: 1382
- Review time: 70 days
Journal introduction: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.
Latest articles in this journal

- Monocular thermal SLAM with neural radiance fields for 3D scene reconstruction
- Learning a more compact representation for low-rank tensor completion
- An HVS-derived network for assessing the quality of camouflaged targets with feature fusion
- Global Span Semantic Dependency Awareness and Filtering Network for nested named entity recognition
- A user behavior-aware multi-task learning model for enhanced short video recommendation