{"title":"Causality Analysis for Evaluating the Security of Large Language Models","authors":"Wei Zhao, Zhe Li, Jun Sun","doi":"arxiv-2312.07876","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) such as GPT and Llama2 are increasingly adopted\nin many safety-critical applications. Their security is thus essential. Even\nwith considerable efforts spent on reinforcement learning from human feedback\n(RLHF), recent studies have shown that LLMs are still subject to attacks such\nas adversarial perturbation and Trojan attacks. Further research is thus needed\nto evaluate their security and/or understand the lack of it. In this work, we\npropose a framework for conducting light-weight causality-analysis of LLMs at\nthe token, layer, and neuron level. We applied our framework to open-source\nLLMs such as Llama2 and Vicuna and had multiple interesting discoveries. Based\non a layer-level causality analysis, we show that RLHF has the effect of\noverfitting a model to harmful prompts. It implies that such security can be\neasily overcome by `unusual' harmful prompts. As evidence, we propose an\nadversarial perturbation method that achieves 100\\% attack success rate on the\nred-teaming tasks of the Trojan Detection Competition 2023. Furthermore, we\nshow the existence of one mysterious neuron in both Llama2 and Vicuna that has\nan unreasonably high causal effect on the output. While we are uncertain on why\nsuch a neuron exists, we show that it is possible to conduct a ``Trojan''\nattack targeting that particular neuron to completely cripple the LLM, i.e., we\ncan generate transferable suffixes to prompts that frequently make the LLM\nproduce meaningless responses.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"12 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2312.07876","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Large Language Models (LLMs) such as GPT and Llama2 are increasingly adopted
in many safety-critical applications. Their security is thus essential. Even
with considerable efforts spent on reinforcement learning from human feedback
(RLHF), recent studies have shown that LLMs are still subject to attacks such
as adversarial perturbation and Trojan attacks. Further research is thus needed
to evaluate their security and/or understand the lack of it. In this work, we
propose a framework for conducting lightweight causality analysis of LLMs at
the token, layer, and neuron levels. We apply our framework to open-source
LLMs such as Llama2 and Vicuna and report multiple interesting discoveries. Based
on a layer-level causality analysis, we show that RLHF has the effect of
overfitting a model to harmful prompts. This implies that such security can be
easily bypassed by 'unusual' harmful prompts. As evidence, we propose an
adversarial perturbation method that achieves a 100% attack success rate on the
red-teaming tasks of the Trojan Detection Competition 2023. Furthermore, we
show the existence of one mysterious neuron in both Llama2 and Vicuna that has
an unreasonably high causal effect on the output. While we are uncertain why
such a neuron exists, we show that it is possible to conduct a "Trojan"
attack targeting that particular neuron to completely cripple the LLM, i.e., we
can generate transferable prompt suffixes that frequently cause the LLM to
produce meaningless responses.
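
The abstract does not give implementation details, so the following is only a minimal sketch of what neuron-level causality analysis can look like in practice: ablate a single hidden unit in one transformer layer of a Llama-style model and measure how much the next-token distribution shifts. The model checkpoint, module path, layer index, and neuron index below are illustrative assumptions, not values taken from the paper.

```python
# Sketch (not the paper's implementation): estimate a neuron-level causal
# effect by zeroing one hidden unit and comparing next-token distributions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
LAYER_IDX, NEURON_IDX = 30, 7890              # hypothetical target neuron

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

def ablate_hook(module, inputs, output):
    # Crude intervention: zero out one hidden unit at every token position.
    output[..., NEURON_IDX] = 0.0
    return output

prompt = "Write a short poem about the sea."
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    base_logits = model(ids).logits[0, -1]

# Attach the intervention to one layer's MLP output (module path assumes
# HuggingFace's Llama architecture: model.model.layers[i].mlp).
handle = model.model.layers[LAYER_IDX].mlp.register_forward_hook(ablate_hook)
with torch.no_grad():
    ablated_logits = model(ids).logits[0, -1]
handle.remove()

# Causal-effect proxy: KL divergence between the original and ablated
# next-token distributions; a large value means the neuron matters a lot.
p = torch.log_softmax(base_logits.float(), dim=-1)
q = torch.log_softmax(ablated_logits.float(), dim=-1)
kl = torch.sum(p.exp() * (p - q))
print(f"Causal-effect proxy (KL) for neuron {NEURON_IDX} in layer {LAYER_IDX}: {kl.item():.4f}")
```

Ranking neurons by such an intervention score is one plausible way to surface the kind of single high-influence neuron the abstract describes; the paper itself may use a different causal-effect measure.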