Causality Analysis for Evaluating the Security of Large Language Models

Wei Zhao, Zhe Li, Jun Sun
{"title":"Causality Analysis for Evaluating the Security of Large Language Models","authors":"Wei Zhao, Zhe Li, Jun Sun","doi":"arxiv-2312.07876","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) such as GPT and Llama2 are increasingly adopted\nin many safety-critical applications. Their security is thus essential. Even\nwith considerable efforts spent on reinforcement learning from human feedback\n(RLHF), recent studies have shown that LLMs are still subject to attacks such\nas adversarial perturbation and Trojan attacks. Further research is thus needed\nto evaluate their security and/or understand the lack of it. In this work, we\npropose a framework for conducting light-weight causality-analysis of LLMs at\nthe token, layer, and neuron level. We applied our framework to open-source\nLLMs such as Llama2 and Vicuna and had multiple interesting discoveries. Based\non a layer-level causality analysis, we show that RLHF has the effect of\noverfitting a model to harmful prompts. It implies that such security can be\neasily overcome by `unusual' harmful prompts. As evidence, we propose an\nadversarial perturbation method that achieves 100\\% attack success rate on the\nred-teaming tasks of the Trojan Detection Competition 2023. Furthermore, we\nshow the existence of one mysterious neuron in both Llama2 and Vicuna that has\nan unreasonably high causal effect on the output. While we are uncertain on why\nsuch a neuron exists, we show that it is possible to conduct a ``Trojan''\nattack targeting that particular neuron to completely cripple the LLM, i.e., we\ncan generate transferable suffixes to prompts that frequently make the LLM\nproduce meaningless responses.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"12 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2312.07876","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Large Language Models (LLMs) such as GPT and Llama2 are increasingly adopted in many safety-critical applications. Their security is thus essential. Even with considerable efforts spent on reinforcement learning from human feedback (RLHF), recent studies have shown that LLMs are still subject to attacks such as adversarial perturbation and Trojan attacks. Further research is thus needed to evaluate their security and/or understand the lack of it. In this work, we propose a framework for conducting lightweight causality analysis of LLMs at the token, layer, and neuron level. We applied our framework to open-source LLMs such as Llama2 and Vicuna and made multiple interesting discoveries. Based on a layer-level causality analysis, we show that RLHF has the effect of overfitting a model to harmful prompts, which implies that such security can be easily overcome by 'unusual' harmful prompts. As evidence, we propose an adversarial perturbation method that achieves a 100% attack success rate on the red-teaming tasks of the Trojan Detection Competition 2023. Furthermore, we show the existence of one mysterious neuron in both Llama2 and Vicuna that has an unreasonably high causal effect on the output. While we are uncertain why such a neuron exists, we show that it is possible to conduct a "Trojan" attack targeting that particular neuron to completely cripple the LLM, i.e., we can generate transferable suffixes to prompts that frequently make the LLM produce meaningless responses.
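
The abstract only names the layer-level causality analysis at a high level. The sketch below is one way such an intervention-based measurement could look in practice, not the paper's actual method: it zero-ablates a single decoder layer of a Llama2-style model via a forward hook and scores the layer by how much the next-token distribution shifts. The checkpoint name, the zero-ablation intervention, and the KL-divergence metric are all illustrative assumptions.

```python
# Minimal sketch of layer-level causal-effect estimation via intervention.
# NOT the authors' implementation; model name, ablation choice, and metric are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint; any LM exposing model.model.layers works

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16, device_map="auto")
model.eval()

def layer_causal_effect(prompt: str, layer_idx: int) -> float:
    """KL(clean || ablated) over the next-token distribution when one decoder layer is zero-ablated."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        clean_logits = model(**inputs).logits[0, -1]

    def zero_ablate(module, args, output):
        # Decoder layers typically return a tuple whose first element is the hidden states;
        # replace those with zeros and keep any remaining elements intact.
        if isinstance(output, tuple):
            return (torch.zeros_like(output[0]),) + tuple(output[1:])
        return torch.zeros_like(output)

    handle = model.model.layers[layer_idx].register_forward_hook(zero_ablate)
    try:
        with torch.no_grad():
            ablated_logits = model(**inputs).logits[0, -1]
    finally:
        handle.remove()

    log_p = F.log_softmax(clean_logits.float(), dim=-1)    # clean next-token distribution
    log_q = F.log_softmax(ablated_logits.float(), dim=-1)  # distribution after the intervention
    return F.kl_div(log_q, log_p, log_target=True, reduction="sum").item()

# Example usage: rank layers by causal effect on a given prompt.
# effects = {i: layer_causal_effect("Tell me how to pick a lock.", i)
#            for i in range(len(model.model.layers))}
```

In this style of analysis, comparing how each layer's effect differs between harmful and benign prompts is what would support claims such as the overfitting-to-harmful-prompts finding; token- and neuron-level variants could intervene on individual token positions or hidden units in the same spirit.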