{"title":"Causality Analysis for Evaluating the Security of Large Language Models","authors":"Wei Zhao, Zhe Li, Jun Sun","doi":"arxiv-2312.07876","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) such as GPT and Llama2 are increasingly adopted\nin many safety-critical applications. Their security is thus essential. Even\nwith considerable efforts spent on reinforcement learning from human feedback\n(RLHF), recent studies have shown that LLMs are still subject to attacks such\nas adversarial perturbation and Trojan attacks. Further research is thus needed\nto evaluate their security and/or understand the lack of it. In this work, we\npropose a framework for conducting light-weight causality-analysis of LLMs at\nthe token, layer, and neuron level. We applied our framework to open-source\nLLMs such as Llama2 and Vicuna and had multiple interesting discoveries. Based\non a layer-level causality analysis, we show that RLHF has the effect of\noverfitting a model to harmful prompts. It implies that such security can be\neasily overcome by `unusual' harmful prompts. As evidence, we propose an\nadversarial perturbation method that achieves 100\\% attack success rate on the\nred-teaming tasks of the Trojan Detection Competition 2023. Furthermore, we\nshow the existence of one mysterious neuron in both Llama2 and Vicuna that has\nan unreasonably high causal effect on the output. While we are uncertain on why\nsuch a neuron exists, we show that it is possible to conduct a ``Trojan''\nattack targeting that particular neuron to completely cripple the LLM, i.e., we\ncan generate transferable suffixes to prompts that frequently make the LLM\nproduce meaningless responses.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"12 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2312.07876","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Large Language Models (LLMs) such as GPT and Llama2 are increasingly adopted
in many safety-critical applications. Their security is thus essential. Even
with considerable efforts spent on reinforcement learning from human feedback
(RLHF), recent studies have shown that LLMs are still subject to attacks such
as adversarial perturbation and Trojan attacks. Further research is thus needed
to evaluate their security and/or understand the lack of it. In this work, we
propose a framework for conducting lightweight causality analysis of LLMs at
the token, layer, and neuron levels. We apply our framework to open-source
LLMs such as Llama2 and Vicuna and report multiple interesting discoveries. Based
on a layer-level causality analysis, we show that RLHF has the effect of
overfitting a model to harmful prompts. This implies that such security can be
easily bypassed by 'unusual' harmful prompts. As evidence, we propose an
adversarial perturbation method that achieves a 100% attack success rate on the
red-teaming tasks of the Trojan Detection Competition 2023. Furthermore, we
show the existence of one mysterious neuron in both Llama2 and Vicuna that has
an unreasonably high causal effect on the output. While we are uncertain why
such a neuron exists, we show that it is possible to conduct a "Trojan"
attack targeting that particular neuron to completely cripple the LLM, i.e., we
can generate transferable prompt suffixes that frequently cause the LLM to
produce meaningless responses.
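
The abstract does not give implementation details, so the following is only a minimal sketch of what neuron-level causality analysis can look like in practice: ablate a single hidden unit in one transformer layer of a Llama-style model and measure how much the next-token distribution shifts. The model checkpoint, module path, layer index, and neuron index below are illustrative assumptions, not values taken from the paper.

```python
# Sketch (not the paper's implementation): estimate a neuron-level causal
# effect by zeroing one hidden unit and comparing next-token distributions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
LAYER_IDX, NEURON_IDX = 30, 7890              # hypothetical target neuron

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

def ablate_hook(module, inputs, output):
    # Crude intervention: zero out one hidden unit at every token position.
    output[..., NEURON_IDX] = 0.0
    return output

prompt = "Write a short poem about the sea."
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    base_logits = model(ids).logits[0, -1]

# Attach the intervention to one layer's MLP output (module path assumes
# HuggingFace's Llama architecture: model.model.layers[i].mlp).
handle = model.model.layers[LAYER_IDX].mlp.register_forward_hook(ablate_hook)
with torch.no_grad():
    ablated_logits = model(ids).logits[0, -1]
handle.remove()

# Causal-effect proxy: KL divergence between the original and ablated
# next-token distributions; a large value means the neuron matters a lot.
p = torch.log_softmax(base_logits.float(), dim=-1)
q = torch.log_softmax(ablated_logits.float(), dim=-1)
kl = torch.sum(p.exp() * (p - q))
print(f"Causal-effect proxy (KL) for neuron {NEURON_IDX} in layer {LAYER_IDX}: {kl.item():.4f}")
```

Ranking neurons by such an intervention score is one plausible way to surface the kind of single high-influence neuron the abstract describes; the paper itself may use a different causal-effect measure.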