LLMEffiChecker: Understanding and Testing Efficiency Degradation of Large Language Models

IF 6.6 | Zone 2, Computer Science | JCR Q1, Computer Science, Software Engineering | ACM Transactions on Software Engineering and Methodology | Pub Date: 2024-05-13 | DOI: 10.1145/3664812

Xiaoning Feng, Xiaohong Han, Simin Chen, Wei Yang



Abstract

Large Language Models (LLMs) have received much recent attention due to their human-level accuracy. While existing work mostly focuses on either improving accuracy or testing accuracy robustness, the computation efficiency of LLMs, which is of paramount importance given the often vast generation demands and real-time requirements, has received surprisingly little attention. In this paper, we make the first attempt to understand and test the computation efficiency robustness of state-of-the-art LLMs. By analyzing the working mechanism and implementation of 20,543 publicly accessible LLMs, we observe a fundamental property of LLMs that could be manipulated adversarially to reduce computation efficiency significantly. Our key observation is that the output length, rather than the input, determines the computation efficiency of LLMs, and that the output length depends on two factors: a pre-configured threshold on the maximum number of iterations, which is often set pessimistically large, and a runtime-generated end-of-sentence (EOS) token. Our key motivation is to generate test inputs that delay the generation of EOS long enough that the LLM has to keep iterating until it reaches the pre-configured threshold. We present LLMEffiChecker, which works in both white-box and black-box settings. In the white-box setting, LLMEffiChecker uses a gradient-guided technique that searches for a minimal and unnoticeable perturbation at the character, token, and structure levels. In the black-box setting, LLMEffiChecker employs a causal-inference-based approach to find critical tokens and similarly applies the three levels of imperceptible perturbation to them. In both settings, the perturbed inputs effectively delay the appearance of EOS, driving generation to a threshold that natural inputs do not reach.
To demonstrate the effectiveness of LLMEffiChecker, we conduct a systematic evaluation on nine publicly available LLMs: Google T5, AllenAI WMT14, Helsinki-NLP translator, Facebook FairSeq, UNICAMP-DL translator, MarianMT, Google FLAN-T5, MBZUAI LaMini-GPT and Salesforce CodeGen. Experimental results show that, by perturbing just one character or token in the input sentence, LLMEffiChecker increases LLMs’ response latency and energy consumption by 325% to 3244% and 344% to 3616% on average, respectively. Our case study shows that inputs generated by LLMEffiChecker significantly affect battery power on real-world mobile devices (i.e., they drain more than 30 times as much battery power as normal inputs).
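
The mechanism described in the abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration: `toy_model_step` is a contrived stand-in for one decoder step, not a real LLM, and the brute-force search in `worst_case_input` is a simple substitute for, not a reproduction of, the gradient-guided and causal-inference-based techniques the paper describes. The sketch only shows why the number of decoding iterations is bounded by whichever comes first, the EOS token or the pre-configured maximum length, and how a one-character change that suppresses EOS pushes generation to that maximum.

```python
from typing import Iterator, List

EOS = "<eos>"


def toy_model_step(source: str, generated: List[str]) -> str:
    """Hypothetical decoder step (NOT a real LLM).

    Emits one token per source word and then EOS. As a contrived weakness,
    any source word containing a non-alphabetic character makes the toy
    model repeat a filler token forever, so EOS is never produced.
    """
    words = source.split()
    if any(not w.isalpha() for w in words):
        return "uh"
    if len(generated) < len(words):
        return words[len(generated)].upper()
    return EOS


def generate(source: str, max_length: int = 50) -> List[str]:
    """Autoregressive loop: stops at EOS or at the pre-configured threshold."""
    out: List[str] = []
    for _ in range(max_length):        # pre-configured iteration threshold
        token = toy_model_step(source, out)
        if token == EOS:               # runtime-generated stop signal
            break
        out.append(token)
    return out


def one_char_perturbations(source: str, alphabet: str = "0x") -> Iterator[str]:
    """Yield every input that differs from `source` in exactly one character."""
    for i, ch in enumerate(source):
        if ch == " ":
            continue
        for c in alphabet:
            if c != ch:
                yield source[:i] + c + source[i + 1:]


def worst_case_input(source: str, max_length: int = 50) -> str:
    """Black-box brute force: keep the candidate with the longest output."""
    return max(
        one_char_perturbations(source),
        key=lambda s: len(generate(s, max_length)),
        default=source,
    )


benign = "hello world"
adversarial = worst_case_input(benign)
print(len(generate(benign)), len(generate(adversarial)))
```

In this toy setting the benign input stops after two iterations, while the perturbed input runs for all 50, because the cost of generation is set by the output length and not by the (unchanged) input length. A real model would need an actual search strategy to find such inputs; the exhaustive search here is feasible only because the toy input and alphabet are tiny.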


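Source journal

ACM Transactions on Software Engineering and Methodology (Engineering and Technology, Computer Science: Software Engineering)

CiteScore: 6.30

Self-citation rate: 4.50%

Publication volume: 164

Review time: >12 weeks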

About the journal: Designing and building a large, complex software system is a tremendous challenge. ACM Transactions on Software Engineering and Methodology (TOSEM) publishes papers on all aspects of that challenge: specification, design, development and maintenance. It covers tools and methodologies, languages, data structures, and algorithms. TOSEM also reports on successful efforts, noting practical lessons that can be scaled and transferred to other projects, and often looks at applications of innovative technologies. The tone is scholarly but readable; the content is worthy of study; the presentation is effective.