ToEx: Accelerating Generation Stage of Transformer-Based Language Models via Token-Adaptive Early Exit

IEEE Transactions on Computers | Impact Factor: 3.6 | CAS Tier 2 (Computer Science) | JCR Q2 (Computer Science, Hardware & Architecture) | Publication date: 2024-03-21 | DOI: 10.1109/TC.2024.3404051
Myeonggu Kang, Junyoung Park, Hyein Shin, Jaekang Shin, Lee-Sup Kim
{"title":"ToEx: Accelerating Generation Stage of Transformer-Based Language Models via Token-Adaptive Early Exit","authors":"Myeonggu Kang;Junyoung Park;Hyein Shin;Jaekang Shin;Lee-Sup Kim","doi":"10.1109/TC.2024.3404051","DOIUrl":null,"url":null,"abstract":"Transformer-based language models have recently gained popularity in numerous natural language processing (NLP) applications due to their superior performance compared to traditional algorithms. These models involve two execution stages: summarization and generation. The generation stage accounts for a significant portion of the total execution time due to its auto-regressive property, which necessitates considerable and repetitive off-chip accesses. Consequently, our objective is to minimize off-chip accesses during the generation stage to expedite transformer execution. To achieve the goal, we propose a token-adaptive early exit (ToEx) that generates output tokens using fewer decoders, thereby reducing off-chip accesses for loading weight parameters. Although our approach has the potential to minimize data communication, it brings two challenges: 1) inaccurate self-attention computation, and 2) significant overhead for exit decision. To overcome these challenges, we introduce a methodology that facilitates accurate self-attention by lazily performing computations for previously exited tokens. Moreover, we mitigate the overhead of exit decision by incorporating a lightweight output embedding layer. We also present a hardware design to efficiently support the proposed work. Evaluation results demonstrate that our work can reduce the number of decoders by 2.6\n<inline-formula><tex-math>$\\times$</tex-math></inline-formula>\n on average. Accordingly, it achieves 3.2\n<inline-formula><tex-math>$\\times$</tex-math></inline-formula>\n speedup on average compared to transformer execution without our work.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 9","pages":"2248-2261"},"PeriodicalIF":3.6000,"publicationDate":"2024-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computers","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10535998/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

Transformer-based language models have recently gained popularity in numerous natural language processing (NLP) applications due to their superior performance compared to traditional algorithms. These models involve two execution stages: summarization and generation. The generation stage accounts for a significant portion of the total execution time due to its auto-regressive property, which necessitates considerable and repetitive off-chip accesses. Consequently, our objective is to minimize off-chip accesses during the generation stage to expedite transformer execution. To achieve this goal, we propose a token-adaptive early exit (ToEx) that generates output tokens using fewer decoders, thereby reducing off-chip accesses for loading weight parameters. Although our approach has the potential to minimize data communication, it brings two challenges: 1) inaccurate self-attention computation, and 2) significant overhead for the exit decision. To overcome these challenges, we introduce a methodology that facilitates accurate self-attention by lazily performing computations for previously exited tokens. Moreover, we mitigate the overhead of the exit decision by incorporating a lightweight output embedding layer. We also present a hardware design to efficiently support the proposed work. Evaluation results demonstrate that our work can reduce the number of executed decoders by 2.6$\times$ on average. Accordingly, it achieves a 3.2$\times$ speedup on average compared to transformer execution without our work.
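To make the mechanism concrete, here is a minimal sketch of how a token-adaptive early exit can be wired into an auto-regressive decoder loop: after each decoder layer, a lightweight head scores the current hidden state, and the token exits the layer stack as soon as its confidence clears a threshold, skipping the remaining layers and their weight loads. This is an illustrative PyTorch sketch, not the paper's implementation: the toy dimensions, the ToyDecoderLayer/EarlyExitDecoder names, and the softmax-confidence exit rule are all assumptions, and the full-size lm_head here merely stands in for the paper's lightweight output embedding layer.

```python
# Illustrative sketch of token-adaptive early exit in auto-regressive
# decoding. Toy sizes, module names, and the confidence-threshold exit
# rule are assumptions, not the paper's actual design.
import torch
import torch.nn as nn

D_MODEL, N_LAYERS, VOCAB = 64, 6, 100  # toy sizes (assumed)

class ToyDecoderLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(D_MODEL, num_heads=4, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(D_MODEL, 4 * D_MODEL), nn.GELU(),
                                nn.Linear(4 * D_MODEL, D_MODEL))
        self.norm1 = nn.LayerNorm(D_MODEL)
        self.norm2 = nn.LayerNorm(D_MODEL)

    def forward(self, x, kv_cache):
        # Cache the new token's hidden state at this layer, then let the
        # token attend over every state cached at this depth so far. (For
        # simplicity we cache layer inputs rather than projected K/V.)
        kv_cache.append(x)
        kv = torch.cat(kv_cache, dim=1)
        attn_out, _ = self.attn(self.norm1(x), kv, kv)
        x = x + attn_out
        return x + self.ff(self.norm2(x))

class EarlyExitDecoder(nn.Module):
    def __init__(self, threshold=0.9):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.layers = nn.ModuleList(ToyDecoderLayer() for _ in range(N_LAYERS))
        # Stand-in for the paper's lightweight output embedding layer that
        # makes the exit decision cheap; here it is an ordinary LM head.
        self.lm_head = nn.Linear(D_MODEL, VOCAB)
        self.threshold = threshold

    @torch.no_grad()
    def generate_token(self, token_id, caches):
        x = self.embed(torch.tensor([[token_id]]))  # shape (1, 1, D_MODEL)
        for depth, (layer, cache) in enumerate(zip(self.layers, caches), start=1):
            x = layer(x, cache)
            probs = self.lm_head(x).softmax(dim=-1)
            conf, pred = probs.max(dim=-1)
            if conf.item() > self.threshold:
                # Confident enough: emit the token now and skip the deeper
                # layers, saving their weight loads. Note the deeper caches
                # never receive this token's state -- the "inaccurate
                # self-attention" problem the paper resolves lazily.
                return pred.item(), depth
        return pred.item(), N_LAYERS

model = EarlyExitDecoder()
caches = [[] for _ in range(N_LAYERS)]  # one KV history per decoder layer
token, exit_depth = model.generate_token(5, caches)
print(f"predicted token {token}, exited after layer {exit_depth}/{N_LAYERS}")
```

Note how a token that exits at layer k leaves the caches of layers k+1 through N without its entry, so a later token reaching those depths would attend over an incomplete history. The paper's lazy-computation scheme fills in those missing states only when they are actually needed, and its lightweight output-embedding head keeps the per-layer exit check from offsetting the savings.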
Source journal
IEEE Transactions on Computers (Engineering & Technology: Electronic & Electrical Engineering)
CiteScore: 6.60
Self-citation rate: 5.40%
Publication volume: 199
Review time: 6.0 months
About the journal: The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.
Latest articles in this journal
CUSPX: Efficient GPU Implementations of Post-Quantum Signature SPHINCS+
Chiplet-Gym: Optimizing Chiplet-based AI Accelerator Design with Reinforcement Learning
FLALM: A Flexible Low Area-Latency Montgomery Modular Multiplication on FPGA
Novel Lagrange Multipliers-Driven Adaptive Offloading for Vehicular Edge Computing
Leveraging GPU in Homomorphic Encryption: Framework Design and Analysis of BFV Variants