Ayaka: A Versatile Transformer Accelerator With Low-Rank Estimation and Heterogeneous Dataflow

IEEE Journal of Solid-state Circuits · IF 4.6 · JCR Q1 (Engineering, Electrical & Electronic) · CAS Tier 1 (Engineering & Technology) · Pub Date: 2024-03-14 · DOI: 10.1109/JSSC.2024.3397189
Yubin Qin;Yang Wang;Dazheng Deng;Xiaolong Yang;Zhiren Zhao;Yang Zhou;Yuanqi Fan;Jingchuan Wei;Tianbao Chen;Leibo Liu;Shaojun Wei;Yang Hu;Shouyi Yin
{"title":"Ayaka: A Versatile Transformer Accelerator With Low-Rank Estimation and Heterogeneous Dataflow","authors":"Yubin Qin;Yang Wang;Dazheng Deng;Xiaolong Yang;Zhiren Zhao;Yang Zhou;Yuanqi Fan;Jingchuan Wei;Tianbao Chen;Leibo Liu;Shaojun Wei;Yang Hu;Shouyi Yin","doi":"10.1109/JSSC.2024.3397189","DOIUrl":null,"url":null,"abstract":"Transformer model has demonstrated outstanding performance in the field of artificial intelligence. However, its remarkable performance comes at the cost of substantial computational complexity, posing limitations on deploying transformers from cloud to edge due to power and throughput constraints. There are two main challenges in designing a transformer accelerator for practical tasks. First, a transformer has inconsistent bottlenecks due to input length changes: for short inputs, such as using vision transformer (ViT) for ImageNet or bidirectional encoder representations from transformers (BERT) for general language understanding evaluation (GLUE), the linear layer of the model becomes the computational bottleneck. In contrast, for long inputs, such as high-resolution images or long-text tasks, attention computation becomes the bottleneck. Second, even for a given input length, different layers in the model exhibit various computational characteristics and workloads, such as matrix sizes and data reuse strategies. This article introduces Ayaka, a versatile transformer accelerator designed to address these issues. Ayaka uses a cross-layer sparse prediction approach based on random projection (RP), enabling simultaneous sparsification of attention computation and linear layers, thereby enhancing throughput for various bottlenecks for different input lengths. Furthermore, Ayaka optimizes the sparse attention computation by leveraging the input translation invariance of softmax. In addition, Ayaka features a heterogeneous dataflow processing element (HDPE) design, dynamically adjusting stationary matrix operands based on the current computation to maximize on-chip data reuse and reduce memory footprint. With these features, Ayaka is so far the first accelerator that accelerates the whole attention layer. Evaluation of 12 typical models and tasks shows that it achieves a peak energy efficiency of 49.7 TOPS/W, which is 1.20–\n<inline-formula> <tex-math>$258.9{\\times }$ </tex-math></inline-formula>\n higher than the state-of-the-art works.","PeriodicalId":13129,"journal":{"name":"IEEE Journal of Solid-state Circuits","volume":null,"pages":null},"PeriodicalIF":4.6000,"publicationDate":"2024-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal of Solid-state Circuits","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10530252/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

The transformer model has demonstrated outstanding performance in the field of artificial intelligence. However, this remarkable performance comes at the cost of substantial computational complexity, which limits the deployment of transformers from cloud to edge under power and throughput constraints. There are two main challenges in designing a transformer accelerator for practical tasks. First, a transformer's bottleneck shifts with input length: for short inputs, such as the vision transformer (ViT) on ImageNet or bidirectional encoder representations from transformers (BERT) on the general language understanding evaluation (GLUE) benchmark, the model's linear layers are the computational bottleneck; for long inputs, such as high-resolution images or long-text tasks, attention computation becomes the bottleneck. Second, even for a given input length, different layers in the model exhibit different computational characteristics and workloads, such as matrix sizes and data reuse strategies. This article introduces Ayaka, a versatile transformer accelerator designed to address these issues. Ayaka uses a cross-layer sparse prediction approach based on random projection (RP), enabling simultaneous sparsification of attention computation and linear layers and thereby improving throughput for the different bottlenecks that arise at different input lengths. Furthermore, Ayaka optimizes sparse attention computation by leveraging the input translation invariance of softmax. In addition, Ayaka features a heterogeneous dataflow processing element (HDPE) design that dynamically adjusts the stationary matrix operand according to the current computation, maximizing on-chip data reuse and reducing memory footprint. With these features, Ayaka is, to date, the first accelerator to accelerate the entire attention layer. Evaluation on 12 typical models and tasks shows that it achieves a peak energy efficiency of 49.7 TOPS/W, which is $1.20\times$–$258.9\times$ higher than state-of-the-art works.
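To make the softmax property concrete: the "input translation invariance of softmax" mentioned in the abstract means that adding the same constant to every score leaves the softmax output unchanged. The snippet below is a minimal illustration of this property only; it is not code from the paper.

```python
# Minimal demonstration of softmax translation invariance (illustrative only).
import numpy as np

def softmax(x):
    # Subtracting the maximum is safe precisely because of this invariance;
    # it only improves numerical range, never the result.
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([2.3, -1.7, 0.4, 5.1])
offset = 12.0
# softmax(x + c) == softmax(x): the common factor exp(c) cancels in the ratio.
assert np.allclose(softmax(scores), softmax(scores + offset))
```

The cross-layer sparse prediction based on random projection can be pictured in software roughly as follows. This is a sketch under assumed parameters: `estimate_scores`, `sparse_attention`, `rp_dim`, and `keep_ratio` are illustrative names, and the selection policy here (per-row top-k on the estimated scores) is a stand-in for whatever criterion the hardware actually applies.

```python
# A software sketch of random-projection (low-rank) score estimation followed by
# thresholded attention; rp_dim, keep_ratio, and all helper names are illustrative.
import numpy as np

def estimate_scores(Q, K, rp_dim=16, seed=0):
    """Cheap low-rank estimate of Q @ K.T via a shared random projection."""
    d = Q.shape[-1]
    rng = np.random.default_rng(seed)
    # A Johnson-Lindenstrauss-style projection approximately preserves inner
    # products, so the large entries of Q @ K.T tend to stay large here too.
    R = rng.standard_normal((d, rp_dim)) / np.sqrt(rp_dim)
    return (Q @ R) @ (K @ R).T

def sparse_attention(Q, K, V, keep_ratio=0.25):
    n, d = Q.shape
    approx = estimate_scores(Q, K)
    k = max(1, int(keep_ratio * n))
    keep = np.argsort(-approx, axis=1)[:, :k]   # per-row indices predicted to matter
    out = np.zeros_like(V)
    for i in range(n):
        idx = keep[i]
        s = (Q[i] @ K[idx].T) / np.sqrt(d)      # exact scores only where predicted large
        s = np.exp(s - s.max())                 # translation invariance: the shift is free
        out[i] = (s / s.sum()) @ V[idx]
    return out

# Toy usage: 128 tokens, 64-dimensional heads.
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
print(sparse_attention(Q, K, V).shape)          # (128, 64)
```

The heterogeneous-dataflow idea behind the HDPE, keeping whichever operand is reused most often stationary, can be summarized with a simple reuse count for a GEMM C = A·B. The toy function below is illustrative only and is not Ayaka's control logic.

```python
# Toy stationary-operand selection for a tiled GEMM C[M,N] = A[M,K] @ B[K,N]
# (illustrative only; the paper's HDPE makes this choice in hardware).
def choose_stationary(M, K, N):
    reuse = {"A-stationary": N,   # each A element contributes to N output columns
             "B-stationary": M,   # each B element contributes to M output rows
             "C-stationary": K}   # each C element accumulates K partial sums
    return max(reuse, key=reuse.get)

print(choose_stationary(M=4096, K=64, N=64))   # long-sequence linear layer -> B-stationary
print(choose_stationary(M=64, K=64, N=4096))   # wide output projection     -> A-stationary
```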
Source journal: IEEE Journal of Solid-state Circuits (Engineering: Electrical & Electronic)
CiteScore: 11.00
Self-citation rate: 20.40%
Articles published: 351
Review time: 3-6 weeks
About the journal: The IEEE Journal of Solid-State Circuits publishes papers each month in the broad area of solid-state circuits, with particular emphasis on transistor-level design of integrated circuits. It also covers topics such as circuit modeling, technology, systems design, layout, and testing that relate directly to IC design. Integrated circuits and VLSI are of principal interest; material related to discrete circuit design is seldom published. Experimental verification is strongly encouraged.
Latest articles in this journal:
Power-Limited Inference Performance Optimization Using a Software-Assisted Peak Current Regulation Scheme in a 5-nm AI SoC
Background Noise and Process-Variation-Tolerant Sub-Microwatt Keyword Spotting Hardware Featuring Spike-Domain Division-Based Energy Normalization
A 150-GHz Single-to-Differential LNA Adopting Wideband $G_\text{max}$-Cores Based on Single-Ended Compact Lumped L-C-L and Differential Coupled-Line Embedding Networks
A 28-nm 16-kb Aggregation and Combination Computing-in-Memory Macro With Dual-Level Sparsity Modulation and Sparse-Tracking ADCs for GCNs
A 100 $\times$ 80 Flash LiDAR Sensor With In-Pixel Zoom-Histogramming TDC and Self-Referenced Single-Slope ADC Based on Analog Counters