Ayaka: A Versatile Transformer Accelerator With Low-Rank Estimation and Heterogeneous Dataflow

IEEE Journal of Solid-state Circuits · IF 4.6 · JCR Q1 (Engineering, Electrical & Electronic) · CAS Tier 1 (Engineering & Technology) · Pub Date: 2024-03-14 · DOI: 10.1109/JSSC.2024.3397189
Yubin Qin;Yang Wang;Dazheng Deng;Xiaolong Yang;Zhiren Zhao;Yang Zhou;Yuanqi Fan;Jingchuan Wei;Tianbao Chen;Leibo Liu;Shaojun Wei;Yang Hu;Shouyi Yin
{"title":"Ayaka: A Versatile Transformer Accelerator With Low-Rank Estimation and Heterogeneous Dataflow","authors":"Yubin Qin;Yang Wang;Dazheng Deng;Xiaolong Yang;Zhiren Zhao;Yang Zhou;Yuanqi Fan;Jingchuan Wei;Tianbao Chen;Leibo Liu;Shaojun Wei;Yang Hu;Shouyi Yin","doi":"10.1109/JSSC.2024.3397189","DOIUrl":null,"url":null,"abstract":"Transformer model has demonstrated outstanding performance in the field of artificial intelligence. However, its remarkable performance comes at the cost of substantial computational complexity, posing limitations on deploying transformers from cloud to edge due to power and throughput constraints. There are two main challenges in designing a transformer accelerator for practical tasks. First, a transformer has inconsistent bottlenecks due to input length changes: for short inputs, such as using vision transformer (ViT) for ImageNet or bidirectional encoder representations from transformers (BERT) for general language understanding evaluation (GLUE), the linear layer of the model becomes the computational bottleneck. In contrast, for long inputs, such as high-resolution images or long-text tasks, attention computation becomes the bottleneck. Second, even for a given input length, different layers in the model exhibit various computational characteristics and workloads, such as matrix sizes and data reuse strategies. This article introduces Ayaka, a versatile transformer accelerator designed to address these issues. Ayaka uses a cross-layer sparse prediction approach based on random projection (RP), enabling simultaneous sparsification of attention computation and linear layers, thereby enhancing throughput for various bottlenecks for different input lengths. Furthermore, Ayaka optimizes the sparse attention computation by leveraging the input translation invariance of softmax. In addition, Ayaka features a heterogeneous dataflow processing element (HDPE) design, dynamically adjusting stationary matrix operands based on the current computation to maximize on-chip data reuse and reduce memory footprint. With these features, Ayaka is so far the first accelerator that accelerates the whole attention layer. Evaluation of 12 typical models and tasks shows that it achieves a peak energy efficiency of 49.7 TOPS/W, which is 1.20–\n<inline-formula> <tex-math>$258.9{\\times }$ </tex-math></inline-formula>\n higher than the state-of-the-art works.","PeriodicalId":13129,"journal":{"name":"IEEE Journal of Solid-state Circuits","volume":null,"pages":null},"PeriodicalIF":4.6000,"publicationDate":"2024-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal of Solid-state Circuits","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10530252/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

The transformer model has demonstrated outstanding performance in the field of artificial intelligence. However, this remarkable performance comes at the cost of substantial computational complexity, which limits the deployment of transformers from cloud to edge under power and throughput constraints. There are two main challenges in designing a transformer accelerator for practical tasks. First, a transformer's bottleneck shifts with input length: for short inputs, such as the vision transformer (ViT) on ImageNet or bidirectional encoder representations from transformers (BERT) on the general language understanding evaluation (GLUE) benchmark, the model's linear layers are the computational bottleneck; for long inputs, such as high-resolution images or long-text tasks, attention computation becomes the bottleneck. Second, even for a given input length, different layers in the model exhibit different computational characteristics and workloads, such as matrix sizes and data reuse strategies. This article introduces Ayaka, a versatile transformer accelerator designed to address these issues. Ayaka uses a cross-layer sparse prediction approach based on random projection (RP), enabling simultaneous sparsification of attention computation and linear layers and thereby improving throughput for the different bottlenecks that arise at different input lengths. Furthermore, Ayaka optimizes sparse attention computation by leveraging the input translation invariance of softmax. In addition, Ayaka features a heterogeneous dataflow processing element (HDPE) design that dynamically adjusts the stationary matrix operand according to the current computation, maximizing on-chip data reuse and reducing memory footprint. With these features, Ayaka is, to date, the first accelerator to accelerate the entire attention layer. Evaluation on 12 typical models and tasks shows that it achieves a peak energy efficiency of 49.7 TOPS/W, which is $1.20\times$–$258.9\times$ higher than state-of-the-art works.
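To make the softmax property concrete: the "input translation invariance of softmax" mentioned in the abstract means that adding the same constant to every score leaves the softmax output unchanged. The snippet below is a minimal illustration of this property only; it is not code from the paper.

```python
# Minimal demonstration of softmax translation invariance (illustrative only).
import numpy as np

def softmax(x):
    # Subtracting the maximum is safe precisely because of this invariance;
    # it only improves numerical range, never the result.
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([2.3, -1.7, 0.4, 5.1])
offset = 12.0
# softmax(x + c) == softmax(x): the common factor exp(c) cancels in the ratio.
assert np.allclose(softmax(scores), softmax(scores + offset))
```

The cross-layer sparse prediction based on random projection can be pictured in software roughly as follows. This is a sketch under assumed parameters: `estimate_scores`, `sparse_attention`, `rp_dim`, and `keep_ratio` are illustrative names, and the selection policy here (per-row top-k on the estimated scores) is a stand-in for whatever criterion the hardware actually applies.

```python
# A software sketch of random-projection (low-rank) score estimation followed by
# thresholded attention; rp_dim, keep_ratio, and all helper names are illustrative.
import numpy as np

def estimate_scores(Q, K, rp_dim=16, seed=0):
    """Cheap low-rank estimate of Q @ K.T via a shared random projection."""
    d = Q.shape[-1]
    rng = np.random.default_rng(seed)
    # A Johnson-Lindenstrauss-style projection approximately preserves inner
    # products, so the large entries of Q @ K.T tend to stay large here too.
    R = rng.standard_normal((d, rp_dim)) / np.sqrt(rp_dim)
    return (Q @ R) @ (K @ R).T

def sparse_attention(Q, K, V, keep_ratio=0.25):
    n, d = Q.shape
    approx = estimate_scores(Q, K)
    k = max(1, int(keep_ratio * n))
    keep = np.argsort(-approx, axis=1)[:, :k]   # per-row indices predicted to matter
    out = np.zeros_like(V)
    for i in range(n):
        idx = keep[i]
        s = (Q[i] @ K[idx].T) / np.sqrt(d)      # exact scores only where predicted large
        s = np.exp(s - s.max())                 # translation invariance: the shift is free
        out[i] = (s / s.sum()) @ V[idx]
    return out

# Toy usage: 128 tokens, 64-dimensional heads.
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
print(sparse_attention(Q, K, V).shape)          # (128, 64)
```

The heterogeneous-dataflow idea behind the HDPE, keeping whichever operand is reused most often stationary, can be summarized with a simple reuse count for a GEMM C = A·B. The toy function below is illustrative only and is not Ayaka's control logic.

```python
# Toy stationary-operand selection for a tiled GEMM C[M,N] = A[M,K] @ B[K,N]
# (illustrative only; the paper's HDPE makes this choice in hardware).
def choose_stationary(M, K, N):
    reuse = {"A-stationary": N,   # each A element contributes to N output columns
             "B-stationary": M,   # each B element contributes to M output rows
             "C-stationary": K}   # each C element accumulates K partial sums
    return max(reuse, key=reuse.get)

print(choose_stationary(M=4096, K=64, N=64))   # long-sequence linear layer -> B-stationary
print(choose_stationary(M=64, K=64, N=4096))   # wide output projection     -> A-stationary
```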
Source journal: IEEE Journal of Solid-state Circuits (Engineering: Electrical & Electronic)
CiteScore: 11.00
Self-citation rate: 20.40%
Articles published: 351
Review time: 3-6 weeks
About the journal: The IEEE Journal of Solid-State Circuits publishes papers each month in the broad area of solid-state circuits, with particular emphasis on transistor-level design of integrated circuits. It also covers topics such as circuit modeling, technology, systems design, layout, and testing that relate directly to IC design. Integrated circuits and VLSI are of principal interest; material related to discrete circuit design is seldom published. Experimental verification is strongly encouraged.
Latest articles in this journal:
Power-Limited Inference Performance Optimization Using a Software-Assisted Peak Current Regulation Scheme in a 5-nm AI SoC
Background Noise and Process-Variation-Tolerant Sub-Microwatt Keyword Spotting Hardware Featuring Spike-Domain Division-Based Energy Normalization
A 150-GHz Single-to-Differential LNA Adopting Wideband $G_\text{max}$-Cores Based on Single-Ended Compact Lumped L-C-L and Differential Coupled-Line Embedding Networks
A 28-nm 16-kb Aggregation and Combination Computing-in-Memory Macro With Dual-Level Sparsity Modulation and Sparse-Tracking ADCs for GCNs
A 100 $\times$ 80 Flash LiDAR Sensor With In-Pixel Zoom-Histogramming TDC and Self-Referenced Single-Slope ADC Based on Analog Counters