{"title":"CoDA: A Co-Design Framework for Versatile and Efficient Attention Accelerators","authors":"Wenjie Li;Aokun Hu;Ningyi Xu;Guanghui He","doi":"10.1109/TC.2024.3398488","DOIUrl":null,"url":null,"abstract":"As a primary component of Transformers, attention mechanism suffers from quadratic computational complexity. To achieve efficient implementations, its hardware accelerator designs have aroused great research interest. However, most existing accelerators only support a single type of application and a single type of attention, making it difficult to meet the demands of diverse application scenarios. Additionally, they mainly focus on the dynamic pruning of attention matrices, which requires the deployment of pre-processing units, thereby reducing overall hardware efficiency. This paper presents CoDA which is an algorithm, dataflow and architecture co-design framework for versatile and efficient attention accelerators. The designed accelerator supports both NLP and CV applications, and can be configured into the mode supporting low-rank attention or low-rank plus sparse attention. We apply algorithmic transformations to low-rank attention to significantly reduce computational complexity. To prevent an increase in storage overhead resulting from the proposed algorithmic transformations, we carefully design the dataflows and adopt a block-wise fashion. Down-scaling softmax is further supported by architecture and dataflow co-design. Moreover, we propose a softmax sharing strategy to reduce the area cost. Our experiment results demonstrate that the proposed accelerator outperforms the state-of-the-art designs in terms of throughput, area efficiency and energy efficiency.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 8","pages":"1924-1938"},"PeriodicalIF":3.6000,"publicationDate":"2024-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computers","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10527401/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0
Abstract
As a primary component of Transformers, the attention mechanism suffers from quadratic computational complexity. To achieve efficient implementations, hardware accelerator designs for attention have attracted great research interest. However, most existing accelerators support only a single type of application and a single type of attention, making it difficult to meet the demands of diverse application scenarios. Additionally, they mainly focus on the dynamic pruning of attention matrices, which requires the deployment of pre-processing units, thereby reducing overall hardware efficiency. This paper presents CoDA, an algorithm, dataflow, and architecture co-design framework for versatile and efficient attention accelerators. The designed accelerator supports both NLP and CV applications, and can be configured to support either low-rank attention or low-rank plus sparse attention. We apply algorithmic transformations to low-rank attention to significantly reduce computational complexity. To prevent an increase in storage overhead resulting from the proposed algorithmic transformations, we carefully design the dataflows and adopt a block-wise computation scheme. Down-scaling softmax is further supported through architecture and dataflow co-design. Moreover, we propose a softmax sharing strategy to reduce the area cost. Our experimental results demonstrate that the proposed accelerator outperforms state-of-the-art designs in terms of throughput, area efficiency, and energy efficiency.
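The abstract does not detail the paper's specific algorithmic transformations. As a rough illustration of why low-rank attention avoids the quadratic cost mentioned above, the NumPy sketch below contrasts standard softmax attention, which materializes an n x n score matrix, with a generic kernelized (low-rank) reordering that computes K^T V first. The feature map `phi`, sequence length `n`, and head dimension `d` are illustrative assumptions, not values or methods taken from the paper.

```python
# Minimal sketch (not CoDA's exact method): standard attention is O(n^2 * d),
# while reassociating the product via a feature map gives O(n * d^2).
import numpy as np

def standard_attention(Q, K, V):
    """Quadratic-cost softmax attention: materializes an n x n matrix."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # (n, d)

def low_rank_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Linear-cost approximation: softmax(QK^T) is replaced by
    phi(Q) phi(K)^T, so the product can be reassociated as
    phi(Q) (phi(K)^T V), never forming the n x n matrix."""
    Qp, Kp = phi(Q), phi(K)                            # (n, d) each
    KV = Kp.T @ V                                      # (d, d)
    out = Qp @ KV                                      # (n, d)
    norm = Qp @ Kp.sum(axis=0, keepdims=True).T        # (n, 1) normalizer
    return out / norm

n, d = 1024, 64                                        # illustrative sizes
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(standard_attention(Q, K, V).shape, low_rank_attention(Q, K, V).shape)
```

The reordered form trades the n x n attention map for a d x d intermediate, which is the general reason low-rank attention scales linearly in sequence length; the block-wise dataflow and softmax handling described in the abstract address how such a scheme is mapped to hardware.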
Journal Introduction:
The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.