CoDA: A Co-Design Framework for Versatile and Efficient Attention Accelerators

IEEE Transactions on Computers | Pub Date: 2024-03-09 | DOI: 10.1109/TC.2024.3398488 | IF 3.6 | Zone 2 (Computer Science) | JCR Q2 (Computer Science, Hardware & Architecture)
Wenjie Li;Aokun Hu;Ningyi Xu;Guanghui He
IEEE Transactions on Computers, vol. 73, no. 8, pp. 1924-1938. Journal article. Full text: https://ieeexplore.ieee.org/document/10527401/
Citations: 0

Abstract

As a primary component of Transformers, the attention mechanism suffers from quadratic computational complexity, and hardware accelerators that implement it efficiently have attracted considerable research interest. However, most existing accelerators support only a single type of application and a single type of attention, which makes it difficult to meet the demands of diverse application scenarios. They also focus mainly on dynamic pruning of attention matrices, which requires pre-processing units and thereby reduces overall hardware efficiency. This paper presents CoDA, an algorithm, dataflow, and architecture co-design framework for versatile and efficient attention accelerators. The designed accelerator supports both NLP and CV applications and can be configured to support either low-rank attention or low-rank plus sparse attention. We apply algorithmic transformations to low-rank attention to significantly reduce computational complexity. To keep these transformations from increasing storage overhead, we carefully design the dataflows and process the computation block by block. Architecture and dataflow co-design further supports down-scaling softmax. Moreover, we propose a softmax sharing strategy to reduce area cost. Our experimental results demonstrate that the proposed accelerator outperforms state-of-the-art designs in throughput, area efficiency, and energy efficiency.
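The abstract does not spell out which low-rank formulation or which algorithmic transformations CoDA uses, so the following is only a minimal sketch of the general idea, assuming a Linformer-style variant in which keys and values are projected from sequence length n down to a smaller length r. The names `proj`, `low_rank_attention`, and `attention_macs` are hypothetical and chosen for illustration; they do not come from the paper. The sketch shows why the score matrix shrinks from n x n to n x r and how the multiply-accumulate count changes.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def scaled_scores(q, k):
    """Compute Q K^T / sqrt(d), where d is the feature dimension."""
    scale = 1.0 / math.sqrt(len(q[0]))
    return [[s * scale for s in row] for row in matmul(q, transpose(k))]


def full_attention(q, k, v):
    """Standard attention: the score matrix is n x n."""
    return matmul(softmax_rows(scaled_scores(q, k)), v)


def low_rank_attention(q, k, v, proj):
    """Assumed Linformer-style variant: proj is an r x n matrix that
    compresses K and V along the sequence axis, so scores are n x r."""
    k_small = matmul(proj, k)  # r x d
    v_small = matmul(proj, v)  # r x d
    return matmul(softmax_rows(scaled_scores(q, k_small)), v_small)


def attention_macs(n, d, r=None):
    """Multiply-accumulate count for the matrix products only.

    Full attention: Q K^T and (scores) V cost n*n*d each.
    Low-rank: two projections (r*n*d each) plus two n*r*d products.
    """
    if r is None:
        return 2 * n * n * d
    return 4 * n * r * d
```

Under these assumptions, `attention_macs(1024, 64)` gives 134,217,728 while `attention_macs(1024, 64, 128)` gives 33,554,432, a 4x reduction that grows linearly with n when r stays fixed. Softmax cost, sparsity, and the block-wise dataflow described in the abstract are deliberately left out of this count.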
Source journal
IEEE Transactions on Computers (Engineering & Technology: Electrical and Electronic Engineering)
CiteScore: 6.60
Self-citation rate: 5.40%
Articles published: 199
Review time: 6.0 months
About the journal: The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.