A 3-D Multi-Precision Scalable Systolic FMA Architecture

IF 5.2 | Zone 1 (Engineering & Technology) | JCR Q1 (Engineering, Electrical & Electronic) | IEEE Transactions on Circuits and Systems I: Regular Papers | Pub Date: 2024-11-19 | DOI: 10.1109/TCSI.2024.3497724
Haotian Liu;Xicheng Lu;Xiaoyu Yu;Kai Li;Kaiyuan Yang;Haihang Xia;Sizhao Li;Tiantai Deng
IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 72, no. 1, pp. 265-276. Journal Article. https://ieeexplore.ieee.org/document/10758343/
Citations: 0

Abstract

Artificial Intelligence (AI) has become nearly the default approach in a wide range of applications, such as computer vision, chatbots, and natural language processing. These applications must process large-scale data with sufficient precision, typically in floating-point numbers, within a limited time window. A primary target for AI acceleration is matrix multiplication, which consists mainly of dot products computed through Multiply-Accumulate (MAC) operations. Current research employs the Fused Multiply-Add (FMA) operation, based on the IEEE-754 Floating-Point (FP) standard, to meet these requirements. However, it focuses more on simplifying the internal digital circuits of the Processing Elements (PEs) that perform FMA operations than on optimizing the FMA process specifically for MAC tasks. Current PE arrays often use a two-dimensional (2-D) systolic array design without specific optimization for MAC operations, so their parallelism is not fully utilized. These designs also lack reconfigurability and flexibility, leading to suboptimal performance on Field-Programmable Gate Arrays (FPGAs). Moreover, some designs adopt lower-precision computing in AI inference for higher performance, while some AI models still rely on high-precision computing to maintain accuracy; multi-precision computing is therefore commonly used in AI accelerators. To address these challenges, this paper proposes a novel Multi-Fused Multiply-Accumulate (MFMA) scheme and a corresponding three-dimensional (3-D) scalable systolic FP computing architecture. The MFMA scheme addresses the shortcomings of the classical FMA scheme: it optimizes FMA for MAC operations with the Fused Multiply-Accumulate (FMAC) operation, and it combines multi-precision and mixed-precision FP computing methods for higher accuracy and lower overflow error. The proposed architecture integrates two 2-D systolic arrays into the PE to form a 3-D systolic array, achieving higher parallelism and flexibility, and it can be customized to suit various FMAC operations. Compared with existing state-of-the-art FP architectures on FPGAs, the proposed architecture achieves 47%, 10%, and 159% energy efficiency improvements in FP32, FP16, and INT8 operations, respectively. Under efficiency saturation conditions, it achieves energy efficiency improvements of 105%, 54%, and 262%, outperforming the existing state-of-the-art design.
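The abstract's point about mixed-precision computing and overflow error can be illustrated with a small software sketch. The example below is not the paper's MFMA scheme or its hardware datapath; it only contrasts a dot product that keeps products and the accumulator in FP16 with one that takes FP16 inputs but multiplies and accumulates in FP32. The function names (`round_to`, `dot_fp16_accumulate`, `dot_mixed_precision`) are hypothetical, and rounding is emulated with Python's `struct` module, which is adequate for illustration but is not a bit-exact model of any particular FMA unit.

```python
import math
import struct


def round_to(fmt, value):
    """Round a Python float to IEEE-754 binary16 ('e') or binary32 ('f').

    Assumption for this sketch: a finite value beyond the format's range
    is modeled as overflow to infinity (struct raises OverflowError there).
    """
    try:
        return struct.unpack(fmt, struct.pack(fmt, value))[0]
    except OverflowError:
        return math.copysign(math.inf, value)


def dot_fp16_accumulate(a, b):
    """MAC loop with FP16 inputs, FP16 products, and an FP16 accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        x, y = round_to("e", x), round_to("e", y)
        product = round_to("e", x * y)       # rounded after the multiply
        acc = round_to("e", acc + product)   # rounded again after the add
    return acc


def dot_mixed_precision(a, b):
    """MAC loop with FP16 inputs but FP32 products and accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        x, y = round_to("e", x), round_to("e", y)
        product = round_to("f", x * y)       # FP16 x FP16 kept in FP32
        acc = round_to("f", acc + product)   # accumulate in FP32
    return acc


# 300 products of 16 * 16 = 256 sum to 76800, above the FP16 maximum of 65504.
a = [16.0] * 300
b = [16.0] * 300
print(dot_fp16_accumulate(a, b))   # inf
print(dot_mixed_precision(a, b))   # 76800.0
```

With these inputs the FP16 accumulator overflows to infinity partway through the sum, while the FP32 accumulator holds the exact result. That is one reason accelerators commonly pair low-precision inputs with a wider accumulator.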
Source journal
IEEE Transactions on Circuits and Systems I: Regular Papers (Engineering & Technology - Engineering: Electrical & Electronic)
CiteScore: 9.80
Self-citation rate: 11.80%
Articles published: 441
Review time: 2 months
Journal description: TCAS I publishes regular papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Included is the whole spectrum from basic scientific theory to industrial applications. The field of interest covered includes:
- Circuits: Analog, Digital and Mixed Signal Circuits and Systems
- Nonlinear Circuits and Systems, Integrated Sensors, MEMS and Systems on Chip, Nanoscale Circuits and Systems, Optoelectronic Circuits and Systems, Power Electronics and Systems
- Software for Analog-and-Logic Circuits and Systems
- Control aspects of Circuits and Systems