A 3-D Multi-Precision Scalable Systolic FMA Architecture

IF 5.2; Zone 1, Engineering & Technology; Q1, ENGINEERING, ELECTRICAL & ELECTRONIC; IEEE Transactions on Circuits and Systems I: Regular Papers; Pub Date: 2024-11-19; DOI: 10.1109/TCSI.2024.3497724

Haotian Liu;Xicheng Lu;Xiaoyu Yu;Kai Li;Kaiyuan Yang;Haihang Xia;Sizhao Li;Tiantai Deng

IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 72, no. 1, pp. 265-276. Full text: https://ieeexplore.ieee.org/document/10758343/

Citations: 0

Abstract

Artificial Intelligence (AI) has become close to the default approach in a wide range of applications, such as computer vision, chatbots, and natural language processing. These AI-based applications must process large-scale data with sufficient precision, typically in floating-point numbers, within a limited time window. A primary target for AI acceleration is matrix multiplication, which mainly consists of dot products computed through Multiply-Accumulate (MAC) operations. Current research employs the Fused Multiply-Add (FMA) operation, based on the IEEE-754 Floating-Point (FP) standard, to meet these requirements. However, it focuses more on simplifying the internal digital circuits of the Processing Elements (PEs) that perform FMA operations than on optimizing the FMA process specifically for MAC tasks. Current PE arrays often use a two-dimensional (2-D) systolic array design without specific optimization for MAC operations, so their parallelism is not fully utilized. Additionally, these designs lack reconfigurability and flexibility, leading to suboptimal performance on Field-Programmable Gate Arrays (FPGAs). Moreover, some designs adopt lower-precision computing in AI inference for higher performance, yet some AI models still rely on high-precision computing to maintain accuracy. Thus, multi-precision computing is commonly used in AI accelerators. To address these challenges, this paper proposes a novel Multi-Fused Multiply-Accumulate (MFMA) scheme and a corresponding three-dimensional (3-D) scalable systolic FP computing architecture. The MFMA scheme addresses the shortcomings of the classical FMA scheme: it optimizes FMA for MAC operations with the Fused Multiply-Accumulate (FMAC) operation, and it combines multi-precision and mixed-precision FP computing methods for higher accuracy and lower overflow error. The proposed architecture integrates two 2-D systolic arrays into the PE to form a 3-D systolic array, achieving higher parallelism and flexibility. The proposed scalable architecture can be customized to suit various FMAC operations. Compared with existing state-of-the-art FP architectures on FPGAs, the proposed architecture improves energy efficiency by 47%, 10%, and 159% in FP32, FP16, and INT8 operations, respectively. Under efficiency saturation conditions, the improvements are 105%, 54%, and 262%, outperforming the existing state-of-the-art design.
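The abstract's claim that mixed-precision accumulation gives "higher accuracy and lower overflow error" can be illustrated with a small software sketch. This is not the paper's MFMA circuit or its systolic dataflow. It is a minimal emulation, assuming FP16 inputs and comparing an FP16 accumulator (rounding after every multiply and every add) against a fused step that rounds once per MAC into an FP32 accumulator. The function names `to_fp16`, `to_fp32`, `dot_fp16_accumulate`, and `dot_fused_fp32_accumulate` are hypothetical and chosen only for this example.

```python
import struct


def _round_trip(fmt: str, x: float) -> float:
    """Round a Python float to the format given by a struct code ('e' or 'f')."""
    try:
        return struct.unpack(fmt, struct.pack(fmt, x))[0]
    except (OverflowError, struct.error):
        # struct refuses finite values beyond the target range;
        # model that case as IEEE-754 overflow to infinity.
        return float("inf") if x > 0 else float("-inf")


def to_fp16(x: float) -> float:
    """Nearest IEEE-754 binary16 value (struct format 'e')."""
    return _round_trip("<e", x)


def to_fp32(x: float) -> float:
    """Nearest IEEE-754 binary32 value (struct format 'f')."""
    return _round_trip("<f", x)


def dot_fp16_accumulate(a, b):
    """Unfused MAC: the product and the running sum are both rounded to FP16."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp16(to_fp16(x) * to_fp16(y))  # rounding 1: product
        acc = to_fp16(acc + prod)                # rounding 2: accumulate
    return acc


def dot_fused_fp32_accumulate(a, b):
    """Assumed mixed-precision fused MAC: FP16 inputs, FP32 accumulator.

    The product of two FP16 values is exact in a Python float (double),
    so each step rounds only once, when the sum is stored as FP32.
    For extreme exponent gaps the double sum itself may round first;
    the inputs used below stay well inside the exact range.
    """
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp32(acc + to_fp16(x) * to_fp16(y))
    return acc


if __name__ == "__main__":
    ones = [1.0] * 4096
    # Accuracy: FP16 has an 11-bit significand, so 2048 + 1 rounds back to 2048.
    print(dot_fp16_accumulate(ones, ones))        # 2048.0
    print(dot_fused_fp32_accumulate(ones, ones))  # 4096.0

    big = [256.0] * 4
    # Overflow: 256 * 256 = 65536 exceeds the largest finite FP16 value (65504).
    print(dot_fp16_accumulate(big, big))          # inf
    print(dot_fused_fp32_accumulate(big, big))    # 262144.0
```

In this sketch the FP16 accumulator stalls at 2048 and overflows on the first product of the second test, while the wider accumulator returns the exact results. That is the kind of accuracy and overflow behavior a mixed-precision FMAC is meant to improve, although the paper's hardware scheme may differ in its details.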

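Source journal

IEEE Transactions on Circuits and Systems I: Regular Papers (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 9.80

Self-citation rate: 11.80%

Articles published: 441

Review time: 2 months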

Journal description: TCAS I publishes regular papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Included is the whole spectrum from basic scientific theory to industrial applications. The field of interest covered includes: Circuits (Analog, Digital and Mixed Signal Circuits and Systems; Nonlinear Circuits and Systems; Integrated Sensors, MEMS and Systems on Chip; Nanoscale Circuits and Systems; Optoelectronic Circuits and Systems; Power Electronics and Systems); Software for Analog-and-Logic Circuits and Systems; Control aspects of Circuits and Systems.