A 3-D Multi-Precision Scalable Systolic FMA Architecture
Haotian Liu; Xicheng Lu; Xiaoyu Yu; Kai Li; Kaiyuan Yang; Haihang Xia; Sizhao Li; Tiantai Deng
IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 72, no. 1, pp. 265-276 (published 2024-11-19)
DOI: 10.1109/TCSI.2024.3497724
https://ieeexplore.ieee.org/document/10758343/
Citations: 0
Abstract
Artificial Intelligence (AI) has become nearly the default approach in a wide range of applications, such as computer vision, chatbots, and natural language processing. These AI-based applications must process large-scale data with sufficient precision, typically in floating-point numbers, within a limited time window. A primary target for AI acceleration is matrix multiplication, which consists mainly of dot products computed through Multiply-Accumulate (MAC) operations. Current research employs the Fused Multiply-Add (FMA) operation, based on the IEEE-754 Floating-Point (FP) standard, to meet these requirements. However, it focuses more on simplifying the internal digital circuits of the Processing Elements (PEs) that perform FMA operations than on optimizing the FMA process specifically for MAC tasks. Current PE arrays often use a two-dimensional (2-D) systolic array design without specific optimization for MAC operations, so their parallelism is not fully utilized. Additionally, these designs lack reconfigurability and flexibility, leading to suboptimal performance on Field-Programmable Gate Arrays (FPGAs). Moreover, some designs adopt lower-precision computing in AI inference for higher performance, whereas some AI models still rely on high-precision computing to maintain accuracy. Thus, multi-precision computing is commonly used in AI accelerators. To address these challenges, this paper proposes a novel Multi-Fused Multiply-Accumulate (MFMA) scheme and a corresponding three-dimensional (3-D) scalable systolic FP computing architecture. The MFMA scheme addresses the shortcomings of the classical FMA scheme: it optimizes FMA for MAC operations with the Fused Multiply-Accumulate (FMAC) operation, and it combines multi-precision and mixed-precision FP computing methods for higher accuracy and lower overflow error. The proposed architecture integrates two 2-D systolic arrays into the PE to form a 3-D systolic array, achieving higher parallelism and flexibility. The proposed scalable architecture can be customized to suit various FMAC operations. Compared with existing state-of-the-art FP architectures on FPGAs, our proposed architecture improves energy efficiency by 47%, 10%, and 159% in FP32, FP16, and INT8 operations, respectively. Furthermore, under efficiency saturation conditions, it improves energy efficiency by 105%, 54%, and 262%, outperforming the existing state-of-the-art design.
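The motivation for mixed-precision accumulation mentioned in the abstract can be illustrated with a small software model. The sketch below is not the paper's MFMA scheme or its hardware; it is a minimal illustration, assuming FP16 inputs and comparing an FP16 accumulator with an FP32 accumulator in a dot product. The function names (`round_to`, `mac_dot_fp16`, `fmac_dot_mixed`) are hypothetical, and rounding is emulated with Python's standard `struct` module.

```python
import math
import struct


def round_to(fmt: str, x: float) -> float:
    """Round x to the IEEE-754 format named by a struct code ('e' = FP16, 'f' = FP32)."""
    try:
        return struct.unpack(fmt, struct.pack(fmt, x))[0]
    except OverflowError:
        # struct refuses finite values that are out of range; model that as infinity.
        return math.copysign(math.inf, x)


def mac_dot_fp16(a, b):
    """Dot product with a separate multiply and add, both rounded to FP16."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = round_to("e", round_to("e", x) * round_to("e", y))
        acc = round_to("e", acc + prod)
    return acc


def fmac_dot_mixed(a, b):
    """Dot product with FP16 inputs and an FP32 accumulator.

    The product of two FP16 values is exact in a Python float (double), so only
    the accumulated sum is rounded. This approximates a fused operation; rare
    double-rounding effects in the sum are ignored in this sketch.
    """
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_to("f", acc + round_to("e", x) * round_to("e", y))
    return acc


ones = [1.0] * 4096
print(mac_dot_fp16(ones, ones))    # 2048.0: the FP16 accumulator stalls
print(fmac_dot_mixed(ones, ones))  # 4096.0: the FP32 accumulator is exact here
```

With 4096 unit products, the FP16 accumulator stops growing at 2048 because 2049 is not representable in FP16 and the sum rounds back to 2048, while the wider accumulator returns the exact result. This is one reason accelerators often pair low-precision operands with a higher-precision accumulator.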
About the journal:
TCAS I publishes regular papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Included is the whole spectrum from basic scientific theory to industrial applications. The field of interest covered includes:
- Circuits: Analog, Digital and Mixed Signal Circuits and Systems
- Nonlinear Circuits and Systems, Integrated Sensors, MEMS and Systems on Chip, Nanoscale Circuits and Systems, Optoelectronic Circuits and Systems, Power Electronics and Systems
- Software for Analog-and-Logic Circuits and Systems
- Control aspects of Circuits and Systems