A Multiple-Precision Multiply and Accumulation Design with Multiply-Add Merged Strategy for AI Accelerating

2021 26th Asia and South Pacific Design Automation Conference (ASP-DAC), Pub Date: 2021-01-18, DOI: 10.1145/3394885.3431531

Song Zhang, Jiangyuan Gu, S. Yin, Leibo Liu, Shaojun Wei


Citations: 1

Abstract

Multiply-and-accumulate (MAC) operations are fundamental to domain-specific accelerators for AI applications ranging from filtering to convolutional neural networks (CNNs). This paper proposes an energy-efficient MAC design that supports a wide range of bit-widths for both signed and unsigned operands. First, based on the classic Booth algorithm, we propose a multiply-add merged strategy. The design not only supports both signed and unsigned operations but also eliminates the delay, area and power overheads of the adder in traditional MAC units. Then, using this fusion strategy, we propose a multiply-add merged design method for flexible bit-width adjustment. In addition, treating the addend as a partial product makes the operation easy to pipeline with balanced stages. The combined improvement in delay, area and power can meet the varied requirements of different applications and hardware designs. Using the proposed method, we synthesized MAC units for several operation modes with a SMIC 40-nm library. Comparison with other MAC designs shows that the proposed method improves power-delay product (PDP) and area-delay product (ADP) by up to 24.1% and 28.2%, respectively, for fixed bit-width MAC designs, and by 28.43% to 38.16% for bit-width adjustable ones. When pipelined, the design decreases latency by more than 13%. The improvements in power and area are up to 8.0% and 8.1%, respectively.
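The idea of treating the addend as one more partial product can be illustrated with a small behavioral model. The sketch below is not the authors' circuit: it assumes a standard radix-4 Booth recoding, handles unsigned multipliers by zero-extending them before recoding, and uses a plain Python `sum` as a stand-in for the compressor tree and final adder. The function names `booth_radix4_digits` and `merged_mac` are hypothetical and chosen only for this illustration.

```python
def booth_radix4_digits(b: int, width: int, signed: bool) -> list[int]:
    """Recode multiplier b into radix-4 Booth digits, each in {-2, -1, 0, 1, 2}.

    Assumption: unsigned multipliers are zero-extended by one bit so the
    same signed recoding applies; the width is then rounded up to even.
    """
    ext = width if signed else width + 1
    ext += ext % 2                      # radix-4 consumes bits in pairs
    bits = b & ((1 << ext) - 1)         # two's-complement bit pattern of b
    digits = []
    prev = 0                            # implicit bit b[-1] = 0
    for i in range(0, ext, 2):
        lo = (bits >> i) & 1
        hi = (bits >> (i + 1)) & 1
        digits.append(prev + lo - 2 * hi)
        prev = hi
    return digits


def merged_mac(a: int, b: int, c: int, width: int, signed: bool) -> int:
    """Compute a * b + c with the addend c added as an extra partial-product row."""
    rows = [d * a * 4**i
            for i, d in enumerate(booth_radix4_digits(b, width, signed))]
    rows.append(c)                      # the addend joins the partial products
    return sum(rows)                    # models compressor tree + single final adder


print(booth_radix4_digits(-3, width=4, signed=True))   # digits for b = -3
print(merged_mac(7, -3, 5, width=4, signed=True))      # 7 * -3 + 5 = -16
```

In this model the accumulation costs one extra row in the reduction step rather than a separate adder after the multiplier, which is the effect the abstract attributes to the merged strategy. How the actual design handles sign extension, bit-width reconfiguration and pipelining is not specified in the abstract and is not modeled here.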


