An All-Digital Compute-In-Memory FPGA Architecture for Deep Learning Acceleration

ACM Transactions on Reconfigurable Technology and Systems | Pub Date: 2024-01-15 | DOI: 10.1145/3640469 | IF 2.8 | Zone 4 (Computer Science) | Q2 (Computer Science, Hardware & Architecture)

Yonggen Li, Xin Li, Haibin Shen, Jicong Fan, Yanfeng Xu, Kejie Huang


Citations: 0

Abstract

The Field Programmable Gate Array (FPGA) is a versatile, programmable hardware platform, which makes it a promising candidate for accelerating Deep Neural Networks (DNNs). However, the computing energy efficiency of FPGAs is low because interconnect data movement dominates energy consumption. In this paper, we propose an all-digital Compute-In-Memory (CIM) FPGA architecture for deep learning acceleration. We also present a bit-serial computing circuit for the digital CIM core that accelerates vector-matrix multiplication (VMM) operations. A Network-CIM-Deployer (NCIMD) is developed to support automatic deployment and mapping of DNNs; it provides a user-friendly API for DNN models in Caffe format. In addition, we introduce a Weight-Stationary (WS) dataflow and describe how a single layer of the network is mapped to the CIM array in the architecture. We evaluate the proposed FPGA architecture in the field of Deep Learning (DL), as well as in non-DL fields, using different architectural layouts and mapping strategies, and compare the results with a conventional FPGA architecture. The experimental results show that, compared with the conventional FPGA architecture, the proposed CIM FPGA architecture improves energy efficiency by up to 16.1× and reduces latency by up to 40%.
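
To make the bit-serial VMM idea in the abstract concrete, the following is a minimal software sketch in Python. It is not the authors' circuit, which the abstract does not detail; the function name `bit_serial_vmm`, the `n_bits` parameter, and the restriction to unsigned inputs are assumptions made for illustration. The weight matrix stays fixed (in the spirit of a weight-stationary dataflow) while the input vector is streamed one bit-plane at a time, and each partial sum is weighted by its bit position with a shift-and-add.

```python
def bit_serial_vmm(x, W, n_bits=8):
    """Compute y = x @ W by streaming x one bit-plane at a time.

    x: list of unsigned ints, each < 2**n_bits (assumption: unsigned inputs)
    W: list of len(x) rows of signed ints, held fixed ("stationary")
    """
    rows, cols = len(W), len(W[0])
    assert len(x) == rows, "x must have one entry per row of W"
    assert all(0 <= v < (1 << n_bits) for v in x), "input out of range"

    acc = [0] * cols
    for b in range(n_bits):                 # least significant bit first
        for j in range(cols):
            partial = 0
            for i in range(rows):
                if (x[i] >> b) & 1:         # the 1-bit input gates the weight
                    partial += W[i][j]
            acc[j] += partial << b          # weight the bit-plane by 2**b
    return acc


# Example: [3, 5] times [[1, -2], [4, 0]] gives [3*1 + 5*4, 3*(-2) + 5*0]
print(bit_serial_vmm([3, 5], [[1, -2], [4, 0]], n_bits=4))  # prints [23, -6]
```

In this sketch each bit-plane needs only single-bit gating and additions, so no multi-bit multiplier is required; the trade-off is `n_bits` passes over the weights per input vector.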


Source journal: ACM Transactions on Reconfigurable Technology and Systems (Computer Science, Hardware & Architecture)

CiteScore: 4.90

Self-citation rate: 8.70%

Articles published: (not given)

Review time: >12 weeks

Journal description: TRETS is the top journal focusing on research in, on, and with reconfigurable systems and on their underlying technology. The scope, rationale, and coverage by other journals are often limited to particular aspects of reconfigurable technology or reconfigurable systems. TRETS is a journal that covers reconfigurability in its own right. Topics that would be appropriate for TRETS would include all levels of reconfigurable system abstractions and all aspects of reconfigurable technology including platforms, programming environments and application successes that support these systems for computing or other applications.

-The board and systems architectures of a reconfigurable platform.
-Programming environments of reconfigurable systems, especially those designed for use with reconfigurable systems that will lead to increased programmer productivity.
-Languages and compilers for reconfigurable systems.
-Logic synthesis and related tools, as they relate to reconfigurable systems.
-Applications on which success can be demonstrated.

The underlying technology from which reconfigurable systems are developed. (Currently this technology is that of FPGAs, but research on the nature and use of follow-on technologies is appropriate for TRETS.)

In considering whether a paper is suitable for TRETS, the foremost question should be whether reconfigurability has been essential to success. Topics such as architecture, programming languages, compilers and environments, logic synthesis, and high-performance applications are all suitable if the context is appropriate. For example, an architecture for an embedded application that happens to use FPGAs is not necessarily suitable for TRETS, but an architecture using FPGAs for which the reconfigurability of the FPGAs is an inherent part of the specifications (perhaps due to a need for re-use on multiple applications) would be appropriate for TRETS.