Layer-Wise Mixed-Modes CNN Processing Architecture With Double-Stationary Dataflow and Dimension-Reshape Strategy

IEEE Transactions on Circuits and Systems I: Regular Papers | IF: 5.2 | Region 1 (Engineering & Technology) | JCR: Q1 (Engineering, Electrical & Electronic) | Pub Date: 2024-08-01 | DOI: 10.1109/TCSI.2024.3434706

Bo Liu;Xinxiang Huang;Yang Zhang;Guang Yang;Han Yan;Chen Zhang;Zejv Li;Yuanhao Wang;Hao Cai

Full text: https://ieeexplore.ieee.org/document/10620270/

Citations: 0


Layer-Wise Mixed-Modes CNN Processing Architecture With Double-Stationary Dataflow and Dimension-Reshape Strategy

With the development of convolutional neural networks (CNNs) across various domains, the growth in network complexity and computational load has become a research focus in neural network deployment. The key challenge in current research on neural network accelerators is balancing computational accuracy against energy efficiency. This paper proposes a software-hardware co-design that strikes this balance for CNN edge applications. On the hardware side, a 3-dimensional tensor engine (3D-TE), built from reconfigurable Tensor Processing Units (TPUs), is introduced for efficient convolution computation. We optimize the CNN dataflow on the 3D-TE using a dimension-reshaping method for feature-map rearrangement and double-stationary dataflow scheduling to reduce memory access. The TPU architecture adopts a configurable approximate multiplier designed with Boolean Matrix Factorization (BMF) based logic synthesis. With its configurable precision, the proposed 3D-TE enables the TPUs to dynamically adapt the bitwidth of features and weights to varying precision requirements. On the software side, a Hessian-guided layer precision mapping is adopted to reduce unnecessary computational overhead, and a progressive re-training approach is proposed to enable a better approximation configuration and higher power reduction. Fabricated in 28-nm CMOS, this work achieves an optimized energy efficiency of 14.9 TOPS/W and 12.1 TOPS/W for ResNet56 and MobileNetV2, respectively, at a 0.6 V supply voltage and 150 MHz clock frequency, representing an improvement of $1.33\times \sim 8.28\times$ over state-of-the-art works.
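To put the reported efficiency figures in more concrete terms, the short sketch below converts TOPS/W into energy per operation. It relies only on the unit identity 1 TOPS/W = 10^12 operations per joule. The function name and dictionary are illustrative, not from the paper, and the abstract does not state how an "operation" is counted (for example, whether a multiply-accumulate counts as one or two operations, or at which bitwidth), so the results should be read under that caveat.

```python
def tops_per_watt_to_fj_per_op(tops_per_watt: float) -> float:
    """Convert energy efficiency in TOPS/W to energy per operation in femtojoules.

    Hypothetical helper for illustration; not part of the paper.
    """
    ops_per_joule = tops_per_watt * 1e12  # 1 TOPS/W = 1e12 operations per joule
    joules_per_op = 1.0 / ops_per_joule
    return joules_per_op * 1e15           # 1 J = 1e15 fJ


# Efficiency figures quoted in the abstract (0.6 V supply, 150 MHz clock).
reported_tops_per_watt = {"ResNet56": 14.9, "MobileNetV2": 12.1}

energy_fj = {
    net: tops_per_watt_to_fj_per_op(eff)
    for net, eff in reported_tops_per_watt.items()
}

for net, fj in energy_fj.items():
    print(f"{net}: {fj:.1f} fJ per operation")
```

Under these assumptions, the two networks come out at roughly 67 fJ and 83 fJ per operation, which makes the difference between the two workloads easier to compare than the raw TOPS/W values.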


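Source journal

IEEE Transactions on Circuits and Systems I: Regular Papers (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 9.80

Self-citation rate: 11.80%

Articles published: 441

Review time: 2 months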

Journal description: TCAS I publishes regular papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Included is the whole spectrum from basic scientific theory to industrial applications. The field of interest covered includes:

- Circuits: Analog, Digital and Mixed Signal Circuits and Systems
- Nonlinear Circuits and Systems, Integrated Sensors, MEMS and Systems on Chip, Nanoscale Circuits and Systems, Optoelectronic Circuits and Systems, Power Electronics and Systems
- Software for Analog-and-Logic Circuits and Systems
- Control aspects of Circuits and Systems
