{"title":"Sparse convolutional neural network acceleration with lossless input feature map compression for resource-constrained systems","authors":"Jisu Kwon, Joonho Kong, Arslan Munir","doi":"10.1049/cdt2.12038","DOIUrl":null,"url":null,"abstract":"<p>Many recent research efforts have exploited data sparsity for the acceleration of convolutional neural network (CNN) inferences. However, the effects of data transfer between main memory and the CNN accelerator have been largely overlooked. In this work, the authors propose a CNN acceleration technique that leverages hardware/software co-design and exploits the sparsity in input feature maps (IFMs). On the software side, the authors' technique employs a novel lossless compression scheme for IFMs, which are sent to the hardware accelerator via direct memory access. On the hardware side, the authors' technique uses a CNN inference accelerator that performs convolutional layer operations with their compressed data format. With several design optimization techniques, the authors have implemented their technique in a field-programmable gate array (FPGA) system-on-chip platform and evaluated their technique for six different convolutional layers in SqueezeNet. Results reveal that the authors' technique improves the performance by 1.1×–22.6× while reducing energy consumption by 47.7%–97.4% as compared to the CPU-based execution. Furthermore, results indicate that the IFM size and transfer latency are reduced by 34.0%–85.2% and 4.4%–75.7%, respectively, compared to the case without data compression. In addition, the authors' hardware accelerator shows better performance per hardware resource with less than or comparable power consumption to the state-of-the-art FPGA-based designs.</p>","PeriodicalId":50383,"journal":{"name":"IET Computers and Digital Techniques","volume":"16 1","pages":"29-43"},"PeriodicalIF":1.1000,"publicationDate":"2021-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cdt2.12038","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IET Computers and Digital Techniques","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1049/cdt2.12038","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 5
Abstract
Many recent research efforts have exploited data sparsity to accelerate convolutional neural network (CNN) inference. However, the effects of data transfer between main memory and the CNN accelerator have been largely overlooked. In this work, the authors propose a CNN acceleration technique that leverages hardware/software co-design and exploits the sparsity in input feature maps (IFMs). On the software side, the technique employs a novel lossless compression scheme for IFMs, which are sent to the hardware accelerator via direct memory access. On the hardware side, the technique uses a CNN inference accelerator that performs convolutional layer operations directly on the compressed data format. With several design optimizations, the authors implemented the technique on a field-programmable gate array (FPGA) system-on-chip platform and evaluated it on six different convolutional layers of SqueezeNet. Results reveal that the technique improves performance by 1.1×–22.6× while reducing energy consumption by 47.7%–97.4% compared to CPU-based execution. Furthermore, the IFM size and transfer latency are reduced by 34.0%–85.2% and 4.4%–75.7%, respectively, compared to the case without data compression. In addition, the authors' hardware accelerator shows better performance per hardware resource with less-than or comparable power consumption relative to state-of-the-art FPGA-based designs.
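The abstract does not disclose the authors' exact compression format. One common lossless zero-skipping encoding for sparse post-ReLU activations pairs a 1-bit occupancy bitmap with a packed list of the nonzero values; the C sketch below is a minimal illustration of that general idea only, not the paper's implementation. The function names, tile size, and 8-bit activation width are all illustrative assumptions.

```c
/*
 * Hypothetical sketch of a bitmap-based lossless encoding for a sparse
 * input feature map (IFM): a 1-bit occupancy mask per element plus a
 * packed list of nonzero values. Illustrative only; the paper's actual
 * compression scheme is not specified in the abstract.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Compress n 8-bit activations into `mask` (1 bit per element,
 * 1 = nonzero) and `values` (packed nonzero activations).
 * Returns the number of nonzero values written. */
static size_t ifm_compress(const uint8_t *ifm, size_t n,
                           uint8_t *mask, uint8_t *values)
{
    size_t nnz = 0;
    memset(mask, 0, (n + 7) / 8);
    for (size_t i = 0; i < n; i++) {
        if (ifm[i] != 0) {
            mask[i / 8] |= (uint8_t)(1u << (i % 8));
            values[nnz++] = ifm[i];
        }
    }
    return nnz;
}

/* Decompress back into `out`. A hardware accelerator would instead
 * consume the mask/values stream directly, using the mask to skip
 * multiplications by zero. */
static void ifm_decompress(const uint8_t *mask, const uint8_t *values,
                           size_t n, uint8_t *out)
{
    size_t v = 0;
    for (size_t i = 0; i < n; i++)
        out[i] = ((mask[i / 8] >> (i % 8)) & 1u) ? values[v++] : 0;
}

int main(void)
{
    /* A toy 4x4 IFM tile with many zeros, as is typical after ReLU. */
    uint8_t ifm[16] = { 0, 5, 0, 0,  0, 0, 9, 0,
                        3, 0, 0, 0,  0, 0, 0, 7 };
    uint8_t mask[2], values[16], restored[16];

    size_t nnz = ifm_compress(ifm, 16, mask, values);
    ifm_decompress(mask, values, 16, restored);

    /* 16 input bytes shrink to a 2-byte mask plus nnz value bytes. */
    printf("nonzeros: %zu, compressed bytes: %zu\n", nnz, 2 + nnz);
    printf("lossless: %s\n",
           memcmp(ifm, restored, 16) == 0 ? "yes" : "no");
    return 0;
}
```

Under this kind of scheme, the compressed size is n/8 mask bytes plus one byte per nonzero value, so the payload sent over DMA shrinks roughly in proportion to the IFM's sparsity, which is consistent with the 34.0%–85.2% size reductions the abstract reports.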
Journal Introduction
IET Computers & Digital Techniques publishes technical papers describing recent research and development work in all aspects of digital system-on-chip design and test of electronic and embedded systems, including the development of design automation tools (methodologies, algorithms and architectures). Papers addressing the problems associated with the scaling down of CMOS technology are particularly welcome. The journal is aimed at researchers, engineers and educators in the fields of computer and digital systems design and test.
The key subject areas of interest are:
Design Methods and Tools: CAD/EDA tools, hardware description languages, high-level and architectural synthesis, hardware/software co-design, platform-based design, 3D stacking and circuit design, system on-chip architectures and IP cores, embedded systems, logic synthesis, low-power design and power optimisation.
Simulation, Test and Validation: electrical and timing simulation, simulation based verification, hardware/software co-simulation and validation, mixed-domain technology modelling and simulation, post-silicon validation, power analysis and estimation, interconnect modelling and signal integrity analysis, hardware trust and security, design-for-testability, embedded core testing, system-on-chip testing, on-line testing, automatic test generation and delay testing, low-power testing, reliability, fault modelling and fault tolerance.
Processor and System Architectures: many-core systems, general-purpose and application specific processors, computational arithmetic for DSP applications, arithmetic and logic units, cache memories, memory management, co-processors and accelerators, systems and networks on chip, embedded cores, platforms, multiprocessors, distributed systems, communication protocols and low-power issues.
Configurable Computing: embedded cores, FPGAs, rapid prototyping, adaptive computing, evolvable and statically and dynamically reconfigurable and reprogrammable systems, reconfigurable hardware.
Design for variability, power and aging: design methods for variability, power and aging aware design, memories, FPGAs, IP components, 3D stacking, energy harvesting.
Case Studies: emerging applications, applications in industrial designs, and design frameworks.