BitSET: Bit-Serial Early Termination for Computation Reduction in Convolutional Neural Networks

ACM Transactions on Embedded Computing Systems · Published: 2023-09-09 · DOI: 10.1145/3609093 · Impact Factor 2.8 · CAS Tier 3 (Computer Science) · JCR Q2 (Computer Science, Hardware & Architecture)
Yunjie Pan, Jiecao Yu, Andrew Lukefahr, Reetuparna Das, Scott Mahlke
{"title":"BitSET: Bit-Serial Early Termination for Computation Reduction in Convolutional Neural Networks","authors":"Yunjie Pan, Jiecao Yu, Andrew Lukefahr, Reetuparna Das, Scott Mahlke","doi":"10.1145/3609093","DOIUrl":null,"url":null,"abstract":"Convolutional Neural Networks (CNNs) have demonstrated remarkable performance across a wide range of machine learning tasks. However, the high accuracy usually comes at the cost of substantial computation and energy consumption, making it difficult to be deployed on mobile and embedded devices. In CNNs, the compute-intensive convolutional layers are usually followed by a ReLU activation layer, which clamps negative outputs to zeros, resulting in large activation sparsity. By exploiting such sparsity in CNN models, we propose a software-hardware co-design BitSET, that aggressively saves energy during CNN inference. The bit-serial BitSET accelerator adopts a prediction-based bit-level early termination technique that terminates the ineffectual computation of negative outputs early. To assist the algorithm, we propose a novel weight encoding that allows more accurate predictions with fewer bits. BitSET leverages the bit-level computation reduction both in the predictive early termination algorithm and in the non-predictive, energy-efficient bit-serial architecture. Compared to UNPU, an energy-efficient bit-serial CNN accelerator, BitSET yields an average 1.5× speedup and 1.4× energy efficiency improvement with no accuracy loss due to a 48% reduction in bit-level computations. Relaxing the allowed accuracy loss to 1% increases the gains to an average of 1.6× speedup and 1.4× energy efficiency improvement.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":null,"pages":null},"PeriodicalIF":2.8000,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Embedded Computing Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3609093","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

Convolutional Neural Networks (CNNs) have demonstrated remarkable performance across a wide range of machine learning tasks. However, this high accuracy usually comes at the cost of substantial computation and energy consumption, making CNNs difficult to deploy on mobile and embedded devices. In CNNs, the compute-intensive convolutional layers are usually followed by a ReLU activation layer, which clamps negative outputs to zero, resulting in large activation sparsity. By exploiting this sparsity, we propose BitSET, a software-hardware co-design that aggressively saves energy during CNN inference. The bit-serial BitSET accelerator adopts a prediction-based, bit-level early termination technique that cuts short the ineffectual computation of negative outputs. To assist the algorithm, we propose a novel weight encoding that allows more accurate predictions with fewer bits. BitSET leverages bit-level computation reduction both in the predictive early termination algorithm and in the non-predictive, energy-efficient bit-serial architecture. Compared to UNPU, an energy-efficient bit-serial CNN accelerator, BitSET yields an average 1.5× speedup and 1.4× energy efficiency improvement with no accuracy loss, owing to a 48% reduction in bit-level computations. Relaxing the allowed accuracy loss to 1% increases the gains to an average 1.6× speedup and 1.4× energy efficiency improvement.
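To make the idea of prediction-based bit-level early termination concrete, the sketch below shows one possible way such a scheme could work for a single ReLU-gated dot product: weight bit-planes are processed most-significant-bit first, and once the partial sum plus a bound on the remaining contribution is negative, the rest of the bit-serial computation is skipped because ReLU would zero the output anyway. The function name, the bit-plane decomposition, and the conservative sign test are illustrative assumptions for this sketch, not the paper's exact algorithm or weight encoding.

```python
import numpy as np

def bit_serial_dot_relu(x, w, total_bits=8, predict_bits=4):
    """Illustrative sketch (not the paper's algorithm): bit-serial dot
    product with early termination for outputs that ReLU would zero."""
    x = np.asarray(x, dtype=np.int64)
    w = np.asarray(w, dtype=np.int64)

    acc = 0
    for b in range(total_bits - 1, -1, -1):            # MSB-first bit-planes
        plane = (np.abs(w) >> b) & 1                    # current bit of |w|
        acc += int(np.sum(np.sign(w) * plane * x)) << b

        if total_bits - b == predict_bits:
            # Upper bound on what the remaining (lower) bit-planes can add.
            remaining = int(np.sum(np.abs(x))) * ((1 << b) - 1)
            if acc + remaining < 0:                     # provably negative
                return 0                                # ReLU clamps to zero
    return max(acc, 0)                                  # apply ReLU

# Example: a random activation/weight vector pair
rng = np.random.default_rng(0)
x = rng.integers(0, 128, size=64)
w = rng.integers(-128, 128, size=64)
print(bit_serial_dot_relu(x, w))
```

The termination test shown here is conservative and therefore lossless; the paper's prediction-based scheme can terminate more aggressively and relies on the proposed weight encoding to keep the sign predictions accurate with fewer bits.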
Source journal
ACM Transactions on Embedded Computing Systems (Engineering & Technology – Computer Science: Software Engineering)
CiteScore: 3.70
Self-citation rate: 0.00%
Articles published per year: 138
Review time: 6 months
Journal description: The design of embedded computing systems, both the software and hardware, increasingly relies on sophisticated algorithms, analytical models, and methodologies. ACM Transactions on Embedded Computing Systems (TECS) aims to present the leading work relating to the analysis, design, behavior, and experience with embedded computing systems.
Latest articles in this journal
Optimizing Dilithium Implementation with AVX2/-512
Transient Fault Detection in Tensor Cores for Modern GPUs
High Performance and Predictable Shared Last-level Cache for Safety-Critical Systems
APB-tree: An Adaptive Pre-built Tree Indexing Scheme for NVM-based IoT Systems