Sub-Word Parallel Precision-Scalable MAC Engines for Efficient Embedded DNN Inference

L. Mei, Mohit Dandekar, D. Rodopoulos, J. Constantin, P. Debacker, R. Lauwereins, M. Verhelst
{"title":"Sub-Word Parallel Precision-Scalable MAC Engines for Efficient Embedded DNN Inference","authors":"L. Mei, Mohit Dandekar, D. Rodopoulos, J. Constantin, P. Debacker, R. Lauwereins, M. Verhelst","doi":"10.1109/AICAS.2019.8771481","DOIUrl":null,"url":null,"abstract":"To enable energy-efficient embedded execution of Deep Neural Networks (DNNs), the critical sections of these workloads, their multiply-accumulate (MAC) operations, need to be carefully optimized. The SotA pursues this through run-time precision-scalable MAC operators, which can support the varying precision needs of DNNs in an energy-efficient way. Yet, to implement the adaptable precision MAC operation, most SotA solutions rely on separately optimized low precision multipliers and a precision-variable accumulation scheme, with the possible disadvantages of a high control complexity and degraded throughput. This paper, first optimizes one of the most effective SotA techniques to support fully-connected DNN layers. This mode, exploiting the transformation of a high precision multiplier into independent parallel low-precision multipliers, will be called the Sum Separate (SS) mode. In addition, this work suggests an alternative low-precision scheme, i.e. the implicit accumulation of multiple low precision products within the multiplier itself, called the Sum Together (ST) mode. Based on the two types of MAC arrangements explored, corresponding architectures have been proposed to implement DNN processing. The two architectures, yielding the same throughput, are compared in different working precisions (2/4/8/16-bit), based on Post-Synthesis simulation. The result shows that the proposed ST-Mode based architecture outperforms the earlier SS-Mode by up to ×1.6 on Energy Efficiency (TOPS/W) and ×1.5 on Area Efficiency (GOPS/mm2).","PeriodicalId":273095,"journal":{"name":"2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"22","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/AICAS.2019.8771481","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 22

Abstract

To enable energy-efficient embedded execution of Deep Neural Networks (DNNs), the critical sections of these workloads, their multiply-accumulate (MAC) operations, need to be carefully optimized. The SotA pursues this through run-time precision-scalable MAC operators, which can support the varying precision needs of DNNs in an energy-efficient way. Yet, to implement the adaptable precision MAC operation, most SotA solutions rely on separately optimized low precision multipliers and a precision-variable accumulation scheme, with the possible disadvantages of a high control complexity and degraded throughput. This paper, first optimizes one of the most effective SotA techniques to support fully-connected DNN layers. This mode, exploiting the transformation of a high precision multiplier into independent parallel low-precision multipliers, will be called the Sum Separate (SS) mode. In addition, this work suggests an alternative low-precision scheme, i.e. the implicit accumulation of multiple low precision products within the multiplier itself, called the Sum Together (ST) mode. Based on the two types of MAC arrangements explored, corresponding architectures have been proposed to implement DNN processing. The two architectures, yielding the same throughput, are compared in different working precisions (2/4/8/16-bit), based on Post-Synthesis simulation. The result shows that the proposed ST-Mode based architecture outperforms the earlier SS-Mode by up to ×1.6 on Energy Efficiency (TOPS/W) and ×1.5 on Area Efficiency (GOPS/mm2).
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
用于高效嵌入式DNN推理的子字并行精确可扩展MAC引擎
为了实现深度神经网络(dnn)的节能嵌入式执行,需要仔细优化这些工作负载的关键部分,即它们的乘法累积(MAC)操作。SotA通过运行时精度可扩展的MAC运营商来实现这一目标,该运营商可以以节能的方式支持dnn的不同精度需求。然而,为了实现自适应精度MAC操作,大多数SotA解决方案依赖于单独优化的低精度乘子和精度变量累积方案,这可能具有高控制复杂性和降低吞吐量的缺点。本文首先优化了最有效的SotA技术之一,以支持完全连接的DNN层。这种模式利用高精度乘法器转换成独立的并行低精度乘法器,将被称为和分离(SS)模式。此外,这项工作提出了另一种低精度方案,即乘法器本身内多个低精度产品的隐式积累,称为Sum Together (ST)模式。基于所探索的两种类型的MAC安排,提出了相应的架构来实现深度神经网络处理。两种架构,产生相同的吞吐量,在不同的工作精度(2/4/8/16位)进行比较,基于合成后仿真。结果表明,所提出的基于st模式的体系结构在能量效率(TOPS/W)和面积效率(GOPS/mm2)上分别优于早期的ss模式×1.6和×1.5。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Artificial Intelligence of Things Wearable System for Cardiac Disease Detection Fast event-driven incremental learning of hand symbols Accelerating CNN-RNN Based Machine Health Monitoring on FPGA Neuromorphic networks on the SpiNNaker platform Complexity Reduction on HEVC Intra Mode Decision with modified LeNet-5
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1