Sub-Word Parallel Precision-Scalable MAC Engines for Efficient Embedded DNN Inference
L. Mei, Mohit Dandekar, D. Rodopoulos, J. Constantin, P. Debacker, R. Lauwereins, M. Verhelst
2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), March 2019
DOI: 10.1109/AICAS.2019.8771481
Citations: 22
Abstract
To enable energy-efficient embedded execution of Deep Neural Networks (DNNs), the critical sections of these workloads, their multiply-accumulate (MAC) operations, need to be carefully optimized. The state of the art (SotA) pursues this through run-time precision-scalable MAC operators, which can support the varying precision needs of DNNs in an energy-efficient way. Yet, to implement the adaptable-precision MAC operation, most SotA solutions rely on separately optimized low-precision multipliers and a precision-variable accumulation scheme, at the possible cost of high control complexity and degraded throughput. This paper first optimizes one of the most effective SotA techniques to support fully-connected DNN layers. This mode, which splits a high-precision multiplier into independent parallel low-precision multipliers, is called the Sum Separate (SS) mode. In addition, this work proposes an alternative low-precision scheme, namely the implicit accumulation of multiple low-precision products within the multiplier itself, called the Sum Together (ST) mode. Based on the two types of MAC arrangements explored, corresponding architectures have been proposed to implement DNN processing. The two architectures, which yield the same throughput, are compared at different working precisions (2/4/8/16-bit) based on post-synthesis simulation. The results show that the proposed ST-mode based architecture outperforms the earlier SS mode by up to 1.6× in energy efficiency (TOPS/W) and 1.5× in area efficiency (GOPS/mm²).
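The two modes admit a simple software analogue. The Python sketch below is only an illustration of the underlying arithmetic idea, not the paper's hardware datapath: the operand widths (4-bit inputs, one wide multiplication) and guard-bit choices are assumptions made for the example. It shows how a single wide multiplication can either produce two independent low-precision products that share one operand (SS-like decomposition), or implicitly accumulate two low-precision products into a single sum (ST-like accumulation).

```python
# Illustrative software analogue of sub-word parallel MAC modes.
# Packing tricks in ordinary integer arithmetic mimic how a wide multiplier
# can be reused for several low-precision operations; the 4-bit operand
# width and guard-bit spacing are assumptions for this example only.

def ss_mode_demo(a1: int, a2: int, b: int):
    """Sum Separate (SS) analogue: one wide multiplication yields two
    independent 4-bit x 4-bit products that share the operand b."""
    assert 0 <= a1 < 16 and 0 <= a2 < 16 and 0 <= b < 16
    packed = (a1 << 16) | a2      # a1 and a2 in separate sub-words, with guard bits
    wide = packed * b             # single wide multiplication
    p2 = wide & 0xFF              # low field  = a2 * b (max 225, fits in 8 bits)
    p1 = (wide >> 16) & 0xFF      # high field = a1 * b
    return p1, p2

def st_mode_demo(a1: int, b1: int, a2: int, b2: int) -> int:
    """Sum Together (ST) analogue: one wide multiplication implicitly
    accumulates two 4-bit x 4-bit products, i.e. a two-term dot product."""
    assert all(0 <= x < 16 for x in (a1, b1, a2, b2))
    s = 9                                      # 9 guard bits: a1*b1 + a2*b2 < 2**9
    wide = ((a1 << s) | a2) * ((b2 << s) | b1)
    # wide = a1*b2 * 2**(2s) + (a1*b1 + a2*b2) * 2**s + a2*b1
    return (wide >> s) & ((1 << s) - 1)        # middle field = a1*b1 + a2*b2

if __name__ == "__main__":
    assert ss_mode_demo(7, 3, 5) == (35, 15)
    assert st_mode_demo(7, 5, 3, 4) == 7 * 5 + 3 * 4   # 47
```

In the hardware architectures compared in the paper, the same effect is obtained by reconfiguring the multiplier's partial-product array and accumulation logic rather than by operand packing; the appeal of the ST arrangement for DNN inference is that the summation it performs implicitly is exactly the accumulation a dot product needs.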