DATIC: A Data-Aware Time-Domain Computing-in-Memory-Based CNN Processor With Dynamic Channel Skipping and Mapping

Jianxun Yang;Yuyao Kong;Yixuan Li;Chenfu Guo;Hao Sun;Leibo Liu;Shaojun Wei;Jun Yang;Shouyi Yin
{"title":"DATIC:一种具有动态信道跳过和映射的基于内存的CNN处理器中的数据感知时域计算","authors":"Jianxun Yang;Yuyao Kong;Yixuan Li;Chenfu Guo;Hao Sun;Leibo Liu;Shaojun Wei;Jun Yang;Shouyi Yin","doi":"10.1109/OJSSCS.2022.3216562","DOIUrl":null,"url":null,"abstract":"Due to the low-power priority of analog delay-based computation, time-domain computing-in-memory (TD-CIM) presents a splendid potential for energy-constrained edge and IoT scenarios deploying convolutional neural networks (CNNs). However, the latency in delay-based computation is proportional to the numbers and values of multiplications-and-accumulations (MACs), bottlenecking the throughput of previous data-agnostic TD-CIM-based processors which compute complete convolutions in a fixed MAC mapping manner. First, some output activations in each layer of CNNs contribute less to the final classification results, which are insignificant and can be substituted by sums of partial MACs, with a marginal accuracy degradation. Thus, complete convolution computations lead to redundant MACs. Second, activations and weights vary with input images and models. Fixed MAC mapping leads to unbalanced MAC values on delay chains, causing long idle time and latency. To address that, we design a data-aware TD-CIM-based CNN processor, DATIC, with three techniques to reduce latency: 1) a channel-skipping TD-CIM macro to remove redundant MACs for insignificant output activations (IOAs), by storing activations stationary in SRAM bitcells and shifting weights to perform only imperative MACs; 2) a convolution-order programming unit to reduce overhead of skipping redundant MACs for IOAs with random positions on feature maps; and 3) an activation-weight-adaptive channel-mapping scheduler to balance the latency of delay chains by dynamically altering the convolution mapping manner. Implemented under TSMC 28-nm technology, DATIC achieves 622.9-GOPS throughput and 32.7-TOPS/W energy efficiency for ResNet-18 with 2-b weights and 8-b activations.","PeriodicalId":100633,"journal":{"name":"IEEE Open Journal of the Solid-State Circuits Society","volume":"2 ","pages":"244-258"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8782712/9733783/09927338.pdf","citationCount":"0","resultStr":"{\"title\":\"DATIC: A Data-Aware Time-Domain Computing-in-Memory-Based CNN Processor With Dynamic Channel Skipping and Mapping\",\"authors\":\"Jianxun Yang;Yuyao Kong;Yixuan Li;Chenfu Guo;Hao Sun;Leibo Liu;Shaojun Wei;Jun Yang;Shouyi Yin\",\"doi\":\"10.1109/OJSSCS.2022.3216562\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Due to the low-power priority of analog delay-based computation, time-domain computing-in-memory (TD-CIM) presents a splendid potential for energy-constrained edge and IoT scenarios deploying convolutional neural networks (CNNs). However, the latency in delay-based computation is proportional to the numbers and values of multiplications-and-accumulations (MACs), bottlenecking the throughput of previous data-agnostic TD-CIM-based processors which compute complete convolutions in a fixed MAC mapping manner. First, some output activations in each layer of CNNs contribute less to the final classification results, which are insignificant and can be substituted by sums of partial MACs, with a marginal accuracy degradation. Thus, complete convolution computations lead to redundant MACs. Second, activations and weights vary with input images and models. 
Fixed MAC mapping leads to unbalanced MAC values on delay chains, causing long idle time and latency. To address that, we design a data-aware TD-CIM-based CNN processor, DATIC, with three techniques to reduce latency: 1) a channel-skipping TD-CIM macro to remove redundant MACs for insignificant output activations (IOAs), by storing activations stationary in SRAM bitcells and shifting weights to perform only imperative MACs; 2) a convolution-order programming unit to reduce overhead of skipping redundant MACs for IOAs with random positions on feature maps; and 3) an activation-weight-adaptive channel-mapping scheduler to balance the latency of delay chains by dynamically altering the convolution mapping manner. Implemented under TSMC 28-nm technology, DATIC achieves 622.9-GOPS throughput and 32.7-TOPS/W energy efficiency for ResNet-18 with 2-b weights and 8-b activations.\",\"PeriodicalId\":100633,\"journal\":{\"name\":\"IEEE Open Journal of the Solid-State Circuits Society\",\"volume\":\"2 \",\"pages\":\"244-258\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ieeexplore.ieee.org/iel7/8782712/9733783/09927338.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Open Journal of the Solid-State Circuits Society\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/9927338/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Open Journal of the Solid-State Circuits Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/9927338/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Due to the low-power nature of analog delay-based computation, time-domain computing-in-memory (TD-CIM) shows strong potential for energy-constrained edge and IoT scenarios deploying convolutional neural networks (CNNs). However, the latency of delay-based computation is proportional to the number and values of the multiply-and-accumulate operations (MACs), bottlenecking the throughput of previous data-agnostic TD-CIM-based processors, which compute complete convolutions with a fixed MAC mapping. First, some output activations in each layer of a CNN contribute little to the final classification result; these insignificant activations can be replaced by sums of partial MACs with only marginal accuracy degradation, so computing complete convolutions performs redundant MACs. Second, activations and weights vary with input images and models, so a fixed MAC mapping yields unbalanced MAC values across delay chains, causing long idle times and high latency. To address this, we design a data-aware TD-CIM-based CNN processor, DATIC, with three latency-reduction techniques: 1) a channel-skipping TD-CIM macro that removes redundant MACs for insignificant output activations (IOAs) by keeping activations stationary in SRAM bitcells and shifting in weights so that only the imperative MACs are performed; 2) a convolution-order programming unit that reduces the overhead of skipping redundant MACs for IOAs located at random positions on the feature maps; and 3) an activation-weight-adaptive channel-mapping scheduler that balances the latency of the delay chains by dynamically altering the convolution mapping. Implemented in TSMC 28-nm technology, DATIC achieves 622.9-GOPS throughput and 32.7-TOPS/W energy efficiency for ResNet-18 with 2-bit weights and 8-bit activations.
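The abstract only outlines why data awareness reduces latency in delay-based computation. The following Python model is a minimal sketch of that intuition, not the authors' hardware or scheduler: it assumes a purely linear delay model (chain latency proportional to the sum of MAC values mapped onto it), and all function and variable names are illustrative. It contrasts a fixed channel-to-chain mapping with a greedy balanced mapping, and shows how skipping part of the channels for an insignificant output activation shortens the critical chain.

```python
import numpy as np

rng = np.random.default_rng(0)

def delay_chain_latency(mac_values, num_chains, mapping):
    """Latency of one delay-based dot product under the assumed linear model:
    each chain's delay is proportional to the sum of MAC values mapped to it,
    chains run in parallel, and the slowest chain sets the latency."""
    chain_sums = np.zeros(num_chains)
    for value, chain in zip(mac_values, mapping):
        chain_sums[chain] += value
    return chain_sums.max()

# Toy dot product: 64 input channels spread over 8 delay chains.
num_channels, num_chains = 64, 8
mac_values = rng.integers(0, 256, size=num_channels).astype(float)

# 1) Fixed (data-agnostic) mapping: channel i -> chain i % num_chains.
fixed_map = np.arange(num_channels) % num_chains

# 2) Data-aware mapping: greedily place the largest MAC values on the
#    currently least-loaded chain, balancing per-chain delay.
order = np.argsort(mac_values)[::-1]
chain_load = np.zeros(num_chains)
aware_map = np.empty(num_channels, dtype=int)
for idx in order:
    chain = int(chain_load.argmin())
    aware_map[idx] = chain
    chain_load[chain] += mac_values[idx]

# 3) Channel skipping for an "insignificant" output activation: only a
#    subset of channels (the imperative MACs) is actually computed.
keep = rng.choice(num_channels, size=num_channels // 2, replace=False)

print("fixed mapping latency   :", delay_chain_latency(mac_values, num_chains, fixed_map))
print("balanced mapping latency:", delay_chain_latency(mac_values, num_chains, aware_map))
print("skipped-channel latency :",
      delay_chain_latency(mac_values[keep], num_chains, aware_map[keep]))
```

Under this toy model, the balanced mapping lowers the maximum per-chain sum relative to the fixed mapping, and channel skipping removes MAC values outright; DATIC's channel-mapping scheduler and channel-skipping macro target the same two effects in hardware, though by mechanisms the abstract does not detail.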