
Latest publications from Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design

Multi-Complexity-Loss DNAS for Energy-Efficient and Memory-Constrained Deep Neural Networks
Matteo Risso, A. Burrello, L. Benini, E. Macii, M. Poncino, D. J. Pagliari
Neural Architecture Search (NAS) is increasingly popular for automatically exploring the accuracy versus computational complexity trade-off of Deep Learning (DL) architectures. When targeting tiny edge devices, the main challenge for DL deployment is matching the tight memory constraints, hence most NAS algorithms consider model size as the complexity metric. Other methods reduce the energy or latency of DL models by trading off accuracy and the number of inference operations. Energy and memory are rarely considered simultaneously, in particular by low-search-cost Differentiable NAS (DNAS) solutions. We overcome this limitation by proposing the first DNAS that directly addresses the most realistic scenario from a designer's perspective: the co-optimization of accuracy and energy (or latency) under a memory constraint determined by the target hardware. We do so by combining two complexity-dependent loss functions during training, with independent strengths. Testing on three edge-relevant tasks from the MLPerf Tiny benchmark suite, we obtain rich Pareto sets of architectures in the energy vs. accuracy space, with memory footprint constraints spanning from 75% to 6.25% of the baseline networks. When deployed on a commercial edge device, the STM NUCLEO-H743ZI2, our networks span a range of 2.18× in energy consumption and 4.04% in accuracy for the same memory constraint, and reduce energy by up to 2.2× with a negligible accuracy drop with respect to the baseline.
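The exact form of the combined objective is not given in this listing; a minimal Python sketch of two complexity-dependent loss terms with independent strengths (the hinge-style memory penalty and all names here are illustrative assumptions, not the paper's formulation) could look like:

```python
def multi_complexity_loss(task_loss, energy_cost, mem_cost, mem_budget,
                          energy_strength=0.1, mem_strength=1.0):
    """Combine the task loss with two complexity terms of independent
    strength: an energy/latency regularizer, and a hinge-style memory
    penalty that stays zero while the architecture fits the budget."""
    mem_penalty = max(0.0, mem_cost - mem_budget)
    return (task_loss
            + energy_strength * energy_cost
            + mem_strength * mem_penalty)
```

Setting `mem_strength` high enough makes the memory term behave like a hard constraint, while sweeping `energy_strength` traces out an energy vs. accuracy Pareto front under that constraint.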
{"title":"Multi-Complexity-Loss DNAS for Energy-Efficient and Memory-Constrained Deep Neural Networks","authors":"Matteo Risso, A. Burrello, L. Benini, E. Macii, M. Poncino, D. J. Pagliari","doi":"10.1145/3531437.3539720","DOIUrl":"https://doi.org/10.1145/3531437.3539720","url":null,"abstract":"Neural Architecture Search (NAS) is increasingly popular to automatically explore the accuracy versus computational complexity trade-off of Deep Learning (DL) architectures. When targeting tiny edge devices, the main challenge for DL deployment is matching the tight memory constraints, hence most NAS algorithms consider model size as the complexity metric. Other methods reduce the energy or latency of DL models by trading off accuracy and number of inference operations. Energy and memory are rarely considered simultaneously, in particular by low-search-cost Differentiable NAS (DNAS) solutions. We overcome this limitation proposing the first DNAS that directly addresses the most realistic scenario from a designer’s perspective: the co-optimization of accuracy and energy (or latency) under a memory constraint, determined by the target HW. We do so by combining two complexity-dependent loss functions during training, with independent strength. Testing on three edge-relevant tasks from the MLPerf Tiny benchmark suite, we obtain rich Pareto sets of architectures in the energy vs. accuracy space, with memory footprints constraints spanning from 75% to 6.25% of the baseline networks. 
When deployed on a commercial edge device, the STM NUCLEO-H743ZI2, our networks span a range of 2.18x in energy consumption and 4.04% in accuracy for the same memory constraint, and reduce energy by up to 2.2 × with negligible accuracy drop with respect to the baseline.","PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123237739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Neural Contextual Bandits Based Dynamic Sensor Selection for Low-Power Body-Area Networks
B. U. Demirel, Luke Chen, M. A. Faruque
Providing health monitoring devices with machine intelligence is important for enabling automatic mobile healthcare applications. However, this brings additional challenges due to the resource scarcity of these devices. This work introduces a neural-contextual-bandits-based dynamic sensor selection methodology for high-performance and resource-efficient body-area networks, to realize next-generation mobile health monitoring devices. The methodology utilizes contextual bandits to select the most informative sensor combinations at runtime and ignore redundant data, decreasing transmission and computing power in a body-area network (BAN). The proposed method has been validated on one of the most common health monitoring applications: cardiac activity monitoring. Solutions from our proposed method are compared against those from related works in terms of classification performance and energy, taking communication energy consumption into account. Our final solutions reach 78.8% AU-PRC on the PTB-XL ECG dataset for cardiac abnormality detection while decreasing the overall energy consumption and computational energy by 3.7× and 4.3×, respectively.
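The runtime selection step can be pictured as a contextual bandit choosing a sensor subset per context; the sketch below uses an epsilon-greedy policy with a generic value function as a simplifying stand-in (the paper uses neural contextual bandits, and `value_fn`, `epsilon`, and the subset encoding are all illustrative assumptions):

```python
import random

def select_sensors(context, value_fn, sensor_subsets, epsilon=0.1, rng=random):
    """One bandit step: with probability epsilon explore a random sensor
    subset; otherwise exploit the subset whose estimated reward (e.g.
    accuracy minus transmission cost) is highest for this context."""
    if rng.random() < epsilon:
        return rng.choice(sensor_subsets)
    return max(sensor_subsets, key=lambda subset: value_fn(context, subset))
```

After each decision the observed reward would be fed back to update `value_fn`, which is what lets the selection adapt to the wearer's current activity.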
{"title":"Neural Contextual Bandits Based Dynamic Sensor Selection for Low-Power Body-Area Networks","authors":"B. U. Demirel, Luke Chen, M. A. Faruque","doi":"10.1145/3531437.3539713","DOIUrl":"https://doi.org/10.1145/3531437.3539713","url":null,"abstract":"Providing health monitoring devices with machine intelligence is important for enabling automatic mobile healthcare applications. However, this brings additional challenges due to the resource scarcity of these devices. This work introduces a neural contextual bandits based dynamic sensor selection methodology for high-performance and resource-efficient body-area networks to realize next generation mobile health monitoring devices. The methodology utilizes contextual bandits to select the most informative sensor combinations during runtime and ignore redundant data for decreasing transmission and computing power in a body area network (BAN). The proposed method has been validated using one of the most common health monitoring applications: cardiac activity monitoring. Solutions from our proposed method are compared against those from related works in terms of classification performance and energy while considering the communication energy consumption. 
Our final solutions could reach 78.8% AU-PRC on the PTB-XL ECG dataset for cardiac abnormality detection while decreasing the overall energy consumption and computational energy by 3.7 × and 4.3 ×, respectively.","PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"19 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114032686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
A comprehensive methodology to determine optimal coherence interfaces for many-accelerator SoCs
K. Bhardwaj, Marton Havasi, Yuan Yao, D. Brooks, José Miguel Hernández-Lobato, Gu-Yeon Wei
Modern systems-on-chip (SoCs) include not only general-purpose CPUs but also specialized hardware accelerators. Typically, there are three coherence model choices for integrating an accelerator with the memory hierarchy: no coherence, coherent with the last-level cache (LLC), and private-cache-based full coherence. However, there has been very limited research on which coherence models are optimal for the accelerators of a complex many-accelerator SoC. This paper focuses on determining a cost-aware coherence interface for an SoC and its target application: finding the best coherence models for the accelerators that optimize their power and performance, considering both workload characteristics and system-level contention. A novel comprehensive methodology is proposed that uses Bayesian optimization to efficiently find the cost-aware coherence interfaces for SoCs that are modeled using the gem5-Aladdin architectural simulator. For a complete analysis, gem5-Aladdin is extended to support LLC coherence in addition to the already-supported no coherence and full coherence. For a heterogeneous SoC targeting applications with varying amounts of accelerator-level parallelism, the proposed framework rapidly finds cost-aware coherence interfaces that show significant performance and power benefits over the other commonly used coherence interfaces.
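The search space here is the per-accelerator choice among the three coherence models. As a hedged sketch of that space, the brute-force baseline below scores every assignment with a caller-supplied cost function; the paper replaces this exhaustive search with Bayesian optimization over gem5-Aladdin simulations, so `cost_fn` and the exhaustive loop are illustrative assumptions:

```python
from itertools import product

# The three coherence models named in the abstract.
COHERENCE_MODELS = ("none", "llc", "full")

def best_interface(accelerators, cost_fn):
    """Enumerate every per-accelerator coherence assignment (a tuple with
    one model per accelerator) and return the one with the lowest cost."""
    assignments = product(COHERENCE_MODELS, repeat=len(accelerators))
    return min(assignments, key=cost_fn)
```

The exhaustive version grows as 3^N in the number of accelerators, which is exactly why a sample-efficient search such as Bayesian optimization is attractive when each cost evaluation is a full architectural simulation.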
{"title":"A comprehensive methodology to determine optimal coherence interfaces for many-accelerator SoCs","authors":"K. Bhardwaj, Marton Havasi, Yuan Yao, D. Brooks, José Miguel Hernández-Lobato, Gu-Yeon Wei","doi":"10.1145/3370748.3406564","DOIUrl":"https://doi.org/10.1145/3370748.3406564","url":null,"abstract":"Modern systems-on-chip (SoCs) include not only general-purpose CPUs but also specialized hardware accelerators. Typically, there are three coherence model choices to integrate an accelerator with the memory hierarchy: no coherence, coherent with the last-level cache (LLC), and private cache based full coherence. However, there has been very limited research on finding which coherence models are optimal for the accelerators of a complex many-accelerator SoC. This paper focuses on determining a cost-aware coherence interface for an SoC and its target application: find the best coherence models for the accelerators that optimize their power and performance, considering both workload characteristics and system-level contention. A novel comprehensive methodology is proposed that uses Bayesian optimization to efficiently find the cost-aware coherence interfaces for SoCs that are modeled using the gem5-Aladdin architectural simulator. For a complete analysis, gem5-Aladdin is extended to support LLC coherence in addition to already-supported no coherence and full coherence. 
For a heterogeneous SoC targeting applications with varying amount of accelerator-level parallelism, the proposed framework rapidly finds cost-aware coherence interfaces that show significant performance and power benefits over the other commonly-used coherence interfaces.","PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128375408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
A 1.2-V, 1.8-GHz low-power PLL using a class-F VCO for driving 900-MHz SRD band SC-circuits
Tim Schumacher, Markus Stadelmayer, Thomas Faseth, H. Pretl
This work presents a 1.6 GHz to 2 GHz integer-N PLL with 2 MHz stepping, optimized for driving low-power 180 nm switched-capacitor (SC) circuits from a 1.2 V supply. To reduce the overall power consumption, a class-F VCO is implemented. The oscillator's enriched odd harmonics produce a rectangular oscillator signal, which allows omitting output buffering stages. The rectangular signal lowers power consumption and allows the oscillator signal to directly drive SC-filters and an RF-divider. In addition, the proposed RF-divider includes differential 4-phase signal generation at the 868 MHz and 915 MHz SRD band frequencies, which can be used for complex modulation schemes. With a fully integrated loop filter, a high degree of integration is achieved. A test chip was manufactured in a 1P6M 180 nm CMOS technology with a triple-well option and confirms a PLL with a total active power consumption of 4.1 mW. It achieves a phase noise of -111 dBc/Hz at 1 MHz offset and a -42 dBc spurious response from a 1 MHz reference.
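In an integer-N PLL the output is an integer multiple of the reference, so the channel stepping equals the comparison frequency. The arithmetic below assumes a 2 MHz comparison frequency inferred from the quoted 2 MHz stepping (the abstract does not state the reference chain, so this is an assumption):

```python
def pll_output_freq(n, f_ref=2e6):
    """Integer-N PLL: f_out = N x f_ref, so a 2 MHz comparison frequency
    gives the 2 MHz stepping across the 1.6-2 GHz tuning range."""
    return n * f_ref
```

A subsequent RF divide-by-2 then maps the 1.6-2 GHz VCO range onto the 800-1000 MHz span covering the 868/915 MHz SRD bands, while also providing the quadrature phases.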
{"title":"A 1.2-V, 1.8-GHz low-power PLL using a class-F VCO for driving 900-MHz SRD band SC-circuits","authors":"Tim Schumacher, Markus Stadelmayer, Thomas Faseth, H. Pretl","doi":"10.1145/3370748.3406551","DOIUrl":"https://doi.org/10.1145/3370748.3406551","url":null,"abstract":"This work presents a 1.6 GHz to 2 GHz integer PLL with 2 MHz stepping, which is optimized for driving low-power 180 nm switched-capacitor (SC) circuits at a 1.2 V supply. To reduce the overall power consumption, a class-F VCO is implemented. Due to enriched odd harmonics of the oscillator, a rectangular oscillator signal is generated, which allows omitting output buffering stages. The rectangular signal results in a lowered power consumption and enables to directly drive SC-filters and an RF-divider using the oscillator signal. In addition, the proposed RF-divider includes a differential 4-phase signal generation at 868 MHz and 915 MHz SRD band frequencies that can be used for complex modulation schemes. With a fully integrated loop-filter, a maximum of integration is achieved. A test-chip was manufactured in a 1P6M 180 nm CMOS technology with triple-well option and confirms a PLL with a total active power consumption of 4.1 mW. It achieves a phase noise of -111 dBc/Hz at 1 MHz offset and a -42 dBc spurious response from a 1 MHz reference.","PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114731474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Approximate inference systems (AxIS): end-to-end approximations for energy-efficient inference at the edge
Soumendu Kumar Ghosh, Arnab Raha, V. Raghunathan
The rapid proliferation of the Internet of Things (IoT) and the dramatic resurgence of artificial intelligence (AI) based application workloads have led to immense interest in performing inference on energy-constrained edge devices. Approximate computing (a design paradigm that yields large energy savings at the cost of a small degradation in application quality) is a promising technique for enabling energy-efficient inference at the edge. This paper introduces the concept of an approximate inference system (AxIS) and proposes a systematic methodology to perform joint approximations across different subsystems in a deep neural network-based inference system, leading to significant energy benefits compared to approximating individual subsystems in isolation. We use a smart camera system that executes various convolutional neural network (CNN) based image recognition applications to illustrate how the sensor, memory, compute, and communication subsystems can all be approximated synergistically. We demonstrate our proposed methodology using two variants of a smart camera system: (a) Camedge, where the CNN executes locally on the edge device, and (b) Camcloud, where the edge device sends the captured image to a remote cloud server that executes the CNN. We have prototyped such an approximate inference system using an Altera Stratix IV GX-based Terasic TR4-230 FPGA development board. Experimental results obtained using six CNNs demonstrate significant energy savings (around 1.7× for Camedge and 3.5× for Camcloud) for minimal (< 1%) loss in application quality. Compared to approximating a single subsystem in isolation, AxIS achieves additional energy benefits of 1.6×--1.7× (Camedge) and 1.4×--3.4× (Camcloud) on average for minimal application-level quality loss.
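The core idea of joint (rather than per-subsystem) approximation can be sketched as a search over all subsystem knobs at once, keeping the lowest-energy configuration that still meets a quality floor. The knob encoding, cost and quality functions below are illustrative assumptions, not the paper's methodology:

```python
from itertools import product

def co_approximate(knob_choices, energy_fn, quality_fn, min_quality):
    """Jointly search approximation knobs across subsystems (e.g. sensor,
    memory, compute, communication): enumerate all cross-subsystem settings,
    keep those meeting the quality floor, and return the lowest-energy one.
    Returns None when no setting is feasible."""
    feasible = [cfg for cfg in product(*knob_choices)
                if quality_fn(cfg) >= min_quality]
    return min(feasible, key=energy_fn) if feasible else None
```

The benefit over tuning each subsystem in isolation is that the joint search can trade a little aggressiveness in one subsystem for a lot in another, which per-subsystem tuning cannot see.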
{"title":"Approximate inference systems (AxIS): end-to-end approximations for energy-efficient inference at the edge","authors":"Soumendu Kumar Ghosh, Arnab Raha, V. Raghunathan","doi":"10.1145/3370748.3406575","DOIUrl":"https://doi.org/10.1145/3370748.3406575","url":null,"abstract":"The rapid proliferation of the Internet-of-Things (IoT) and the dramatic resurgence of artificial intelligence (AI) based application workloads has led to immense interest in performing inference on energy-constrained edge devices. Approximate computing (a design paradigm that yields large energy savings at the cost of a small degradation in application quality) is a promising technique to enable energy-efficient inference at the edge. This paper introduces the concept of an approximate inference system (AxIS) and proposes a systematic methodology to perform joint approximations across different subsystems in a deep neural network-based inference system, leading to significant energy benefits compared to approximating individual subsystems in isolation. We use a smart camera system that executes various convolutional neural network (CNN) based image recognition applications to illustrate how the sensor, memory, compute, and communication subsystems can all be approximated synergistically. We demonstrate our proposed methodology using two variants of a smart camera system: (a) Camedge, where the CNN executes locally on the edge device, and (b) Camcloud, where the edge device sends the captured image to a remote cloud server that executes the CNN. We have prototyped such an approximate inference system using an Altera Stratix IV GX-based Terasic TR4-230 FPGA development board. Experimental results obtained using six CNNs demonstrate significant energy savings (around 1.7× for Camedge and 3.5× for Camcloud) for minimal (< 1%) loss in application quality. 
Compared to approximating a single subsystem in isolation, AxIS achieves additional energy benefits of 1.6×--1.7× (Camedge) and 1.4×--3.4× (Camcloud) on average for minimal application-level quality loss.","PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121760469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
STINT
Pub Date: 2020-08-10 DOI: 10.1007/978-3-642-41714-6_197299
Tao-Yi Lee, Khuong Vo, Wongi Baek, Michelle Khine, N. Dutt
{"title":"STINT","authors":"Tao-Yi Lee, Khuong Vo, Wongi Baek, Michelle Khine, N. Dutt","doi":"10.1007/978-3-642-41714-6_197299","DOIUrl":"https://doi.org/10.1007/978-3-642-41714-6_197299","url":null,"abstract":"","PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"364 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134480540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SparTANN
H. Sim, Jooyeon Choi, Jongeun Lee
While sparsity has been exploited in many inference accelerators, little work has been done on training accelerators. Exploiting sparsity in training accelerators involves multiple issues, including where to find sparsity, how to exploit it, and how to create more of it. In this paper, we present a novel sparse training architecture that can exploit sparsity in gradient tensors in both the back-propagation and weight-update computations. We also propose a single-pass sparsification algorithm, a hardware-friendly version of a recently proposed sparse training algorithm, that can aggressively create additional sparsity during training. Our experimental results using large networks such as AlexNet and GoogleNet demonstrate that our sparse training architecture can accelerate convolution-layer training time by 4.20~8.88× over baseline dense training without accuracy loss, and further increase training speed by 7.30~11.87× over the baseline with minimal accuracy loss.
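The single-pass sparsification algorithm itself is not described in this listing; as a hedged stand-in, one-sweep magnitude thresholding of a gradient tensor illustrates the general idea of creating extra sparsity that back-propagation and the weight update can then skip (the threshold rule is an illustrative assumption, not the paper's algorithm):

```python
def sparsify_gradients(grads, threshold):
    """Single-sweep magnitude sparsification: zero out gradient entries
    whose magnitude falls below the threshold. Hardware-friendly in the
    sense that one pass suffices -- no sorting or second traversal."""
    return [g if abs(g) >= threshold else 0.0 for g in grads]
```

Zeros introduced this way let the accelerator skip the corresponding multiply-accumulates in both the back-propagation and weight-update data paths.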
{"title":"SparTANN","authors":"H. Sim, Jooyeon Choi, Jongeun Lee","doi":"10.1145/3370748.3406554","DOIUrl":"https://doi.org/10.1145/3370748.3406554","url":null,"abstract":"While sparsity has been exploited in many inference accelerators, not much work is done for training accelerators. Exploiting sparsity in training accelerators involves multiple issues, including where to find sparsity, how to exploit sparsity, and how to create more sparsity. In this paper we present a novel sparse training architecture that can exploit sparsity in gradient tensors in both back propagation and weight update computation. We also propose a single-pass sparsification algorithm, which is a hardware-friendly version of a recently proposed sparse training algorithm, that can create additional sparsity aggressively during training. Our experimental results using large networks such as AlexNet and GoogleNet demonstrate that our sparse training architecture can accelerate convolution layer training time by 4.20~8.88× over baseline dense training without accuracy loss, and further increase the training speed by 7.30~11.87× over the baseline with minimal accuracy loss.","PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"194 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114748290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Sound event detection with binary neural networks on tightly power-constrained IoT devices
G. Cerutti, Renzo Andri, L. Cavigelli, Elisabetta Farella, M. Magno, L. Benini
Sound event detection (SED) is a hot topic in consumer and smart city applications. Existing approaches based on deep neural networks (DNNs) are very effective, but highly demanding in terms of memory, power, and throughput when targeting ultra-low-power always-on devices. Latency, availability, cost, and privacy requirements are pushing recent IoT systems to process the data on the node, close to the sensor, with a very limited energy supply and tight constraints on memory size and processing capabilities that preclude running state-of-the-art DNNs. In this paper, we explore the combination of extreme quantization to a small-footprint binary neural network (BNN) with the highly energy-efficient, RISC-V-based (8+1)-core GAP8 microcontroller. Starting from an existing CNN for SED whose footprint (815 kB) exceeds the 512 kB of memory available on our platform, we retrain the network using binary filters and activations to match these memory constraints. (Fully) binary neural networks come with a natural drop in accuracy of 12-18% on the challenging ImageNet object recognition challenge compared to their equivalent full-precision baselines. This BNN reaches 77.9% accuracy, just 7% lower than the full-precision version, with 58 kB (7.2× less) for the weights and 262 kB (2.4× less) memory in total. With our BNN implementation, we reach a peak throughput of 4.6 GMAC/s and 1.5 GMAC/s over the full network, including preprocessing with Mel bins, which corresponds to an efficiency of 67.1 GMAC/s/W and 31.3 GMAC/s/W, respectively. Compared to an ARM Cortex-M4 implementation, our system has a 10.3× faster execution time and a 51.1× higher energy efficiency.
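A core primitive behind BNN efficiency is replacing multiply-accumulate with XNOR and popcount over bit-packed {-1, +1} vectors. The small sketch below shows the standard identity dot = 2*matches - n (the bit-packing convention, mapping bit 1 to +1 and bit 0 to -1, is an illustrative choice, not necessarily GAP8's layout):

```python
def binary_dot(w_bits, x_bits, n):
    """Dot product of two {-1,+1} vectors packed as n-bit integers:
    popcount of the XNOR counts matching positions, and each match
    contributes +1 while each mismatch contributes -1."""
    matches = bin(~(w_bits ^ x_bits) & ((1 << n) - 1)).count("1")
    return 2 * matches - n
```

One 32-bit XNOR plus a popcount thus replaces 32 full-precision multiply-accumulates, which is where both the memory and the throughput gains quoted above come from.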
{"title":"Sound event detection with binary neural networks on tightly power-constrained IoT devices","authors":"G. Cerutti, Renzo Andri, L. Cavigelli, Elisabetta Farella, M. Magno, L. Benini","doi":"10.1145/3370748.3406588","DOIUrl":"https://doi.org/10.1145/3370748.3406588","url":null,"abstract":"Sound event detection (SED) is a hot topic in consumer and smart city applications. Existing approaches based on deep neural networks (DNNs) are very effective, but highly demanding in terms of memory, power, and throughput when targeting ultra-low power always-on devices. Latency, availability, cost, and privacy requirements are pushing recent IoT systems to process the data on the node, close to the sensor, with a very limited energy supply, and tight constraints on the memory size and processing capabilities precluding to run state-of-the-art DNNs. In this paper, we explore the combination of extreme quantization to a small-footprint binary neural network (BNN) with the highly energy-efficient, RISC-V-based (8+1)-core GAP8 microcontroller. Starting from an existing CNN for SED whose footprint (815 kB) exceeds the 512 kB of memory available on our platform, we retrain the network using binary filters and activations to match these memory constraints. (Fully) binary neural networks come with a natural drop in accuracy of 12-18% on the challenging ImageNet object recognition challenge compared to their equivalent full-precision baselines. This BNN reaches a 77.9% accuracy, just 7% lower than the full-precision version, with 58 kB (7.2× less) for the weights and 262 kB (2.4× less) memory in total. With our BNN implementation, we reach a peak throughput of 4.6 GMAC/s and 1.5 GMAC/s over the full network, including preprocessing with Mel bins, which corresponds to an efficiency of 67.1 GMAC/s/W and 31.3 GMAC/s/W, respectively. 
Compared to the performance of an ARM Cortex-M4 implementation, our system has a 10.3× faster execution time and a 51.1× higher energy-efficiency.","PeriodicalId":116486,"journal":{"name":"Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134254748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 28
Integrating event-based dynamic vision sensors with sparse hyperdimensional computing: a low-power accelerator with online learning capability
Michael Hersche, Edoardo Mello Rella, Alfio Di Mauro, L. Benini, Abbas Rahimi
We propose to embed features extracted from event-driven dynamic vision sensors into binary sparse representations in hyperdimensional (HD) space for regression. This embedding compresses events generated across 346×260 differential pixels to a sparse 8160-bit vector by applying random activation functions. The sparse representation not only simplifies inference, but also enables online learning with the same memory footprint. Specifically, it allows efficient updates by retaining binary vector components over the course of online learning, which cannot otherwise be achieved with dense representations demanding multibit vector components. We demonstrate the online learning capability: using estimates and confidences of an initial model trained with only 25% of the data, our method continuously updates the model on the remaining 75% of the data, closely matching the accuracy obtained with an oracle model trained on ground-truth labels. When mapped onto an 8-core accelerator, our method also achieves lower error, latency, and energy compared to other sparse/dense alternatives. Furthermore, it is 9.84× more energy-efficient and 6.25× faster than an optimized 9-layer perceptron with comparable accuracy.
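The random activation functions behind the 8160-bit embedding are not detailed in this listing; a hedged sketch of one way to obtain a fixed-sparsity binary hypervector, using a seeded random projection followed by top-k thresholding, is shown below (`active_bits` and the projection scheme are illustrative assumptions):

```python
import random

def sparse_hd_embed(features, dim=8160, active_bits=256, seed=0):
    """Project a feature vector into a dim-bit hypervector and keep only
    the top-k responses active, yielding a binary code with a fixed number
    of set bits -- the property that keeps online updates cheap."""
    rng = random.Random(seed)  # fixed seed: same projection every call
    proj = [[rng.uniform(-1, 1) for _ in features] for _ in range(dim)]
    scores = [sum(p * f for p, f in zip(row, features)) for row in proj]
    cutoff = sorted(scores, reverse=True)[active_bits - 1]
    return [1 if s >= cutoff else 0 for s in scores]
```

Because every embedded vector has the same number of active bits, model updates only touch that small fixed set of components instead of a dense multibit vector.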
DOI: 10.1145/3370748.3406560 · Published: 2020-08-10
Citations: 20
Steady state driven power gating for lightening always-on state retention storage
Taehwan Kim, Gyoung-Hwan Hyun, Taewhan Kim
It is generally known that a considerable portion of the flip-flops in a circuit have a mux-feedback loop (called a self-loop); these are the critical (inherently unavoidable) bottleneck in minimizing the total always-on storage needed when allocating non-uniform multi-bits for retaining flip-flop states in power-gated circuits. This is because every self-loop flip-flop must be replaced with a distinct retention flip-flop holding at least one bit of storage, since there is no clue as to where its state, upon waking up, comes from, i.e., from the mux-feedback loop or from driving flip-flops other than itself. This work breaks the bottleneck by safely treating a large portion of the self-loop flip-flops as if they were flip-flops with no self-loop. Specifically, we design a novel steady-state monitoring mechanism, operating for a few cycles just before sleeping on a partial set of self-loop flip-flops, by which the expensive state retention storage is never needed for the monitored flip-flops, contributing to a significant saving in the total size of the always-on state retention storage for power gating.
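The monitoring idea above can be caricatured in a few lines. This is a behavioral sketch under assumptions (the window length and the toy traces are hypothetical), not the paper's circuit-level mechanism: a self-loop flip-flop that holds a steady value through the pre-sleep monitoring window needs no always-on retention bit, since its wake-up state is already determined.

```python
MONITOR_CYCLES = 4  # hypothetical pre-sleep monitoring window


def needs_retention(trace):
    """Decide whether one self-loop flip-flop needs a retention bit.

    trace: the flip-flop's values over the cycles before sleep.
    If it held a steady value through the monitoring window, its
    wake-up state is known, so no retention storage is required.
    """
    window = trace[-MONITOR_CYCLES:]
    return len(set(window)) != 1  # changed during the window -> retain


traces = {
    "ff0": [0, 0, 0, 0, 0, 0],  # steady -> no retention storage
    "ff1": [1, 0, 1, 0, 1, 0],  # still toggling -> must be retained
    "ff2": [0, 1, 1, 1, 1, 1],  # settled before sleep -> no retention
}
retained = [name for name, tr in traces.items() if needs_retention(tr)]
print(retained)  # -> ['ff1']
```

The saving comes from the fact that only the flip-flops in `retained` consume always-on storage; the steady ones are handled as if they had no self-loop.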
DOI: 10.1145/3370748.3406556 · Published: 2020-08-10
Citations: 0
Journal:
Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design