
Latest Publications from the 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)

DWE: Decrypting Learning with Errors with Errors
Pub Date : 2018-06-01 DOI: 10.1145/3195970.3196032
S. Bian, Masayuki Hiromoto, Takashi Sato
The Learning with Errors (LWE) problem is a novel foundation for a variety of cryptographic applications, including quantum-secure public-key encryption, digital signatures, and fully homomorphic encryption. In this work, we propose an approximate decryption technique for LWE-based cryptosystems. Based on the fact that the decryption process for such systems is inherently approximate, we apply hardware-based approximate computing techniques. Rigorous experiments have shown that the proposed technique simultaneously achieves a 1.3× (resp., 2.5×) speed increase, 2.06× (resp., 7.89×) area reduction, 20.5% (resp., 4×) power reduction, and an average 27.1% (resp., 65.6%) ciphertext-size reduction for a public-key encryption scheme (resp., a state-of-the-art fully homomorphic encryption scheme).
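The "inherently approximate" decryption the abstract leans on is visible in a toy LWE instance: the ciphertext carries a small noise term, and decryption succeeds by rounding it away, which leaves slack for further hardware approximation. A minimal sketch, with toy parameters and helper names that are illustrative rather than from the paper:

```python
import random

q, n = 257, 8  # toy modulus and secret dimension (illustrative only)

def keygen():
    return [random.randrange(q) for _ in range(n)]

def encrypt(bit, s, noise_bound=10):
    # b = <a, s> + e + bit * floor(q/2)  (mod q), with small noise e
    a = [random.randrange(q) for _ in range(n)]
    e = random.randrange(-noise_bound, noise_bound + 1)
    b = (sum(ai * si for ai, si in zip(a, s)) + e + bit * (q // 2)) % q
    return a, b

def decrypt(a, b, s):
    # Rounding to the nearest multiple of q/2 absorbs any noise |e| < q/4,
    # which is why decryption tolerates extra approximation error on top.
    phase = (b - sum(ai * si for ai, si in zip(a, s))) % q
    return 1 if q // 4 <= phase < 3 * q // 4 else 0

s = keygen()
for bit in (0, 1):
    a, b = encrypt(bit, s)
    assert decrypt(a, b, s) == bit
```

Because correctness only needs the accumulated error to stay below q/4, parts of the decryption datapath can be computed imprecisely, which is the headroom the paper's hardware techniques exploit.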
Cited by: 4
Dynamic Management of Key States for Reinforcement Learning-assisted Garbage Collection to Reduce Long Tail Latency in SSD
Pub Date : 2018-06-01 DOI: 10.1145/3195970.3196034
Won-Kyung Kang, S. Yoo
Garbage collection (GC) is one of the main causes of the long-tail latency problem in storage systems. Long-tail latency due to GC can be more than 100 times greater than the average latency at the 99th percentile. Because of such long tail latency, real-time and quality-critical systems cannot meet their requirements. In this study, we propose a novel key-state management technique for reinforcement learning-assisted garbage collection. The goal is to dynamically manage key states drawn from a significant number of state candidates. Dynamic management lets us exploit suitable and frequently recurring key states at a small area cost, since the full state set does not have to be managed. Experimental results show that the proposed technique reduces long-tail latency by 22–25% compared to a state-of-the-art scheme on real-world workloads.
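The idea of managing only a bounded set of "key" states out of a much larger candidate space can be sketched as a capacity-limited Q-table that evicts the least frequently recurring state. The class name, capacity, and eviction rule below are hypothetical illustrations of the concept, not the paper's actual design:

```python
from collections import defaultdict

class KeyStateQTable:
    """Q-table bounded to `capacity` key states; when full, the least
    frequently recurring state is evicted to make room for a new one."""
    def __init__(self, capacity=16, n_actions=4):
        self.capacity = capacity
        self.n_actions = n_actions
        self.q = {}                      # only the managed key states
        self.visits = defaultdict(int)   # recurrence statistics

    def lookup(self, state):
        self.visits[state] += 1
        if state not in self.q:
            if len(self.q) >= self.capacity:
                # evict the key state that recurs least often
                victim = min(self.q, key=lambda st: self.visits[st])
                del self.q[victim]
            self.q[state] = [0.0] * self.n_actions
        return self.q[state]
```

The full candidate space is never materialized; only states that actually recur keep Q-values, which is what keeps the area cost small.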
Cited by: 20
CMP-PIM: An Energy-Efficient Comparator-based Processing-In-Memory Neural Network Accelerator
Pub Date : 2018-06-01 DOI: 10.1145/3195970.3196009
Shaahin Angizi, Zhezhi He, A. S. Rakin, Deliang Fan
In this paper, an energy-efficient and high-speed comparator-based processing-in-memory accelerator (CMP-PIM) is proposed to efficiently execute a novel hardware-oriented comparator-based deep neural network called CMPNET. Inspired by the local binary pattern feature-extraction method combined with depthwise separable convolution, we first modify the existing Convolutional Neural Network (CNN) algorithm by replacing the computationally intensive multiplications in convolution layers with more efficient and less complex comparison and addition. Then, we propose CMP-PIM, which employs parallel computational memory sub-arrays as its fundamental processing units, based on SOT-MRAM. We compare CMP-PIM accelerator performance on different data-sets with recent CNN accelerator designs. At comparable inference accuracy on the SVHN data-set, CMP-PIM achieves ~94× and 3× better energy efficiency than CNN and Local Binary CNN (LBCNN), respectively. Besides, it achieves a 4.3× speed-up over the CNN baseline with an identical network configuration.
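The core algorithmic move, replacing each multiply in a convolution window with a comparison plus an add/subtract, can be illustrated functionally. This is a loose sketch inspired by local binary patterns, not the exact CMPNET operator:

```python
def cmp_dot(window, weights):
    """Multiplication-free 'dot product': each input is compared against
    the window's center value, and the weight is added or subtracted
    according to the comparison outcome (comparator + adder only)."""
    center = window[len(window) // 2]
    return sum(w if x >= center else -w for x, w in zip(window, weights))

# 3x3 activation window flattened row-major; comparators pick each sign
acts = [5, 1, 7, 2, 4, 9, 3, 8, 6]
ws   = [1, 2, 1, 2, 4, 2, 1, 2, 1]
out = cmp_dot(acts, ws)
```

In hardware, the comparison and the signed accumulation map onto memory-array comparators and adders, which is why the design avoids multiplier circuits entirely.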
Cited by: 75
Extensive Evaluation of Programming Models and ISAs Impact on Multicore Soft Error Reliability
Pub Date : 2018-06-01 DOI: 10.1145/3195970.3196050
F. Rosa, Vitor V. Bandeira, R. Reis, Luciano Ost
To take advantage of the performance enhancements provided by multicore processors, new instruction set architectures (ISAs) and parallel programming libraries have been investigated across multiple industrial segments. This paper investigates the impact of parallelization libraries and distinct ISAs on the soft error reliability of two multicore ARM processor models (i.e., Cortex-A9 and Cortex-A72), running the Linux kernel and benchmarks with up to 87 billion instructions. An extensive soft error evaluation with more than 1.2 million simulation hours, considering the ARMv7 and ARMv8 ISAs and the NAS Parallel Benchmark (NPB) suite, is presented.
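Soft-error campaigns of this kind typically inject single bit-flips into architectural state during simulation and then classify the run's outcome. A minimal fault-injection step might look like the following; the register-file model and register width are illustrative, not the paper's simulator:

```python
import random

def inject_bitflip(regfile, width=32, rng=random):
    """Flip one random bit of one random register, emulating a
    single-event upset in architectural state."""
    reg = rng.randrange(len(regfile))
    bit = rng.randrange(width)
    regfile[reg] ^= 1 << bit
    return reg, bit

regs = [0] * 16
reg, bit = inject_bitflip(regs)
assert regs[reg] == 1 << bit
```

A campaign repeats this over many runs and instruction points (here, millions of simulation hours) and tallies whether each flip is masked, corrupts output, or crashes the system.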
Cited by: 6
Efficient Batch Statistical Error Estimation for Iterative Multi-level Approximate Logic Synthesis
Pub Date : 2018-06-01 DOI: 10.1145/3195970.3196038
Sanbao Su, Yi Wu, Weikang Qian
Approximate computing is an emerging energy-efficient paradigm for error-resilient applications, and approximate logic synthesis (ALS) is an important branch of it. To improve existing ALS flows, one key issue is to derive a more accurate and efficient batch error-estimation technique for all approximate transformations under consideration. In this work, we propose a novel batch error-estimation method based on Monte Carlo simulation and local change propagation. It is generally applicable to any statistical error measurement, such as error rate and average error magnitude. We applied the technique to an existing state-of-the-art ALS approach and demonstrated its effectiveness in deriving better approximate circuits.
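The two statistical error measures the abstract names, error rate and average error magnitude, are straightforward to estimate by Monte Carlo simulation. The sketch below compares an exact adder against a hypothetical approximate one; the circuit, sample count, and function names are illustrative, and the paper's contribution (batching many transformations via local change propagation) is not reproduced here:

```python
import random

def mc_error_stats(exact, approx, n_bits=8, samples=20000, seed=1):
    """Monte Carlo estimate of (error rate, average error magnitude)
    for an approximate circuit vs. its exact reference."""
    rng = random.Random(seed)
    errors, total_mag = 0, 0
    for _ in range(samples):
        x = rng.getrandbits(n_bits)
        y = rng.getrandbits(n_bits)
        diff = abs(exact(x, y) - approx(x, y))
        if diff:
            errors += 1
            total_mag += diff
    return errors / samples, total_mag / samples

# hypothetical approximate adder: low 2 bits computed by OR, dropping carries
exact_add = lambda x, y: x + y
approx_add = lambda x, y: (((x >> 2) + (y >> 2)) << 2) | ((x | y) & 3)
rate, mag = mc_error_stats(exact_add, approx_add)
```

Re-simulating from scratch for every candidate transformation is the expensive part; the paper's batch method amortizes one set of Monte Carlo samples across all candidates by propagating only local changes.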
Cited by: 21
A Neuromorphic Design Using Chaotic Mott Memristor with Relaxation Oscillation
Pub Date : 2018-06-01 DOI: 10.1145/3195970.3195977
Bonan Yan, Xiong Cao, Hai Li
The recently proposed nanoscale Mott memristor features negative differential resistance and chaotic dynamics. This work proposes a novel neuromorphic computing system that utilizes Mott memristors to simplify peripheral circuitry. Guided by an analytic description of the chaotic dynamics and relaxation oscillation, we carefully tune the working point of the Mott memristors to balance chaotic behavior, weighing testing accuracy against training efficiency. Compared with conventional designs, the proposed design accelerates training by 1.893× on average and reduces power consumption by 27.68% and 43.32%, with 36.67% and 26.75% less area, for single-layer and two-layer perceptrons, respectively.
Cited by: 2
Architecture Decomposition in System Synthesis of Heterogeneous Many-Core Systems
Pub Date : 2018-06-01 DOI: 10.1145/3195970.3195995
Valentina Richthammer, T. Schwarzer, S. Wildermann, J. Teich, Michael Glass
Determining feasible application mappings for Design Space Exploration (DSE) and run-time embedding is a challenge for modern many-core systems. The underlying NP-complete system-synthesis problem faces tremendously complex problem instances due to the hundreds of heterogeneous processing elements, their communication infrastructure, and the resulting number of mapping possibilities. Thus, we propose to employ a search-space splitting (SSS) technique using architecture decomposition to increase the performance of existing design-time and run-time synthesis approaches. The technique first restricts the search for application embeddings to selected sub-architectures at substantially reduced complexity; the complete architecture needs to be searched only when no embedding is found on any sub-system. Furthermore, we introduce a basic learning mechanism to detect promising sub-architectures and subsequently restrict the search to those. We exemplify the SSS for a SAT-based and a problem-specific backtracking-based system synthesis as part of DSE for NoC-based many-core systems.
Experimental results show drastically reduced execution times (≈ 15–50× on a 24×24 architecture) and an enhanced quality of the embedding, since fewer mappings (≈ 20–40× fewer, compared to the non-decomposing procedures) need to be discarded due to a timeout.
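The search-space-splitting control flow itself is simple to express: attempt the embedding on each selected sub-architecture first, and fall back to the full architecture only when all of them fail. The function and callback names below are placeholders, and `try_embed` stands in for the SAT- or backtracking-based solver:

```python
def embed_with_sss(app, sub_architectures, full_architecture, try_embed):
    """try_embed(app, arch) returns a mapping or None.
    Sub-architectures are searched first at much lower complexity;
    the complete architecture is searched only as a last resort."""
    for sub in sub_architectures:
        mapping = try_embed(app, sub)
        if mapping is not None:
            return mapping
    return try_embed(app, full_architecture)

# toy solver: an embedding exists iff the architecture has enough free cores
toy_embed = lambda app, arch: arch if arch >= app else None
assert embed_with_sss(3, [2, 4], 10, toy_embed) == 4
assert embed_with_sss(8, [2, 4], 10, toy_embed) == 10
```

The paper's learning mechanism would additionally bias which sub-architectures appear (and in what order) in `sub_architectures`, based on past embedding successes.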
Cited by: 6
Noise-Aware DVFS Transition Sequence Optimization for Battery-Powered IoT Devices
Pub Date : 2018-06-01 DOI: 10.1145/3195970.3196080
Shaoheng Luo, Cheng Zhuo, H. Gan
Low-power systems-on-chip (SoCs) are now at the heart of Internet-of-Things (IoT) devices, which are well known for their bursty workloads and limited energy storage, usually in the form of tiny batteries. To ensure battery lifetime, DVFS has become an essential technique in such SoC chips. With continuously decreasing supply levels, noise margins in these devices are already being squeezed. During a DVFS transition, the large current that accompanies the clock-speed change runs into or out of the clock network within a few clock cycles and induces large Ldi/dt noise, stressing the power delivery network (PDN). Given the limited area and cost targets, adding extra decoupling capacitance to mitigate such noise is usually challenging. A common approach is to gradually introduce or remove clock cycles to raise or lower the clock frequency in steps, a.k.a. clock skipping. However, this technique may lengthen the DVFS transition and still cannot guarantee minimal noise. In this work, we propose a new noise-aware DVFS sequence optimization technique that formulates clock-skipping sequence optimization as a mixed 0/1 program. Moreover, the method is extended to schedule extensive wake-up activities across different clock domains for the same purpose.
The results show that we are able to achieve a minimal-noise sequence within the desired transition time, with 53% noise reduction and more than 15–17% power savings compared with the traditional approach.
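Clock skipping, as described, staircases the frequency so that the di/dt injected into the PDN at each step stays bounded. A toy schedule generator makes the idea concrete; the step budget and units are illustrative, and the paper's actual contribution is choosing step placement optimally via a mixed 0/1 program rather than using fixed steps:

```python
def clock_skip_schedule(f_start, f_target, max_step):
    """Return a stepwise frequency sequence from f_start to f_target
    in which no single step exceeds max_step, bounding the current
    surge each frequency change injects into the PDN."""
    seq = [f_start]
    f = f_start
    while f != f_target:
        # clamp the remaining distance to the per-step budget
        step = max(-max_step, min(max_step, f_target - f))
        f += step
        seq.append(f)
    return seq

# e.g. clock_skip_schedule(200, 800, 200) -> [200, 400, 600, 800]
```

Smaller `max_step` lowers the per-step Ldi/dt noise but lengthens the transition, which is exactly the trade-off the optimization formulation navigates.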
Cited by: 7
Area-Optimized Low-Latency Approximate Multipliers for FPGA-based Hardware Accelerators
Pub Date : 2018-06-01 DOI: 10.1145/3195970.3195996
Salim Ullah, Semeen Rehman, B. Prabakaran, F. Kriebel, Muhammad Abdullah Hanif, M. Shafique, Akash Kumar
The architectural differences between ASICs and FPGAs limit the effective performance gains achievable by the application of ASIC-based approximation principles for FPGA-based reconfigurable computing systems. This paper presents a novel approximate multiplier architecture customized towards the FPGA-based fabrics, an efficient design methodology, and an open-source library. Our designs provide higher area, latency and energy gains along with better output accuracy than those offered by the state-of-the-art ASIC-based approximate multipliers. Moreover, compared to the multiplier IP offered by the Xilinx Vivado, our proposed design achieves up to 30%, 53%, and 67% gains in terms of area, latency, and energy, respectively, while incurring an insignificant accuracy loss (on average, below 1% average relative error). Our library of approximate multipliers is open-source and available online at https://cfaed.tudresden.de/pd-downloads to fuel further research and development in this area, and thereby enabling a new research direction for the FPGA community.
Our library of approximate multipliers is open-source and available online at https://cfaed.tudresden.de/pd-downloads to fuel further research and development in this area, thereby enabling a new research direction for the FPGA community.
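As a flavor of the accuracy-for-cost trade such designs make, here is a generic truncated multiplier in software form. The truncation width is illustrative, and the paper's designs are FPGA LUT-level circuits, not this formula:

```python
def trunc_mult(a, b, t=4):
    """Approximate product that ignores the low t bits of one operand,
    shrinking the partial-product array in exchange for a bounded
    error of less than b * 2**t."""
    return ((a >> t) << t) * b

a, b = 1234, 57
exact = a * b
approx = trunc_mult(a, b)
assert 0 <= exact - approx < b * (1 << 4)
```

The error bound follows because only the discarded `a % 2**t` contributes to the difference; the abstract's "below 1% average relative error" is the same kind of bounded-inaccuracy claim, achieved with FPGA-specific structures.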
Cited by: 54
Long Live TIME: Improving Lifetime for Training-In-Memory Engines by Structured Gradient Sparsification
Pub Date : 2018-06-01 DOI: 10.1145/3195970.3196071
Yi Cai, Yujun Lin, Lixue Xia, Xiaoming Chen, Song Han, Yu Wang, Huazhong Yang
Deeper and larger Neural Networks (NNs) have made breakthroughs in many fields, while conventional CMOS-based computing platforms struggle to achieve higher energy efficiency. RRAM-based systems provide a promising solution for building efficient Training-In-Memory Engines (TIME). However, the endurance of RRAM cells is limited, which is a severe issue because NN weights need to be updated thousands to millions of times during training. Gradient sparsification can address this problem by dropping most of the smaller gradients, but it introduces unacceptable computation cost. We propose an effective framework, SGS-ARS, combining Structured Gradient Sparsification (SGS) and an Aging-aware Row Swapping (ARS) scheme, to guarantee write balance across whole RRAM crossbars and prolong the lifetime of TIME. Our experiments demonstrate a 356× lifetime extension when TIME is programmed to train ResNet-50 on the ImageNet dataset with our SGS-ARS framework.
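Structured sparsification, keeping whole rows of the gradient (matching crossbar word-lines) and zeroing the rest so that few RRAM rows are written per step, can be sketched as follows. The row granularity, norm choice, and keep-count are illustrative, not the paper's exact SGS rule:

```python
def sparsify_rows(grad, keep):
    """Keep the `keep` rows with the largest L1 norm and zero the rest,
    so each weight update writes only a few crossbar rows."""
    norms = [sum(abs(g) for g in row) for row in grad]
    top = set(sorted(range(len(grad)), key=norms.__getitem__,
                     reverse=True)[:keep])
    return [row[:] if i in top else [0.0] * len(row)
            for i, row in enumerate(grad)]

g = [[0.1, -0.2], [1.5, 0.3], [0.0, 0.05], [-0.9, 0.8]]
sg = sparsify_rows(g, keep=2)  # only the two largest-norm rows survive
```

The ARS half of the framework would then periodically remap which physical crossbar rows absorb these frequent writes, balancing wear across the array.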
Cited by: 36