Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07): latest articles, page 5

Pareto-Optimal Power- and Cache-Aware Task Mapping for Many-Cores with Distributed Shared Last-Level Cache

Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)

Pub Date : 2018-07-23 DOI: 10.1145/3218603.3218630

Martin Rapp, A. Pathania, J. Henkel

Two factors primarily affect the performance of multi-threaded tasks on many-core processors with a shared, physically distributed Last-Level Cache (LLC): the power budget of a given task mapping, which is set to guarantee thermally safe operation, and the non-uniform LLC access latency of threads running on different cores. Spatially distributing threads across the many-core increases the power budget, but also increases the associated LLC latency. Conversely, mapping more threads to cores near the center of the many-core decreases the LLC latency, but also decreases the power budget. Consequently, the two metrics (LLC latency and power budget) cannot be simultaneously optimal, which leads to a Pareto optimization that has not previously been exploited. We are the first to present a run-time task mapping algorithm, called PCMap, that exploits this trade-off. Our approach results in up to 8.6% reduction in the average task response time, accompanied by a reduction of up to 8.5% in energy consumption, compared to the state of the art.
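
The trade-off described above can be illustrated with a small Pareto filter. This is only a sketch under stated assumptions, not the PCMap algorithm itself: the `Mapping` type, the candidate names, and all numbers are hypothetical, and the abstract does not say how PCMap enumerates or ranks mappings. The sketch only shows what it means for a mapping to be Pareto-optimal when LLC latency should be low and power budget should be high.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    """A hypothetical candidate mapping with its two competing metrics."""
    name: str
    llc_latency: float   # average LLC access latency (lower is better)
    power_budget: float  # thermally safe power budget (higher is better)


def dominates(a: Mapping, b: Mapping) -> bool:
    """True if a is no worse than b in both metrics and strictly better in one."""
    no_worse = a.llc_latency <= b.llc_latency and a.power_budget >= b.power_budget
    better = a.llc_latency < b.llc_latency or a.power_budget > b.power_budget
    return no_worse and better


def pareto_front(candidates: list[Mapping]) -> list[Mapping]:
    """Keep only the mappings that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Illustrative values only (not taken from the paper).
candidates = [
    Mapping("centered", llc_latency=10.0, power_budget=40.0),
    Mapping("spread", llc_latency=18.0, power_budget=65.0),
    Mapping("mixed", llc_latency=13.0, power_budget=55.0),
    Mapping("poor", llc_latency=19.0, power_budget=45.0),
]
front = pareto_front(candidates)
print([m.name for m in front])
```

In this toy data set the "poor" mapping is dominated by "mixed" (lower latency and higher budget), while the other three remain on the front because improving one metric always costs the other.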

Citations: 10

Breaking POps/J Barrier with Analog Multiplier Circuits Based on Nonvolatile Memories

Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)

Pub Date : 2018-07-23 DOI: 10.1145/3218603.3218613

M. Mahmoodi, D. Strukov

Low-to-medium resolution analog vector-by-matrix multipliers (VMMs) offer remarkable energy and area efficiency compared to their digital counterparts. Still, the maximum attainable performance of analog VMMs is often bounded by the overhead of the peripheral circuits. The main contribution of this paper is the design of novel sensing circuitry that improves the energy efficiency and density of analog multipliers. The proposed circuit is based on a translinear Gilbert cell, which is topologically combined with a floating nonlinear resistor and a low-gain amplifier. Several compensation techniques are employed to ensure reliability with respect to process, temperature, and supply voltage variations. As a case study, we consider the implementation of a couple-gate current-mode VMM with embedded split-gate NOR flash memory. Our simulation results show that a 4-bit 100x100 VMM circuit designed in 55 nm CMOS technology achieves a record-breaking performance of 3.63 POps/J.
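
To put the reported figure in perspective, the short calculation below converts 3.63 POps/J into energy per operation and into the energy of one full 100x100 vector-by-matrix product. The efficiency and the matrix size come from the abstract; counting each multiply-accumulate as two operations is an assumption made here for illustration, since the abstract does not state its counting convention.

```python
POPS_PER_JOULE = 3.63e15   # reported efficiency: 3.63 POps/J
ROWS, COLS = 100, 100      # reported VMM size
OPS_PER_MAC = 2            # assumption: one multiply plus one add per MAC


def energy_per_op_fj(ops_per_joule: float) -> float:
    """Energy of a single operation in femtojoules."""
    return 1e15 / ops_per_joule


def vmm_energy_pj(rows: int, cols: int, ops_per_mac: int,
                  ops_per_joule: float) -> float:
    """Energy of one full vector-by-matrix product in picojoules."""
    ops = rows * cols * ops_per_mac
    return ops * energy_per_op_fj(ops_per_joule) / 1000.0


per_op = energy_per_op_fj(POPS_PER_JOULE)
per_vmm = vmm_energy_pj(ROWS, COLS, OPS_PER_MAC, POPS_PER_JOULE)
print(f"{per_op:.3f} fJ per operation, {per_vmm:.2f} pJ per 100x100 VMM")
```

Under these assumptions, one operation costs roughly a quarter of a femtojoule.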

Citations: 6

App-Oriented Thermal Management of Mobile Devices

Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)

Pub Date : 2018-07-23 DOI: 10.1145/3218603.3218622

Jihoon Park, Seokjun Lee, H. Cha

The thermal issue for mobile devices becomes critical as the devices' performance increases to handle complicated applications. Conventional thermal management limits the performance of the entire device, degrading the quality of both foreground and background applications. This is not desirable because the quality of the foreground application, i.e., the frames per second (FPS), is directly affected, whereas users are generally not aware of the performance of background applications. In this paper, we propose an app-oriented thermal management scheme that specifically restricts background applications to preserve the FPS of foreground applications. For efficient thermal management, we developed a model that predicts the heat contribution of individual applications based on hardware utilization. The proposed system gradually limits system resources for each background application according to its heat contribution. The scheme was implemented on a Galaxy S8+ smartphone, and its usefulness was validated with a thorough evaluation.
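
The idea of restricting background applications in proportion to their heat contribution can be sketched as follows. This is a minimal illustration, not the authors' implementation: the linear utilization-to-heat model, the weights, the app names, and the single-step split of a total reduction are all assumptions, and the abstract describes a gradual limiting process that is not reproduced here.

```python
def heat_contribution(util: dict[str, float], weights: dict[str, float]) -> float:
    """Estimate an app's heat contribution as a weighted sum of utilizations.

    Assumption: heat grows linearly with per-unit hardware utilization.
    """
    return sum(weights[unit] * util.get(unit, 0.0) for unit in weights)


def background_limits(apps: dict[str, dict[str, float]],
                      weights: dict[str, float],
                      foreground: str,
                      reduction: float) -> dict[str, float]:
    """Split a total resource reduction across background apps by heat share.

    Returns app name -> fraction of resources to take away (0.0 to 1.0).
    The foreground app is never restricted.
    """
    heat = {name: heat_contribution(util, weights)
            for name, util in apps.items() if name != foreground}
    total = sum(heat.values())
    if total == 0.0:
        return {name: 0.0 for name in heat}
    return {name: min(1.0, reduction * h / total) for name, h in heat.items()}


# Hypothetical weights and utilizations, for illustration only.
weights = {"cpu": 2.0, "gpu": 3.0}
apps = {
    "game": {"cpu": 0.5, "gpu": 0.9},     # foreground app
    "sync": {"cpu": 0.4, "gpu": 0.0},     # background app
    "updater": {"cpu": 0.1, "gpu": 0.0},  # background app
}
limits = background_limits(apps, weights, foreground="game", reduction=0.5)
print(limits)
```

Here the "sync" app produces four times the estimated heat of "updater", so it absorbs four times as much of the reduction, while the foreground game is left untouched.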

Citations: 5

Dual Mode Ferroelectric Transistor based Non-Volatile Flip-Flops for Intermittently-Powered Systems

Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)

Pub Date : 2018-07-23 DOI: 10.1145/3218603.3218653

S. Thirumala, Arnab Raha, H. Jayakumar, Kaisheng Ma, N. Vijaykrishnan, V. Raghunathan, S. Gupta

In this work, we propose dual mode ferroelectric transistors (D-FEFETs) that exhibit dynamic tuning of operation between volatile and non-volatile modes with the help of a control signal. We utilize the unique features of the D-FEFET to design two variants of non-volatile flip-flops (NVFFs). In both designs, D-FEFETs are operated in the volatile mode for normal operation and in the non-volatile mode to back up the state of the flip-flop during a power outage. The first design comprises a truly embedded non-volatile element (D-FEFET), which enables a fully automatic backup operation. In the second design, we introduce need-based backup, which lowers energy during normal operation at the cost of area with respect to the first design. Compared to a previously proposed FEFET-based NVFF, the first design achieves a 19% area reduction along with 96% lower backup energy and 9% lower restore energy, but at 14%-35% larger operation energy. The second design shows 11% lower area, 21% lower backup energy, a 16% decrease in backup delay, and similar operation energy, but with penalties of 17% and 19% in the restore energy and delay, respectively. System-level analysis of the proposed NVFFs in the context of a state-of-the-art intermittently-powered system using real benchmarks yielded 5%-33% energy savings.

Citations: 17

ACE-GPU: Tackling Choke Point Induced Performance Bottlenecks in a Near-Threshold Computing GPU

Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)

Pub Date : 2018-07-23 DOI: 10.1145/3218603.3218644

Tahmoures Shabanian, Aatreyi Bal, Prabal Basu, Koushik Chakraborty, Sanghamitra Roy

The proliferation of multicore devices with a strict thermal budget has spurred research in Near-Threshold Computing (NTC). However, the operation of a Graphics Processing Unit (GPU) in the NTC region has remained poorly understood. In this work, we explore an important reliability predicament of NTC, called choke points, that severely throttles the performance of GPUs. Employing a cross-layer methodology, we demonstrate the potency of choke points in inducing timing errors in a GPU operating in the NTC region. We propose a holistic circuit-architectural solution that promotes an energy-efficient NTC-GPU design paradigm by gracefully tackling the choke point induced timing errors. Our proposed scheme offers 3.18x and 88.5% improvements in NTC-GPU performance and energy-delay product, respectively, over a state-of-the-art timing error mitigation technique, with marginal area and power overheads.

Citations: 7

Better-Than-Worst-Case Design Methodology for a Compact Integrated Switched-Capacitor DC-DC Converter

Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)

Pub Date : 2018-07-23 DOI: 10.1145/3218603.3218610

Dongkwun Kim, Mingoo Seok

We suggest a new methodology for co-designing an integrated switched-capacitor (SC) converter and a digital load. Conventionally, a load has been specified by its minimum supply voltage and its maximum power dissipation, each found at its own worst-case process, workload, and environment condition. Furthermore, in designing an SC DC-DC converter toward this worst-case load specification, designers have often added another separate pessimistic assumption on the power-switch resistance and flying-capacitor density of the SC converter. Such a worst-case design methodology can lead to a significantly over-sized flying capacitor and thereby limit on-chip integration of the converter. Our proposed methodology instead adopts the better-than-worst-case (BTWC) perspective to avoid over-design and thus optimizes the area of an SC converter. Specifically, we propose BTWC load modeling, where we specify non-pessimistic sets of supply voltage requirement and load power dissipation across variations. In addition, by considering coupled variations between the SC converter and the load integrated in the same die, our methodology can further reduce the pessimism in power-switch resistance and capacitor density. The proposed co-design methodology is verified with a 2:1 SC converter and a digital load in a 65 nm process. The resulting converter achieves more than one order of magnitude reduction in the flying capacitor size compared to the conventional worst-case design while maintaining the target conversion efficiency and target throughput. We also verified our methodology with a wide range of load characteristics in terms of supply voltage and current draw and confirmed similar benefits.

Citations: 1

Across the Stack Opportunities for Deep Learning Acceleration

Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)

Pub Date : 2018-07-23 DOI: 10.1145/3218603.3241339

V. Srinivasan, B. Fleischer, Sunil Shukla, M. Ziegler, J. Silberman, Jinwook Oh, Jungwook Choi, S. Mueller, A. Agrawal, Tina Babinsky, N. Cao, Chia-Yu Chen, P. Chuang, T. Fox, G. Gristede, Michael Guillorn, Howard Haynie, M. Klaiber, Dongsoo Lee, S. Lo, G. Maier, M. Scheuermann, Swagath Venkataramani, Christos Vezyrtzis, Naigang Wang, F. Yee, Ching Zhou, P. Lu, B. Curran, Leland Chang, K. Gopalakrishnan

The combination of growth in compute capabilities and availability of large datasets has led to a rebirth of deep learning. Deep Neural Networks (DNNs) have become state-of-the-art in a variety of machine learning tasks spanning domains across vision, speech, and machine translation. Deep Learning (DL) achieves high accuracy in these tasks at the expense of 100s of ExaOps of computation, posing significant challenges to efficient large-scale deployment in both resource-constrained environments and data centers. One of the key enablers to improve operational efficiency of DNNs is the observation that when extracting deep insight from vast quantities of structured and unstructured data, the exactness imposed by traditional computing is not required. Relaxing the "exactness" constraint enables exploiting opportunities for approximate computing across all layers of the system stack. In this talk we present a multi-TOPS AI core [3] for acceleration of deep learning training and inference in systems from edge devices to data centers. We demonstrate that deriving high sustained utilization and energy efficiency from the AI core requires ground-up re-thinking to exploit approximate computing across the stack, including algorithms, architecture, programmability, and hardware. Model accuracy is the fundamental measure of deep learning quality. The compute engine precision in our AI core is carefully calibrated to realize significant reduction in area and power while not compromising numerical accuracy. Our research at the DL algorithms/applications level [2] shows that it is possible to carefully tune the precision of both weights and activations to as low as 2 bits for inference, a result that was used to guide the choices of compute precision supported in the architecture and hardware for both training and inference. Similarly, distributed DL training's scalability is impacted by the communication overhead to exchange gradients and weights after each mini-batch. Our research on gradient compression [1] shows that by selectively sending gradients larger than a threshold, and by further choosing the threshold based on the importance of the gradient, we achieve a compression ratio of 40X for convolutional layers and up to 200X for fully-connected layers of the network without losing model accuracy. These results guide the choice of interconnection network topology for a system of accelerators built using the AI core. Overall, our work shows how the benefits of exploiting approximation, using the algorithm's or application's robustness to tolerate reduced precision and compressed data communication, can be combined effectively with the architecture and hardware of an accelerator designed to support this reduced-precision computation and compressed data communication. Our results demonstrate improved end-to-end efficiency of the DL accelerator across different metrics such as high sustained TOPs, high TOPs/watt and TOPs/mm², catering to different operating environments for both training and inference.
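
The threshold-based gradient selection mentioned above can be sketched in a few lines. This is an illustrative sketch only, not the scheme from reference [1]: the fixed threshold, the function names, and the example values are hypothetical, and keeping the unsent values in a local residual is an assumption added here (a common companion technique) that the abstract does not confirm. The abstract's importance-based threshold selection is not modeled.

```python
def compress_gradients(grads: list[float], threshold: float):
    """Select gradients whose magnitude exceeds the threshold.

    Returns (sent, residual): sent is a list of (index, value) pairs to
    transmit; residual holds the values kept locally (assumption: they would
    be accumulated into the next step rather than discarded).
    """
    sent = [(i, g) for i, g in enumerate(grads) if abs(g) > threshold]
    sent_idx = {i for i, _ in sent}
    residual = [0.0 if i in sent_idx else g for i, g in enumerate(grads)]
    return sent, residual


def compression_ratio(num_grads: int, num_sent: int) -> float:
    """Dense element count over transmitted element count (index overhead ignored)."""
    return num_grads / num_sent if num_sent else float("inf")


# Illustrative gradient values, not measured data.
grads = [0.01, -0.5, 0.02, 0.0, 0.9, -0.03, 0.04, -0.01]
sent, residual = compress_gradients(grads, threshold=0.1)
ratio = compression_ratio(len(grads), len(sent))
print(sent, ratio)
```

Only two of the eight values cross the threshold in this toy example, so the element-count ratio is 4; a real measurement would also account for the cost of sending the indices.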

Citations: 0

Multi-Pattern Active Cell Balancing Architecture and Equalization Strategy for Battery Packs

Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)

Pub Date : 2018-07-23 DOI: 10.1145/3218603.3218607

Swaminathan Narayanaswamy, Sangyoung Park, S. Steinhorst, S. Chakraborty

Active cell balancing is the process of improving the usable capacity of a series-connected Lithium-Ion (Li-Ion) battery pack by redistributing the charge levels of individual cells. Depending upon the State-of-Charge (SoC) distribution of the individual cells in the pack, an appropriate charge transfer pattern (cell-to-cell, cell-to-module, module-to-cell or module-to-module) has to be selected to improve the usable energy of the battery pack. However, existing active cell balancing circuits are only capable of performing a limited number of charge transfer patterns and, therefore, have a reduced energy efficiency for different types of SoC distribution. In this paper, we propose a modular, multi-pattern active cell balancing architecture that is capable of performing multiple types of charge transfer patterns (cell-to-cell, cell-to-module, module-to-cell and module-to-module) with a reduced number of hardware components and control signals compared to existing solutions. We derive a closed-form, analytical model of our proposed balancing architecture, with which we profile the efficiency of the individual charge transfer patterns enabled by our architecture. Using the profiling analysis, we propose a hybrid charge equalization strategy that automatically selects the most energy-efficient charge transfer pattern depending upon the SoC distribution of the battery pack and the characteristics of our proposed balancing architecture. Case studies show that our proposed balancing architecture and hybrid charge equalization strategy provide up to 46.83% improvement in energy efficiency compared to existing solutions.
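
Why the efficiency of the chosen transfer pattern matters can be shown with a toy model. The sketch below is not the paper's analytical model: it assumes a single lumped transfer efficiency per pattern, identical cell capacities, and made-up efficiency values. It finds the SoC level at which charge taken from the fuller cells, after losses, exactly covers what the emptier cells need, and then picks the pattern with the highest resulting level.

```python
def equalized_soc(soc: list[float], efficiency: float, iters: int = 60) -> float:
    """Bisection for the common SoC level reachable with lossy charge transfer.

    At level t, cells above t give (s - t) each, scaled by the efficiency;
    cells below t need (t - s) each. The highest feasible t is returned.
    """
    lo, hi = min(soc), max(soc)
    for _ in range(iters):
        t = (lo + hi) / 2
        given = sum(s - t for s in soc if s > t) * efficiency
        needed = sum(t - s for s in soc if s < t)
        if given >= needed:
            lo = t   # enough charge is available, so the level can be raised
        else:
            hi = t
    return lo


def select_pattern(soc: list[float], pattern_efficiency: dict[str, float]) -> str:
    """Pick the pattern whose efficiency yields the highest equalized SoC."""
    return max(pattern_efficiency,
               key=lambda p: equalized_soc(soc, pattern_efficiency[p]))


soc = [0.9, 0.5, 0.7, 0.5]   # hypothetical per-cell SoC values
# Hypothetical efficiencies for this SoC distribution, for illustration only.
pattern_efficiency = {
    "cell-to-cell": 0.80,
    "cell-to-module": 0.85,
    "module-to-cell": 0.75,
    "module-to-module": 0.70,
}
best = select_pattern(soc, pattern_efficiency)
print(best, round(equalized_soc(soc, pattern_efficiency[best]), 4))
```

With lossless transfer the cells would meet at the mean SoC of 0.65; any loss pulls the equalized level below that, which is why a strategy that selects the most efficient applicable pattern recovers more usable energy.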

Citations: 8

Dynamic Bit-width Reconfiguration for Energy-Efficient Deep Learning Hardware

Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)

Pub Date : 2018-07-23 DOI: 10.1145/3218603.3218611

D. J. Pagliari, E. Macii, M. Poncino

Deep learning models have reached state-of-the-art performance in many machine learning tasks. Benefits in terms of energy, bandwidth, latency, etc., can be obtained by evaluating these models directly within Internet of Things end nodes, rather than in the cloud. This calls for implementations of deep learning tasks that can run in resource-limited environments with low energy footprints. Research and industry have recently investigated these aspects, coming up with specialized hardware accelerators for low-power deep learning. One effective technique adopted in these devices consists in reducing the bit-width of calculations, exploiting the error resilience of deep learning. However, bit-widths are typically set statically for a given model, regardless of input data. Unless models are retrained, this solution invariably sacrifices accuracy for energy efficiency. In this paper, we propose a new approach for implementing input-dependent dynamic bit-width reconfiguration in deep learning accelerators. Our method is based on a fully automatic characterization phase, and can be applied to popular models without retraining. Using the energy data from a real deep learning accelerator chip, we show that a 50% energy reduction can be achieved with respect to a static bit-width selection, with less than 1% accuracy loss.
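
The idea of input-dependent bit-width selection can be illustrated with a small sketch. This is not the authors' method: the abstract does not describe the selection rule, so the code below assumes a simple, hypothetical policy in which a linear classifier is evaluated at the cheapest bit-width first and re-evaluated at a wider one only when the gap between the two best class scores falls below a threshold. The names `quantize`, `scores`, `classify_dynamic`, the candidate widths, and the `margin` parameter are all illustrative assumptions.

```python
def quantize(values, bits, max_abs=1.0):
    """Uniformly quantize values in [-max_abs, max_abs] to a signed `bits`-bit grid."""
    levels = 2 ** (bits - 1) - 1            # e.g. 7 positive levels for 4 bits
    step = max_abs / levels
    out = []
    for v in values:
        clipped = max(-max_abs, min(max_abs, v))
        out.append(round(clipped / step) * step)
    return out


def scores(weights, x, bits):
    """Per-class scores of a linear classifier with weights and input quantized to `bits`."""
    xq = quantize(x, bits)
    return [sum(w * xi for w, xi in zip(quantize(row, bits), xq)) for row in weights]


def classify_dynamic(weights, x, bit_widths=(4, 8, 16), margin=0.1):
    """Assumed policy: start at the narrowest width, widen while the top-2 gap is small.

    Returns (predicted class index, bit-width actually used).
    """
    for bits in bit_widths:
        s = scores(weights, x, bits)
        ranked = sorted(s, reverse=True)
        confident = ranked[0] - ranked[1] >= margin
        if confident or bits == bit_widths[-1]:
            return s.index(ranked[0]), bits


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]            # toy two-class model
    print(classify_dynamic(W, [0.9, 0.1]))   # clearly separated input
    print(classify_dynamic(W, [0.55, 0.52])) # ambiguous input
```

Under this assumed policy, inputs that are easy to classify stay at the narrow width, while ambiguous inputs pay for extra precision. A static scheme would instead have to pick one width for all inputs.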

Citations: 27

A Monolithic-3D SRAM Design with Enhanced Robustness and In-Memory Computation Support

Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)

Pub Date : 2018-07-23 DOI: 10.1145/3218603.3218645

S. Srinivasa, A. Ramanathan, Xueqing Li, Wei-Hao Chen, F. Hsueh, Chih-Chao Yang, C. Shen, J. Shieh, S. Gupta, Meng-Fan Chang, Swaroop Ghosh, J. Sampson, N. Vijaykrishnan

We present a novel 3D-SRAM cell using monolithic 3D integration (M3D-IC) technology to provide both robustness and in-memory Boolean logic compute support. The proposed two-layer design uses additional transistors over the SRAM layer to enable assist techniques and to provide logic functions (such as AND/NAND, OR/NOR, XNOR/XOR) without degrading cell density. Through analysis, we provide insights into the benefits of three memory-assist modes and two logic modes and evaluate the energy efficiency of our proposed design. Assist techniques improve SRAM read stability by 2.2x and increase the write margin by 17.6%, while staying within the SRAM footprint. By virtue of increased robustness, the cell enables seamless operation at lower supply voltages and thereby ensures energy efficiency. The energy-delay product (EDP) is reduced by 1.6x relative to a standard 6T SRAM, with faster data access. Transistor placement and biasing in layer 2 enable in-memory bitwise Boolean computation. For bulk in-memory operations, 6.5x energy savings are achieved compared to computing outside the memory system.
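
As a purely functional illustration of the bitwise operations listed above, the sketch below models two stored SRAM rows as integers and returns each logic function together with its complement. It says nothing about the circuit, its timing, or its energy. The function name `in_memory_logic` and the `width` parameter are hypothetical and do not come from the paper.

```python
def in_memory_logic(row_a, row_b, width=8):
    """Bitwise results for two `width`-bit rows, each with its complement.

    Models only the logical outcome of combining two stored words bit by bit.
    """
    mask = (1 << width) - 1                 # keep results within the row width
    a, b = row_a & mask, row_b & mask
    and_, or_, xor_ = a & b, a | b, a ^ b
    return {
        "AND": and_, "NAND": ~and_ & mask,
        "OR": or_, "NOR": ~or_ & mask,
        "XOR": xor_, "XNOR": ~xor_ & mask,
    }


if __name__ == "__main__":
    result = in_memory_logic(0b1100, 0b1010, width=4)
    for name, value in result.items():
        print(f"{name:5s} {value:04b}")
```

Masking with `width` matters because Python integers are unbounded, so a plain `~x` would yield a negative number rather than the complement of a fixed-width row.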

Citations: 19