Two factors primarily affect performance of multi-threaded tasks on many-core processors with both shared and physically distributed Last-Level Cache (LLC): the power budget associated with a certain task mapping that aims to guarantee thermally safe operation and the non-uniform LLC access latency of threads running on different cores. Spatially distributing threads across the many-core increases the power budget, but unfortunately also increases the associated LLC latency. On the other side, mapping more threads to cores near the center of the many-core decreases the LLC latency, but unfortunately also decreases the power budget. Consequently, both metrics (LLC latency and power budget) cannot be simultaneously optimal, which leads to a Pareto-optimization that has formerly not been exploited. We are the first to present a run-time task mapping algorithm called PCMap that exploits this trade-off. Our approach results in up to 8.6% reduction in the average task response time accompanied by a reduction of up to 8.5% in the energy consumption compared to the state-of-the-art.
{"title":"Pareto-Optimal Power- and Cache-Aware Task Mapping for Many-Cores with Distributed Shared Last-Level Cache","authors":"Martin Rapp, A. Pathania, J. Henkel","doi":"10.1145/3218603.3218630","DOIUrl":"https://doi.org/10.1145/3218603.3218630","url":null,"abstract":"Two factors primarily affect performance of multi-threaded tasks on many-core processors with both shared and physically distributed Last-Level Cache (LLC): the power budget associated with a certain task mapping that aims to guarantee thermally safe operation and the non-uniform LLC access latency of threads running on different cores. Spatially distributing threads across the many-core increases the power budget, but unfortunately also increases the associated LLC latency. On the other side, mapping more threads to cores near the center of the many-core decreases the LLC latency, but unfortunately also decreases the power budget. Consequently, both metrics (LLC latency and power budget) cannot be simultaneously optimal, which leads to a Pareto-optimization that has formerly not been exploited. We are the first to present a run-time task mapping algorithm called PCMap that exploits this trade-off. Our approach results in up to 8.6% reduction in the average task response time accompanied by a reduction of up to 8.5% in the energy consumption compared to the state-of-the-art.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88135008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low-to-medium resolution analog vector-by-matrix multipliers (VMMs) offer a remarkable energy/area efficiency as compared to their digital counterparts. Still, the maximum attainable performance in analog VMMs is often bounded by the overhead of the peripheral circuits. The main contribution of this paper is the design of novel sensing circuitry which improves energy-efficiency and density of analog multipliers. The proposed circuit is based on translinear Gilbert cell, which is topologically combined with a floating nonlinear resistor and a low-gain amplifier. Several compensation techniques are employed to ensure reliability with respect to process, temperature, and supply voltage variations. As a case study, we consider implementation of couple-gate current-mode VMM with embedded split-gate NOR flash memory. Our simulation results show that a 4-bit 100x100 VMM circuit designed in 55 nm CMOS technology achieves the record-breaking performance of 3.63 POps/J.
{"title":"Breaking POps/J Barrier with Analog Multiplier Circuits Based on Nonvolatile Memories","authors":"M. Mahmoodi, D. Strukov","doi":"10.1145/3218603.3218613","DOIUrl":"https://doi.org/10.1145/3218603.3218613","url":null,"abstract":"Low-to-medium resolution analog vector-by-matrix multipliers (VMMs) offer a remarkable energy/area efficiency as compared to their digital counterparts. Still, the maximum attainable performance in analog VMMs is often bounded by the overhead of the peripheral circuits. The main contribution of this paper is the design of novel sensing circuitry which improves energy-efficiency and density of analog multipliers. The proposed circuit is based on translinear Gilbert cell, which is topologically combined with a floating nonlinear resistor and a low-gain amplifier. Several compensation techniques are employed to ensure reliability with respect to process, temperature, and supply voltage variations. As a case study, we consider implementation of couple-gate current-mode VMM with embedded split-gate NOR flash memory. Our simulation results show that a 4-bit 100x100 VMM circuit designed in 55 nm CMOS technology achieves the record-breaking performance of 3.63 POps/J.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75426556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The thermal issue for mobile devices becomes critical as the devices' performance increases to handle complicated applications. Conventional thermal management limits the performance of the entire device, degrading the quality of both foreground and background applications. This is not desirable because the quality of the foreground application, i.e., the frames per second (FPS), is directly affected, whereas users are generally not aware of the performance of background applications. In this paper, we propose an app-oriented thermal management scheme that specifically restricts background applications to preserve the FPS of foreground applications. For efficient thermal management, we developed a model that predicts the heat contribution of individual applications based on hardware utilization. The proposed system gradually limits system resources for each background application according to its heat contribution. The scheme was implemented on a Galaxy S8+ smartphone, and its usefulness was validated with a thorough evaluation.
{"title":"App-Oriented Thermal Management of Mobile Devices","authors":"Jihoon Park, Seokjun Lee, H. Cha","doi":"10.1145/3218603.3218622","DOIUrl":"https://doi.org/10.1145/3218603.3218622","url":null,"abstract":"The thermal issue for mobile devices becomes critical as the devices' performance increases to handle complicated applications. Conventional thermal management limits the performance of the entire device, degrading the quality of both foreground and background applications. This is not desirable because the quality of the foreground application, i.e., the frames per second (FPS), is directly affected, whereas users are generally not aware of the performance of background applications. In this paper, we propose an app-oriented thermal management scheme that specifically restricts background applications to preserve the FPS of foreground applications. For efficient thermal management, we developed a model that predicts the heat contribution of individual applications based on hardware utilization. The proposed system gradually limits system resources for each background application according to its heat contribution. The scheme was implemented on a Galaxy S8+ smartphone, and its usefulness was validated with a thorough evaluation.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73987310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Thirumala, Arnab Raha, H. Jayakumar, Kaisheng Ma, N. Vijaykrishnan, V. Raghunathan, S. Gupta
In this work, we propose dual mode ferroelectric transistors (D-FEFETs) that exhibit dynamic tuning of operation between volatile and non-volatile modes with the help of a control signal. We utilize the unique features of D-FEFET to design two variants of non-volatile flip-flops (NVFFs). In both designs, D-FEFETs are operated in the volatile mode for normal operations and in the non-volatile mode to backup the state of the flip-flop during a power outage. The first design comprises of a truly embedded non-volatile element (D-FEFET) which enables a fully automatic backup operation. In the second design, we introduce need-based backup, which lowers energy during normal operation at the cost of area with respect to the first design. Compared to a previously proposed FEFET based NVFF, the first design achieves 19% area reduction along with 96% lower backup energy and 9% lower restore energy, but at 14%-35% larger operation energy. The second design shows 11% lower area, 21% lower backup energy, 16% decrease in backup delay and similar operation energy but with a penalty of 17% and 19% in the restore energy and delay, respectively. System-level analysis of the proposed NVFFs in context of a state-of-the-art intermittently-powered system using real benchmarks yielded 5%-33% energy savings.
{"title":"Dual Mode Ferroelectric Transistor based Non-Volatile Flip-Flops for Intermittently-Powered Systems","authors":"S. Thirumala, Arnab Raha, H. Jayakumar, Kaisheng Ma, N. Vijaykrishnan, V. Raghunathan, S. Gupta","doi":"10.1145/3218603.3218653","DOIUrl":"https://doi.org/10.1145/3218603.3218653","url":null,"abstract":"In this work, we propose dual mode ferroelectric transistors (D-FEFETs) that exhibit dynamic tuning of operation between volatile and non-volatile modes with the help of a control signal. We utilize the unique features of D-FEFET to design two variants of non-volatile flip-flops (NVFFs). In both designs, D-FEFETs are operated in the volatile mode for normal operations and in the non-volatile mode to backup the state of the flip-flop during a power outage. The first design comprises of a truly embedded non-volatile element (D-FEFET) which enables a fully automatic backup operation. In the second design, we introduce need-based backup, which lowers energy during normal operation at the cost of area with respect to the first design. Compared to a previously proposed FEFET based NVFF, the first design achieves 19% area reduction along with 96% lower backup energy and 9% lower restore energy, but at 14%-35% larger operation energy. The second design shows 11% lower area, 21% lower backup energy, 16% decrease in backup delay and similar operation energy but with a penalty of 17% and 19% in the restore energy and delay, respectively. System-level analysis of the proposed NVFFs in context of a state-of-the-art intermittently-powered system using real benchmarks yielded 5%-33% energy savings.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76117703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tahmoures Shabanian, Aatreyi Bal, Prabal Basu, Koushik Chakraborty, Sanghamitra Roy
The proliferation of multicore devices with a strict thermal budget has aided to the research in Near-Threshold Computing (NTC). However, the operation of a Graphics Processing Unit (GPU) at the NTC region has still remained recondite. In this work, we explore an important reliability predicament of NTC, called choke points, that severely throttles the performance of GPUs. Employing a cross-layer methodology, we demonstrate the potency of choke points in inducing timing errors in a GPU, operating at the NTC region. We propose a holistic circuit-architectural solution, that promotes an energy-efficient NTC-GPU design paradigm by gracefully tackling the choke point induced timing errors. Our proposed scheme offers 3.18x and 88.5% improvements in NTC-GPU performance and energy delay product, respectively, over a state-of-the-art timing error mitigation technique, with marginal area and power overheads.
{"title":"ACE-GPU: Tackling Choke Point Induced Performance Bottlenecks in a Near-Threshold Computing GPU","authors":"Tahmoures Shabanian, Aatreyi Bal, Prabal Basu, Koushik Chakraborty, Sanghamitra Roy","doi":"10.1145/3218603.3218644","DOIUrl":"https://doi.org/10.1145/3218603.3218644","url":null,"abstract":"The proliferation of multicore devices with a strict thermal budget has aided to the research in Near-Threshold Computing (NTC). However, the operation of a Graphics Processing Unit (GPU) at the NTC region has still remained recondite. In this work, we explore an important reliability predicament of NTC, called choke points, that severely throttles the performance of GPUs. Employing a cross-layer methodology, we demonstrate the potency of choke points in inducing timing errors in a GPU, operating at the NTC region. We propose a holistic circuit-architectural solution, that promotes an energy-efficient NTC-GPU design paradigm by gracefully tackling the choke point induced timing errors. Our proposed scheme offers 3.18x and 88.5% improvements in NTC-GPU performance and energy delay product, respectively, over a state-of-the-art timing error mitigation technique, with marginal area and power overheads.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79413019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We suggest a new methodology in co-designing an integrated switched-capacitor converter and a digital load. Conventionally, a load has been specified to the minimum supply voltage and the maximum power dissipation, each found at her own worst-case process, workload, and environment condition. Furthermore, in designing an SC DC-DC converter toward this worst-case load specification, designers often have been adding another separate pessimistic assumption on power-switch's resistance and flying-capacitor's density of an SC converter. Such worst-case design methodology can lead to a significantly over-sized flying capacitor and thereby limit on-chip integration of a converter. Our proposed methodology instead adopts the better than worst-case (BTWC) perspective to avoid over-design and thus optimizes the area of an SC converter. Specifically, we propose BTWC load modeling where we specify non-pessimistic sets of supply voltage requirement and load power dissipation across variations. In addition, by considering coupled variations between the SC converter and the load integrated in the same die, our methodology can further reduce the pessimism in power-switch's resistance and capacitor density. The proposed co-design methodology is verified with a 2:1 SC converter and a digital load in a 65 nm. The resulted converter achieves more than one order of magnitude reduction in the flying capacitor size as compared to the conventional worst-case design while maintaining the target conversion efficiency and target throughput. We also verified our methodology with a wide range of load characteristics in terms of their supply voltages and current draw and confirmed the similar benefits.
{"title":"Better-Than-Worst-Case Design Methodology for a Compact Integrated Switched-Capacitor DC-DC Converter","authors":"Dongkwun Kim, Mingoo Seok","doi":"10.1145/3218603.3218610","DOIUrl":"https://doi.org/10.1145/3218603.3218610","url":null,"abstract":"We suggest a new methodology in co-designing an integrated switched-capacitor converter and a digital load. Conventionally, a load has been specified to the minimum supply voltage and the maximum power dissipation, each found at her own worst-case process, workload, and environment condition. Furthermore, in designing an SC DC-DC converter toward this worst-case load specification, designers often have been adding another separate pessimistic assumption on power-switch's resistance and flying-capacitor's density of an SC converter. Such worst-case design methodology can lead to a significantly over-sized flying capacitor and thereby limit on-chip integration of a converter. Our proposed methodology instead adopts the better than worst-case (BTWC) perspective to avoid over-design and thus optimizes the area of an SC converter. Specifically, we propose BTWC load modeling where we specify non-pessimistic sets of supply voltage requirement and load power dissipation across variations. In addition, by considering coupled variations between the SC converter and the load integrated in the same die, our methodology can further reduce the pessimism in power-switch's resistance and capacitor density. The proposed co-design methodology is verified with a 2:1 SC converter and a digital load in a 65 nm. The resulted converter achieves more than one order of magnitude reduction in the flying capacitor size as compared to the conventional worst-case design while maintaining the target conversion efficiency and target throughput. We also verified our methodology with a wide range of load characteristics in terms of their supply voltages and current draw and confirmed the similar benefits.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81040800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
V. Srinivasan, B. Fleischer, Sunil Shukla, M. Ziegler, J. Silberman, Jinwook Oh, Jungwook Choi, S. Mueller, A. Agrawal, Tina Babinsky, N. Cao, Chia-Yu Chen, P. Chuang, T. Fox, G. Gristede, Michael Guillorn, Howard Haynie, M. Klaiber, Dongsoo Lee, S. Lo, G. Maier, M. Scheuermann, Swagath Venkataramani, Christos Vezyrtzis, Naigang Wang, F. Yee, Ching Zhou, P. Lu, B. Curran, Leland Chang, K. Gopalakrishnan
The combination of growth in compute capabilities and availability of large datasets has led to a re-birth of deep learning. Deep Neural Networks (DNNs) have become state-of-the-art in a variety of machine learning tasks spanning domains across vision, speech, and machine translation. Deep Learning (DL) achieves high accuracy in these tasks at the expense of 100s of ExaOps of computation; posing significant challenges to efficient large-scale deployment in both resource-constrained environments and data centers. One of the key enablers to improve operational efficiency of DNNs is the observation that when extracting deep insight from vast quantities of structured and unstructured data the exactness imposed by traditional computing is not required. Relaxing the "exactness" constraint enables exploiting opportunities for approximate computing across all layers of the system stack. In this talk we present a multi-TOPS AI core [3] for acceleration of deep learning training and inference in systems from edge devices to data centers. We demonstrate that to derive high sustained utilization and energy efficiency from the AI core requires ground-up re-thinking to exploit approximate computing across the stack including algorithms, architecture, programmability, and hardware. Model accuracy is the fundamental measure of deep learning quality. The compute engine precision in our AI core is carefully calibrated to realize significant reduction in area and power while not compromising numerical accuracy. Our research at the DL algorithms/applications-level [2] shows that it is possible to carefully tune the precision of both weights and activations to as low as 2-bits for inference and was used to guide the choices of compute precision supported in the architecture and hardware for both training and inference. Similarly, distributed DL training's scalability is impacted by the communication overhead to exchange gradients and weights after each mini-batch. Our research on gradient compression [1] shows by selectively sending gradients larger than a threshold, and by further choosing the threshold based on the importance of the gradient we achieve achieve compression ratio of 40X for convolutional layers, and up to 200X for fully-connected layers of the network without losing model accuracy. These results guide the choice of interconnection network topology exploration for a system of accelerators built using the AI core. Overall, our work shows how the benefits from exploiting approximation using algorithm/application's robustness to tolerate reduced precision, and compressed data communication can be combined effectively with the architecture and hardware of the accelerator designed to support these reduced-precision computation and compressed data communication. Our results demonstate improved end-to-end efficiency of the DL accelerator across different metrics such as high sustained TOPs, high TOPs/watt and TOPs/mm2 catering to different operating environm
{"title":"Across the Stack Opportunities for Deep Learning Acceleration","authors":"V. Srinivasan, B. Fleischer, Sunil Shukla, M. Ziegler, J. Silberman, Jinwook Oh, Jungwook Choi, S. Mueller, A. Agrawal, Tina Babinsky, N. Cao, Chia-Yu Chen, P. Chuang, T. Fox, G. Gristede, Michael Guillorn, Howard Haynie, M. Klaiber, Dongsoo Lee, S. Lo, G. Maier, M. Scheuermann, Swagath Venkataramani, Christos Vezyrtzis, Naigang Wang, F. Yee, Ching Zhou, P. Lu, B. Curran, Leland Chang, K. Gopalakrishnan","doi":"10.1145/3218603.3241339","DOIUrl":"https://doi.org/10.1145/3218603.3241339","url":null,"abstract":"The combination of growth in compute capabilities and availability of large datasets has led to a re-birth of deep learning. Deep Neural Networks (DNNs) have become state-of-the-art in a variety of machine learning tasks spanning domains across vision, speech, and machine translation. Deep Learning (DL) achieves high accuracy in these tasks at the expense of 100s of ExaOps of computation; posing significant challenges to efficient large-scale deployment in both resource-constrained environments and data centers. One of the key enablers to improve operational efficiency of DNNs is the observation that when extracting deep insight from vast quantities of structured and unstructured data the exactness imposed by traditional computing is not required. Relaxing the \"exactness\" constraint enables exploiting opportunities for approximate computing across all layers of the system stack. In this talk we present a multi-TOPS AI core [3] for acceleration of deep learning training and inference in systems from edge devices to data centers. We demonstrate that to derive high sustained utilization and energy efficiency from the AI core requires ground-up re-thinking to exploit approximate computing across the stack including algorithms, architecture, programmability, and hardware. Model accuracy is the fundamental measure of deep learning quality. The compute engine precision in our AI core is carefully calibrated to realize significant reduction in area and power while not compromising numerical accuracy. Our research at the DL algorithms/applications-level [2] shows that it is possible to carefully tune the precision of both weights and activations to as low as 2-bits for inference and was used to guide the choices of compute precision supported in the architecture and hardware for both training and inference. Similarly, distributed DL training's scalability is impacted by the communication overhead to exchange gradients and weights after each mini-batch. Our research on gradient compression [1] shows by selectively sending gradients larger than a threshold, and by further choosing the threshold based on the importance of the gradient we achieve achieve compression ratio of 40X for convolutional layers, and up to 200X for fully-connected layers of the network without losing model accuracy. These results guide the choice of interconnection network topology exploration for a system of accelerators built using the AI core. Overall, our work shows how the benefits from exploiting approximation using algorithm/application's robustness to tolerate reduced precision, and compressed data communication can be combined effectively with the architecture and hardware of the accelerator designed to support these reduced-precision computation and compressed data communication. Our results demonstate improved end-to-end efficiency of the DL accelerator across different metrics such as high sustained TOPs, high TOPs/watt and TOPs/mm2 catering to different operating environm","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90996465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Swaminathan Narayanaswamy, Sangyoung Park, S. Steinhorst, S. Chakraborty
Active cell balancing is the process of improving the usable capacity of a series-connected Lithium-Ion (Li-Ion) battery pack by redistributing the charge levels of individual cells. Depending upon the State-of-Charge (SoC) distribution of the individual cells in the pack, an appropriate charge transfer pattern (cell-to-cell, cell-to-module, module-to-cell or module-to-module) has to be selected for improving the usable energy of the battery pack. However, existing active cell balancing circuits are only capable of performing limited number of charge transfer patterns and, therefore, have a reduced energy efficiency for different types of SoC distribution. In this paper, we propose a modular, multi-pattern active cell balancing architecture that is capable of performing multiple types of charge transfer patterns (cell-to-cell, cell-to-module, module-to-cell and module-to-module) with a reduced number of hardware components and control signals compared to existing solutions. We derive a closed-form, analytical model of our proposed balancing architecture with which we profile the efficiency of the individual charge transfer patterns enabled by our architecture. Using the profiling analysis, we propose a hybrid charge equalization strategy that automatically selects the most energy-efficient charge transfer pattern depending upon the SoC distribution of the battery pack and the characteristics of our proposed balancing architecture. Case studies show that our proposed balancing architecture and hybrid charge equalization strategy provide up to a maximum of 46.83% improvement in energy efficiency compared to existing solutions.
{"title":"Multi-Pattern Active Cell Balancing Architecture and Equalization Strategy for Battery Packs","authors":"Swaminathan Narayanaswamy, Sangyoung Park, S. Steinhorst, S. Chakraborty","doi":"10.1145/3218603.3218607","DOIUrl":"https://doi.org/10.1145/3218603.3218607","url":null,"abstract":"Active cell balancing is the process of improving the usable capacity of a series-connected Lithium-Ion (Li-Ion) battery pack by redistributing the charge levels of individual cells. Depending upon the State-of-Charge (SoC) distribution of the individual cells in the pack, an appropriate charge transfer pattern (cell-to-cell, cell-to-module, module-to-cell or module-to-module) has to be selected for improving the usable energy of the battery pack. However, existing active cell balancing circuits are only capable of performing limited number of charge transfer patterns and, therefore, have a reduced energy efficiency for different types of SoC distribution. In this paper, we propose a modular, multi-pattern active cell balancing architecture that is capable of performing multiple types of charge transfer patterns (cell-to-cell, cell-to-module, module-to-cell and module-to-module) with a reduced number of hardware components and control signals compared to existing solutions. We derive a closed-form, analytical model of our proposed balancing architecture with which we profile the efficiency of the individual charge transfer patterns enabled by our architecture. Using the profiling analysis, we propose a hybrid charge equalization strategy that automatically selects the most energy-efficient charge transfer pattern depending upon the SoC distribution of the battery pack and the characteristics of our proposed balancing architecture. Case studies show that our proposed balancing architecture and hybrid charge equalization strategy provide up to a maximum of 46.83% improvement in energy efficiency compared to existing solutions.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83359856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep learning models have reached state of the art performance in many machine learning tasks. Benefits in terms of energy, bandwidth, latency, etc., can be obtained by evaluating these models directly within Internet of Things end nodes, rather than in the cloud. This calls for implementations of deep learning tasks that can run in resource limited environments with low energy footprints. Research and industry have recently investigated these aspects, coming up with specialized hardware accelerators for low power deep learning. One effective technique adopted in these devices consists in reducing the bit-width of calculations, exploiting the error resilience of deep learning. However, bit-widths are tipically set statically for a given model, regardless of input data. Unless models are retrained, this solution invariably sacrifices accuracy for energy efficiency. In this paper, we propose a new approach for implementing input-dependant dynamic bit-width reconfiguration in deep learning accelerators. Our method is based on a fully automatic characterization phase, and can be applied to popular models without retraining. Using the energy data from a real deep learning accelerator chip, we show that 50% energy reduction can be achieved with respect to a static bit-width selection, with less than 1% accuracy loss.
{"title":"Dynamic Bit-width Reconfiguration for Energy-Efficient Deep Learning Hardware","authors":"D. J. Pagliari, E. Macii, M. Poncino","doi":"10.1145/3218603.3218611","DOIUrl":"https://doi.org/10.1145/3218603.3218611","url":null,"abstract":"Deep learning models have reached state of the art performance in many machine learning tasks. Benefits in terms of energy, bandwidth, latency, etc., can be obtained by evaluating these models directly within Internet of Things end nodes, rather than in the cloud. This calls for implementations of deep learning tasks that can run in resource limited environments with low energy footprints. Research and industry have recently investigated these aspects, coming up with specialized hardware accelerators for low power deep learning. One effective technique adopted in these devices consists in reducing the bit-width of calculations, exploiting the error resilience of deep learning. However, bit-widths are tipically set statically for a given model, regardless of input data. Unless models are retrained, this solution invariably sacrifices accuracy for energy efficiency. In this paper, we propose a new approach for implementing input-dependant dynamic bit-width reconfiguration in deep learning accelerators. Our method is based on a fully automatic characterization phase, and can be applied to popular models without retraining. Using the energy data from a real deep learning accelerator chip, we show that 50% energy reduction can be achieved with respect to a static bit-width selection, with less than 1% accuracy loss.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80438053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Srinivasa, A. Ramanathan, Xueqing Li, Wei-Hao Chen, F. Hsueh, Chih-Chao Yang, C. Shen, J. Shieh, S. Gupta, Meng-Fan Chang, Swaroop Ghosh, J. Sampson, N. Vijaykrishnan
We present a novel 3D-SRAM cell using a Monolithic 3D integration (M3D-IC) technology for realizing both robustness and In-memory Boolean logic compute support. The proposed two-layer design makes use of additional transistors over the SRAM layer to enable assist techniques as well as provide logic functions (such as AND/NAND, OR/NOR, XNOR/XOR) without degrading cell density. Through analysis, we provide insights into the benefits provided by three memory assist and two logic modes and evaluate the energy efficiency of our proposed design. Assist techniques improve SRAM read stability by 2.2x and increase the write margin by 17.6%, while staying within the SRAM footprint. By virtue of increased robustness, the cell enables seamless operation at lower supply voltages and thereby ensures energy efficiency. Energy Delay Product (EDP) reduces by 1.6x over standard 6T SRAM with a faster data access. Transistor placement and their biasing technique in layer-2 enables In-memory bitwise Boolean computation. When computing bulk In-memory operations, 6.5x energy savings is achieved as compared to computing outside the memory system.
{"title":"A Monolithic-3D SRAM Design with Enhanced Robustness and In-Memory Computation Support","authors":"S. Srinivasa, A. Ramanathan, Xueqing Li, Wei-Hao Chen, F. Hsueh, Chih-Chao Yang, C. Shen, J. Shieh, S. Gupta, Meng-Fan Chang, Swaroop Ghosh, J. Sampson, N. Vijaykrishnan","doi":"10.1145/3218603.3218645","DOIUrl":"https://doi.org/10.1145/3218603.3218645","url":null,"abstract":"We present a novel 3D-SRAM cell using a Monolithic 3D integration (M3D-IC) technology for realizing both robustness and In-memory Boolean logic compute support. The proposed two-layer design makes use of additional transistors over the SRAM layer to enable assist techniques as well as provide logic functions (such as AND/NAND, OR/NOR, XNOR/XOR) without degrading cell density. Through analysis, we provide insights into the benefits provided by three memory assist and two logic modes and evaluate the energy efficiency of our proposed design. Assist techniques improve SRAM read stability by 2.2x and increase the write margin by 17.6%, while staying within the SRAM footprint. By virtue of increased robustness, the cell enables seamless operation at lower supply voltages and thereby ensures energy efficiency. Energy Delay Product (EDP) reduces by 1.6x over standard 6T SRAM with a faster data access. Transistor placement and their biasing technique in layer-2 enables In-memory bitwise Boolean computation. When computing bulk In-memory operations, 6.5x energy savings is achieved as compared to computing outside the memory system.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87639334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}