The Learning with Errors (LWE) problem is a novel foundation for a variety of cryptographic applications, including quantum-secure public-key encryption, digital signatures, and fully homomorphic encryption. In this work, we propose an approximate decryption technique for LWE-based cryptosystems. Based on the fact that the decryption process of such systems is inherently approximate, we apply hardware-based approximate computing techniques to it. Rigorous experiments show that the proposed technique simultaneously achieves a 1.3× (resp., 2.5×) speed increase, a 2.06× (resp., 7.89×) area reduction, a 20.5% (resp., 4×) power reduction, and an average 27.1% (resp., 65.6%) ciphertext size reduction for a public-key encryption scheme (resp., a state-of-the-art fully homomorphic encryption scheme).
{"title":"DWE: Decrypting Learning with Errors with Errors","authors":"S. Bian, Masayuki Hiromoto, Takashi Sato","doi":"10.1145/3195970.3196032","DOIUrl":"https://doi.org/10.1145/3195970.3196032","url":null,"abstract":"The Learning with Errors (LWE) problem is a novel foundation of a variety of cryptographic applications, including quantumly-secure public-key encryption, digital signature, and fully homomorphic encryption. In this work, we propose an approximate decryption technique for LWE-based cryptosystems. Based on the fact that the decryption process for such systems is inherently approximate, we apply hardware-based approximate computing techniques. Rigorous experiments have shown that the proposed technique simultaneously achieved 1.3× (resp., 2.5×) speed increase, 2.06× (resp., 7.89×) area reduction, 20.5% (resp., 4×) of power reduction, and an average of 27.1% (resp., 65.6%) ciphertext size reduction for public-key encryption scheme (resp., a state-of-the-art fully homomorphic encryption scheme).","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"25 7 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83946161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Garbage collection (GC) is one of the main causes of the long-tail latency problem in storage systems. At the 99th percentile, long-tail latency due to GC is more than 100 times greater than the average latency. Because of such long tail latency, real-time and quality-critical systems cannot meet their requirements. In this study, we propose a novel key-state management technique for reinforcement learning-assisted garbage collection. The goal is to dynamically manage key states drawn from a significant number of state candidates. Dynamic management enables us to utilize suitable and frequently recurring key states at a small area cost, since the full state space does not have to be managed. The experimental results show that the proposed technique reduces long-tail latency by 22–25% compared to a state-of-the-art scheme under real-world workloads.
{"title":"Dynamic Management of Key States for Reinforcement Learning-assisted Garbage Collection to Reduce Long Tail Latency in SSD","authors":"Won-Kyung Kang, S. Yoo","doi":"10.1145/3195970.3196034","DOIUrl":"https://doi.org/10.1145/3195970.3196034","url":null,"abstract":"Garbage collection (GC) is one of main causes of the long-tail latency problem in storage systems. Long-tail latency due to GC is more than 100 times greater than the average latency at the 99th percentile. Therefore, due to such a long tail latency, real-time systems and quality-critical systems cannot meet the system requirements. In this study, we propose a novel key state management technique of reinforcement learning-assisted garbage collection. The purpose of this study is to dynamically manage key states from a significant number of state candidates. Dynamic management enables us to utilize suitable and frequently recurring key states at a small area cost since the full states do not have to be managed. The experimental results show that the proposed technique reduces by 22–25% the long-tail latency compared to a state-of-the-art scheme with real-world workloads.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"76 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76905046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shaahin Angizi, Zhezhi He, A. S. Rakin, Deliang Fan
In this paper, an energy-efficient and high-speed comparator-based processing-in-memory accelerator (CMP-PIM) is proposed to efficiently execute a novel hardware-oriented comparator-based deep neural network called CMPNET. Inspired by the local binary pattern feature extraction method combined with depthwise separable convolution, we first modify the existing Convolutional Neural Network (CNN) algorithm by replacing the computationally intensive multiplications in convolution layers with more efficient and less complex comparison and addition operations. We then propose CMP-PIM, which employs parallel computational memory sub-arrays based on SOT-MRAM as its fundamental processing units. We compare the performance of the CMP-PIM accelerator on different data-sets with recent CNN accelerator designs. With comparable inference accuracy on the SVHN data-set, CMP-PIM achieves ~94× and 3× better energy efficiency than CNN and Local Binary CNN (LBCNN) accelerators, respectively. In addition, it achieves a 4.3× speed-up over the CNN baseline with an identical network configuration.
{"title":"CMP-PIM: An Energy-Efficient Comparator-based Processing-In-Memory Neural Network Accelerator","authors":"Shaahin Angizi, Zhezhi He, A. S. Rakin, Deliang Fan","doi":"10.1145/3195970.3196009","DOIUrl":"https://doi.org/10.1145/3195970.3196009","url":null,"abstract":"In this paper, an energy-efficient and high-speed comparator-based processing-in-memory accelerator (CMP-PIM) is proposed to efficiently execute a novel hardware-oriented comparator-based deep neural network called CMPNET. Inspired by local binary pattern feature extraction method combined with depthwise separable convolution, we first modify the existing Convolutional Neural Network (CNN) algorithm by replacing the computationally-intensive multiplications in convolution layers with more efficient and less complex comparison and addition. Then, we propose a CMP-PIM that employs parallel computational memory sub-array as a fundamental processing unit based on SOT-MRAM. We compare CMP-PIM accelerator performance on different data-sets with recent CNN accelerator designs. With the close inference accuracy on SVHN data-set, CMP-PIM can get ~ 94× and 3× better energy efficiency compared to CNN and Local Binary CNN (LBCNN), respectively. Besides, it achieves 4.3× speed-up compared to CNN-baseline with identical network configuration.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"27 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86043124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To take advantage of the performance enhancements provided by multicore processors, new instruction set architectures (ISAs) and parallel programming libraries have been investigated across multiple industrial segments. This paper investigates the impact of parallelization libraries and distinct ISAs on the soft error reliability of two multicore ARM processor models (i.e., Cortex-A9 and Cortex-A72) running the Linux kernel and benchmarks with up to 87 billion instructions. An extensive soft error evaluation with more than 1.2 million simulation hours, considering the ARMv7 and ARMv8 ISAs and the NAS Parallel Benchmark (NPB) suite, is presented.
{"title":"Extensive Evaluation of Programming Models and ISAs Impact on Multicore So Error Reliability","authors":"F. Rosa, Vitor V. Bandeira, R. Reis, Luciano Ost","doi":"10.1145/3195970.3196050","DOIUrl":"https://doi.org/10.1145/3195970.3196050","url":null,"abstract":"To take advantage of the performance enhancements provided by multicore processors, new instruction set architectures (ISAs) and parallel programming libraries have been investigated across multiple industrial segments. This paper investigates the impact of parallelization libraries and distinct ISAs on the soft error reliability of two multicore ARM processor models (i.e., Cortex-A9 and Cortex-A72), running Linux Kernel and benchmarks with up to 87 billion instructions. An extensive soft error evaluation with more than 1.2 million simulation hours, considering ARMv7 and ARMv8 ISAs and the NAS Parallel Benchmark (NPB) suite is presented.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"75 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88786574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Approximate computing is an emerging energy-efficient paradigm for error-resilient applications, and approximate logic synthesis (ALS) is an important field within it. One key issue in improving existing ALS flows is to derive a more accurate and efficient batch error estimation technique covering all approximate transformations under consideration. In this work, we propose a novel batch error estimation method based on Monte Carlo simulation and local change propagation. It is generally applicable to any statistical error measurement, such as error rate and average error magnitude. We applied the technique to an existing state-of-the-art ALS approach and demonstrated its effectiveness in deriving better approximate circuits.
{"title":"Efficient Batch Statistical Error Estimation for Iterative Multi-level Approximate Logic Synthesis","authors":"Sanbao Su, Yi Wu, Weikang Qian","doi":"10.1145/3195970.3196038","DOIUrl":"https://doi.org/10.1145/3195970.3196038","url":null,"abstract":"Approximate computing is an emerging energy-efficient paradigm for error-resilient applications. Approximate logic synthesis (ALS) is an important field of it. To improve the existing ALS flows, one key issue is to derive a more accurate and efficient batch error estimation technique for all approximate transformations under consideration. In this work, we propose a novel batch error estimation method based on Monte Carlo simulation and local change propagation. It is generally applicable to any statistical error measurement such as error rate and average error magnitude. We applied the technique to an existing state-of-the-art ALS approach and demonstrated its effectiveness in deriving better approximate circuits.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"11 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85232613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The recently proposed nanoscale Mott memristor features negative differential resistance and chaotic dynamics. This work proposes a novel neuromorphic computing system that utilizes Mott memristors to simplify the peripheral circuitry. Based on the analytic description of its chaotic dynamics and relaxation oscillation, we carefully tune the working point of the Mott memristors to balance the chaotic behavior, weighing testing accuracy against training efficiency. Compared with conventional designs, the proposed design accelerates training by 1.893× on average and saves 27.68% and 43.32% of power consumption with 36.67% and 26.75% less area for single-layer and two-layer perceptrons, respectively.
{"title":"A Neuromorphic Design Using Chaotic Mott Memristor with Relaxation Oscillation","authors":"Bonan Yan, Xiong Cao, Hai Li","doi":"10.1145/3195970.3195977","DOIUrl":"https://doi.org/10.1145/3195970.3195977","url":null,"abstract":"The recent proposed nanoscale Mott memristor features negative differential resistance and chaotic dynamics. This work proposes a novel neuromorphic computing system that utilizes Mott memristors to simplify peripheral circuitry. According to the analytic description of chaotic dynamics and relaxation oscillation, we carefully tune the working point of Mott memristors to balance the chaotic behavior weighing testing accuracy and training efficiency. Compared with conventional designs, the proposed design accelerates the training by 1.893× averagely and saves 27.68% and 43.32% power consumption with 36.67% and 26.75% less area for single-layer and two-layer perceptrons, respectively.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"55 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79836166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Valentina Richthammer, T. Schwarzer, S. Wildermann, J. Teich, Michael Glass
Determining feasible application mappings for Design Space Exploration (DSE) and run-time embedding is a challenge for modern many-core systems. The underlying NP-complete system-synthesis problem faces tremendously complex problem instances due to the hundreds of heterogeneous processing elements, their communication infrastructure, and the resulting number of mapping possibilities. Thus, we propose to employ a search-space splitting (SSS) technique using architecture decomposition to increase the performance of existing design-time and run-time synthesis approaches. The technique first restricts the search for application embeddings to selected sub-architectures at substantially reduced complexity; the complete architecture therefore needs to be searched only when no embedding is found on any sub-system. Furthermore, we introduce a basic learning mechanism to detect promising sub-architectures and subsequently restrict the search to those. We exemplify the SSS for a SAT-based and a problem-specific backtracking-based system synthesis as part of DSE for NoC-based many-core systems. Experimental results show drastically reduced execution times (≈ 15–50× on a 24×24 architecture) and an enhanced quality of the embedding, since fewer mappings (≈ 20–40× fewer than with the non-decomposing procedures) need to be discarded due to a timeout.
{"title":"Architecture Decomposition in System Synthesis of Heterogeneous Many-Core Systems","authors":"Valentina Richthammer, T. Schwarzer, S. Wildermann, J. Teich, Michael Glass","doi":"10.1145/3195970.3195995","DOIUrl":"https://doi.org/10.1145/3195970.3195995","url":null,"abstract":"Determining feasible application mappings for Design Space Exploration (DSE) and run-time embedding is a challenge for modern many-core systems. The underlying NP-complete system-synthesis problem faces tremendously complex problem instances due to the hundreds of heterogeneous processing elements, their communication infrastructure, and the resulting number of mapping possibilities. Thus, we propose to employ a search-space splitting (SSS) technique using architecture decomposition to increase the performance of existing design-time and run-time synthesis approaches. The technique first restricts the search for application embeddings to selected sub-architectures at substantially reduced complexity; therefore, the complete architecture needs to be searched only in case no embedding is found on any sub-system. Furthermore, we introduce a basic learning mechanism to detect promising sub-architectures and subsequently restrict the search to those. We exemplify the SSS for a SAT-based and a problem-specific backtracking-based system synthesis as part of DSE for NoC-based many-core systems. Experimental results show drastically reduced execution times (≈ 15–50 × on a 24×24 architecture) and an enhanced quality of the embedding, since less mappings (≈ 20–40 ×, compared to the non-decomposing procedures) need to be discarded due to a timeout.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"19 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79942829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low-power systems-on-chip (SoCs) are now at the heart of Internet-of-Things (IoT) devices, which are well known for their bursty workloads and limited energy storage, usually in the form of tiny batteries. To ensure battery lifetime, DVFS has become an essential technique in such SoCs. With continuously decreasing supply levels, the noise margins of these devices are already being squeezed. During a DVFS transition, the large current that accompanies the clock-speed change runs into or out of the clock network within a few clock cycles and induces large Ldi/dt noise, thereby stressing the power delivery network (PDN). Due to limited area and cost targets, adding additional decoupling capacitance to mitigate such noise is usually challenging. A common approach is to gradually introduce or remove the additional clock cycles to increase or reduce the clock frequency in steps, a.k.a. clock skipping. However, such a technique may increase the DVFS transition time and still cannot guarantee minimal noise. In this work, we propose a new noise-aware DVFS sequence optimization technique that formulates clock-skipping sequence optimization as a mixed 0/1 program. Moreover, the method is extended to schedule extensive wake-up activities on different clock domains for the same purpose. The results show that we achieve a minimal-noise sequence within the desired transition time, with 53% noise reduction and more than 15–17% power savings compared with the traditional approach.
{"title":"Noise-Aware DVFS Transition Sequence Optimization for Battery-Powered IoT Devices","authors":"Shaoheng Luo, Cheng Zhuo, H. Gan","doi":"10.1145/3195970.3196080","DOIUrl":"https://doi.org/10.1145/3195970.3196080","url":null,"abstract":"Low power system-on-chips (SoCs) are now at the heart of Internet-of-Things (IoT) devices, which are well known for their bursty workloads and limited energy storage — usually in the form of tiny batteries. To ensure battery lifetime, DVFS has become an essential technique in such SoC chips. With continuously decreasing supply level, noise margins in these devices are already being squeezed. During DVFS transition, large current that accompanies the clock speed transition runs into or out of clock networks in a few clock cycles, and induces large Ldi/dt noise, thereby stressing the power delivery network (PDN). Due to the limited area and cost target, adding additional decap to mitigate such noise is usually challenging. A common approach is to gradually introduce/remove the additional clock cycles to increase or reduce the clock frequency in steps, a.k.a., clock skipping. However, such a technique may increase DVFS transition time, and still cannot guarantee minimal noise. In this work, we propose a new noise-aware DVFS sequence optimization technique by formulating a mixed 0/1 programming to resolve the problems of clock skipping sequence optimization. Moreover, the method is also extended to schedule extensive wake-up activities on different clock domains for the same purpose. The results show that we are able to achieve minimal-noise sequence within desired transition time with 53% noise reduction and save more than 15–17% power compared with the traditional approach.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"235 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87083034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Salim Ullah, Semeen Rehman, B. Prabakaran, F. Kriebel, Muhammad Abdullah Hanif, M. Shafique, Akash Kumar
The architectural differences between ASICs and FPGAs limit the effective performance gains achievable by applying ASIC-based approximation principles to FPGA-based reconfigurable computing systems. This paper presents a novel approximate multiplier architecture customized for FPGA fabrics, an efficient design methodology, and an open-source library. Our designs provide higher area, latency, and energy gains along with better output accuracy than state-of-the-art ASIC-based approximate multipliers. Moreover, compared to the multiplier IP offered by Xilinx Vivado, our proposed design achieves up to 30%, 53%, and 67% gains in area, latency, and energy, respectively, while incurring an insignificant accuracy loss (below 1% average relative error). Our library of approximate multipliers is open-source and available online at https://cfaed.tudresden.de/pd-downloads to fuel further research and development in this area, thereby enabling a new research direction for the FPGA community.
{"title":"Area-Optimized Low-Latency Approximate Multipliers for FPGA-based Hardware Accelerators","authors":"Salim Ullah, Semeen Rehman, B. Prabakaran, F. Kriebel, Muhammad Abdullah Hanif, M. Shafique, Akash Kumar","doi":"10.1145/3195970.3195996","DOIUrl":"https://doi.org/10.1145/3195970.3195996","url":null,"abstract":"The architectural differences between ASICs and FPGAs limit the effective performance gains achievable by the application of ASIC-based approximation principles for FPGA-based reconfigurable computing systems. This paper presents a novel approximate multiplier architecture customized towards the FPGA-based fabrics, an efficient design methodology, and an open-source library. Our designs provide higher area, latency and energy gains along with better output accuracy than those offered by the state-of-the-art ASIC-based approximate multipliers. Moreover, compared to the multiplier IP offered by the Xilinx Vivado, our proposed design achieves up to 30%, 53%, and 67% gains in terms of area, latency, and energy, respectively, while incurring an insignificant accuracy loss (on average, below 1% average relative error). Our library of approximate multipliers is open-source and available online at https://cfaed.tudresden.de/pd-downloads to fuel further research and development in this area, and thereby enabling a new research direction for the FPGA community.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"36 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80863855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yi Cai, Yujun Lin, Lixue Xia, Xiaoming Chen, Song Han, Yu Wang, Huazhong Yang
Deeper and larger Neural Networks (NNs) have made breakthroughs in many fields, but conventional CMOS-based computing platforms struggle to achieve higher energy efficiency. RRAM-based systems provide a promising solution for building efficient Training-In-Memory Engines (TIME). However, the endurance of RRAM cells is limited, which is a severe issue because the NN weights need to be updated thousands to millions of times during training. Gradient sparsification can address this problem by dropping most of the smaller gradients, but it introduces unacceptable computation cost. We propose an effective framework, SGS-ARS, comprising a Structured Gradient Sparsification (SGS) scheme and an Aging-aware Row Swapping (ARS) scheme, to guarantee write balance across whole RRAM crossbars and prolong the lifetime of TIME. Our experiments demonstrate that a 356× lifetime extension is achieved when TIME is programmed to train ResNet-50 on the ImageNet dataset with our SGS-ARS framework.
{"title":"Long Live TIME: Improving Lifetime for Training-In-Memory Engines by Structured Gradient Sparsification","authors":"Yi Cai, Yujun Lin, Lixue Xia, Xiaoming Chen, Song Han, Yu Wang, Huazhong Yang","doi":"10.1145/3195970.3196071","DOIUrl":"https://doi.org/10.1145/3195970.3196071","url":null,"abstract":"Deeper and larger Neural Networks (NNs) have made breakthroughs in many fields. While conventional CMOS-based computing platforms are hard to achieve higher energy efficiency. RRAM-based systems provide a promising solution to build efficient Training-In-Memory Engines (TIME). While the endurance of RRAM cells is limited, it’s a severe issue as the weights of NN always need to be updated for thousands to millions of times during training. Gradient sparsification can address this problem by dropping off most of the smaller gradients but introduce unacceptable computation cost. We proposed an effective framework, SGS-ARS, including Structured Gradient Sparsification (SGS) and Aging-aware Row Swapping (ARS) scheme, to guarantee write balance across whole RRAM crossbars and prolong the lifetime of TIME. Our experiments demonstrate that 356× lifetime extension is achieved when TIME is programmed to train ResNet-50 on Imagenet dataset with our SGS-ARS framework.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"32 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88167624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}