AXI HyperConnect: A Predictable, Hypervisor-level Interconnect for Hardware Accelerators in FPGA SoC
2020 57th ACM/IEEE Design Automation Conference (DAC) | Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218652
Francesco Restuccia, Alessandro Biondi, Mauro Marinoni, Giorgiomaria Cicero, G. Buttazzo
FPGA-based systems-on-chip (SoCs) are powerful computing platforms for implementing mixed-criticality systems that require both multiprocessing and hardware acceleration. Virtualization via hypervisor technologies is a de facto standard technique for allowing multiple execution domains with different criticality levels to coexist in isolation on the same platform. Implementing such technologies on FPGA-based SoCs poses new challenges: one of them is the isolation of hardware accelerators deployed on the FPGA fabric that belong to different domains but share common resources such as a memory bus. This paper proposes AXI HyperConnect, a hypervisor-level hardware component that interconnects hardware accelerators to the same bus while ensuring isolation and predictability. AXI HyperConnect has been implemented on a modern Xilinx FPGA SoC and tested with real-world accelerators, including one for Deep Neural Network inference.
{"title":"AXI HyperConnect: A Predictable, Hypervisor-level Interconnect for Hardware Accelerators in FPGA SoC","authors":"Francesco Restuccia, Alessandro Biondi, Mauro Marinoni, Giorgiomaria Cicero, G. Buttazzo","doi":"10.1109/DAC18072.2020.9218652","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218652","url":null,"abstract":"FPGA-based system-on-chips (SoC) are powerful computing platforms to implement mixed-criticality systems that require both multiprocessing and hardware acceleration. Virtualization via hypervisor technologies is, de-facto, an effective technique to allow the co-existence of multiple execution domains with different criticality levels in isolation upon the same platform. Implementing such technologies on FPGA-based SoC poses new challenges: one of such is the isolation of hardware accelerators deployed on the FPGA fabric that belong to different domains but share common resources such as a memory bus. This paper proposes AXI HyperConnect, a hypervisor-level hardware component that allows interconnecting hardware accelerators to the same bus while ensuring isolation and predictability features. AXI HyperConnect has been implemented on modern FPGA-SoC by Xilinx and tested with real-world accelerators, including one for Deep Neural Network inference.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"52 18","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114027614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Late Breaking Results: Can You Hear Me? Towards an Ultra Low-Cost Hearing Screening Device
2020 57th ACM/IEEE Design Automation Conference (DAC) | Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218597
Nils Heitmann, Philipp H. Kindt, S. Chakraborty
Hearing screening devices emit an acoustic signal into the outer ear, which evokes a specific response from a healthy inner ear. However, the high cost of such devices prevents their wide deployment in schools or private homes, especially in developing countries. In this paper, we show for the first time that such tests are also feasible with a device that uses only one speaker to emit the signal and then reuses the same speaker, operated as a microphone, to record the response. Existing devices rely on a speaker and microphone pair, which makes them significantly more complex and costly. We further outline the embedded systems and signal processing challenges that such a setup entails. If successful, this approach has the potential to make hearing screening available to a much wider population in developing countries.
{"title":"Late Breaking Results: Can You Hear Me? Towards an Ultra Low-Cost Hearing Screening Device","authors":"Nils Heitmann, Philipp H. Kindt, S. Chakraborty","doi":"10.1109/DAC18072.2020.9218597","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218597","url":null,"abstract":"Hearing screening devices emit an acoustic signal in the outer ear, which invokes a specific response from a healthy inner ear. However, the high cost of such devices prevents widely deploying them in schools or private homes, especially in developing countries. In this paper, we for the first time show that such tests are also feasible with a device that consists of only one speaker for emitting the signal and using the same speaker – now as a microphone – for also recording the response. Existing devices rely on a speaker and microphone pair, which makes them significantly more complex and costly. We further outline the embedded systems and signal processing challenges that such a setup entails. If successful, it has the potential to make hearing screening available to a much wider population in developing countries.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122852437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Verification for Field-coupled Nanocomputing Circuits
2020 57th ACM/IEEE Design Automation Conference (DAC) | Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218641
Marcel Walter, R. Wille, F. Sill, Daniel Große, R. Drechsler
With the decline of Moore's Law, several post-CMOS technologies are currently under heavy consideration. Promising candidates can be found in the class of Field-coupled Nanocomputing (FCN) devices, as they promise very high processing performance with extremely low energy dissipation. As design automation emerges in this domain, the need for formal verification approaches arises. Unfortunately, FCN circuits come with certain domain-specific properties that render conventional verification methods inapplicable. In this paper, we investigate this issue and propose a verification approach for FCN circuits that addresses it. For the first time, this provides researchers and engineers with an automatic method to check whether an obtained FCN circuit design indeed implements the given/desired function. A prototype implementation demonstrates the applicability of the proposed approach.
{"title":"Verification for Field-coupled Nanocomputing Circuits","authors":"Marcel Walter, R. Wille, F. Sill, Daniel Große, R. Drechsler","doi":"10.1109/DAC18072.2020.9218641","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218641","url":null,"abstract":"With the decline of Moore’s Law, several post-CMOS technologies are currently under heavy consideration. Promising candidates can be found in the class of Field-coupled Nanocomputing (FCN) devices as they allow for highest processing performance with tremendously low energy dissipation. With upcoming design automation in this domain, the need for formal verification approaches arises. Unfortunately, FCN circuits come with certain domain-specific properties that render conventional methods for the verification non-applicable. In this paper, we investigate this issue and propose a verification approach for FCN circuits that addresses this problem. For the first time, this provides researchers and engineers with an automatic method that allows them to check whether an obtained FCN circuit design indeed implements the given/desired function. A prototype implementation demonstrates the applicability of the proposed approach.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131378803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Runtime Trust Evaluation and Hardware Trojan Detection Using On-Chip EM Sensors
2020 57th ACM/IEEE Design Automation Conference (DAC) | Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218514
Jiaji He, Xiaolong Guo, Haocheng Ma, Yanjiang Liu, Yiqiang Zhao, Yier Jin
It has been widely demonstrated that post-deployment trust evaluation approaches, such as side-channel measurements combined with statistical analysis, are effective for detecting hardware Trojans in fabricated integrated circuits (ICs). However, more sophisticated Trojans proposed recently evade these methods through stealthy triggers and very low side-channel signatures. To address these challenges, in this paper we propose an electromagnetic (EM) side-channel based post-fabrication trust evaluation framework that monitors EM radiation at runtime. The key component of the framework is an on-chip EM sensor which constantly measures and collects EM side-channel information from the target circuit. Simulation results validate the capability of the proposed framework in detecting stealthy hardware Trojans. Further, we fabricate an AES circuit protected by the proposed trust evaluation framework along with four different types of hardware Trojans. Measurements on the fabricated chips confirm two key findings. First, the on-chip EM sensor achieves a higher signal-to-noise ratio (SNR) and thus facilitates better Trojan detection accuracy. Second, the trust evaluation framework helps detect different hardware Trojans at runtime.
{"title":"Runtime Trust Evaluation and Hardware Trojan Detection Using On-Chip EM Sensors","authors":"Jiaji He, Xiaolong Guo, Haocheng Ma, Yanjiang Liu, Yiqiang Zhao, Yier Jin","doi":"10.1109/DAC18072.2020.9218514","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218514","url":null,"abstract":"It has been widely demonstrated that the utilization of postdeployment trust evaluation approaches, such as side-channel measurements, along with statistical analysis methods is effective for detecting hardware Trojans in fabricated integrated circuits (ICs). However, more sophisticated Trojans proposed recently invalidate these methods with stealthy triggers and very-low side-channel signatures. Upon these challenges, in this paper, we propose an electromagnetic (EM) side-channel based post-fabrication trust evaluation framework which monitors EM radiations at runtime. The key component of the runtime trust evaluation framework is an on-chip EM sensor which can constantly measure and collect EM side-channel information of the target circuit. The simulation results validate the capability of the proposed framework in detecting stealthy hardware Trojans. Further, we fabricate an AES circuit protected by the proposed trust evaluation framework along with four different types of hardware Trojans. The measurements on the fabricated chips prove two key findings. First, the on-chip EM sensor can achieve a higher signal to noise ratio (SNR) and thus facilitate a better Trojan detection accuracy. Second, the trust evaluation framework can help detect different hardware Trojans at runtime.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132251344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Access Characteristic Guided Partition for Read Performance Improvement on Solid State Drives
2020 57th ACM/IEEE Design Automation Conference (DAC) | Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218540
Yina Lv, Liang Shi, Qiao Li, C. Xue, E. Sha
Solid state drives (SSDs) are now widely deployed thanks to the development of high-density, low-cost NAND flash memories. Previous works have identified that the read performance of SSDs degrades as this development progresses. One of the most critical reasons is the access interference between reads and writes, as the latest NAND flash memories have a significant latency gap between reads and writes. This paper addresses the issue with access-characteristic-guided SSD partitioning. First, several server workloads are studied, and it is shown that reads and writes can be separated based on their access characteristics. Second, a set of techniques is proposed to place data judiciously for request separation. Finally, a workload-based SSD partitioning scheme is proposed to improve read performance. Experimental results show that the proposed solution improves read performance by 36% on average compared with state-of-the-art solutions.
{"title":"Access Characteristic Guided Partition for Read Performance Improvement on Solid State Drives","authors":"Yina Lv, Liang Shi, Qiao Li, C. Xue, E. Sha","doi":"10.1109/DAC18072.2020.9218540","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218540","url":null,"abstract":"Solid state drives (SSDs) are now widely deployed due to the development of high-density and low-cost NAND flash memories. Previous works have identified that the read performance of SSDs is degrading along with the development. One of the most critical reasons is the access interference between reads and writes, as the latest NAND flash memories have significant latency gap between reads and writes. This paper addresses this issue with the assistance of access characteristic guided SSD partitioning. First, several server workloads are studied and it is shown that reads and writes can be separated based on their access characteristics. Second, a set of techniques is proposed to place data judiciously for requests separation. Finally, a workload based SSD partitioning scheme is proposed to improve the read performance. The experimental results show that the proposed solution can improve read performance by 36% on average compared with the state-of-the-art solutions.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"289 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132350540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Time Multiplexing via Circuit Folding
2020 57th ACM/IEEE Design Automation Conference (DAC) | Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218552
Po-Chun Chien, J. H. Jiang
Time multiplexing is an important technique to overcome the bandwidth bottleneck of limited input-output pins in FPGAs. Most prior work tackles the problem from a physical design standpoint, minimizing the number of cut nets or the Time Division Multiplexing (TDM) ratio through circuit partitioning or routing. In this work, we formulate a new, orthogonal approach at the logic level that achieves time multiplexing through structural and functional circuit folding. The new formulation provides a smooth trade-off between bandwidth and throughput. Experiments show the effectiveness of the structural method and the improved optimality of the functional method in look-up-table and flip-flop usage.
{"title":"Time Multiplexing via Circuit Folding","authors":"Po-Chun Chien, J. H. Jiang","doi":"10.1109/DAC18072.2020.9218552","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218552","url":null,"abstract":"Time multiplexing is an important technique to overcome the bandwidth bottleneck of limited input-output pins in FPGAs. Most prior work tackles the problem from a physical design standpoint to minimize the number of cut nets or Time Division Multiplexing (TDM) ratio through circuit partitioning or routing. In this work, we formulate a new orthogonal approach at the logic level to achieve time multiplexing through structural and functional circuit folding. The new formulation provides a smooth trade-off between bandwidth and throughput. Experiments show the effectiveness of the structural method and improved optimality of the functional method on look-up-table and flip-flop usage.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116544787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Late Breaking Results: Automatic Adaptive MOM Capacitor Cell Generation for Analog and Mixed-Signal Layout Design
2020 57th ACM/IEEE Design Automation Conference (DAC) | Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218609
Tzu-Wei Wang, Po-Chang Wu, Mark Po-Hung Lin
This paper introduces the first problem formulation in the literature for automatic MOM capacitor cell generation with adaptive capacitance. Given an expected capacitance value and the available metal layers, the proposed capacitor cell generation method produces a compact MOM capacitor cell with minimized area and matched capacitance. Compared with the MOM capacitor cells with non-adaptive capacitance in previous work, experimental results show that the proposed adaptive MOM capacitor cell generation method reduces the chip area of the capacitor network in successive-approximation-register analog-to-digital converters (SAR ADCs) by 25% and its power consumption by 20%.
{"title":"Late Breaking Results: Automatic Adaptive MOM Capacitor Cell Generation for Analog and Mixed-Signal Layout Design","authors":"Tzu-Wei Wang, Po-Chang Wu, Mark Po-Hung Lin","doi":"10.1109/DAC18072.2020.9218609","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218609","url":null,"abstract":"This paper introduces the first problem formulation in the literature for automatic MOM capacitor cell generation with adaptive capacitance. Given an expected capacitance value and available metal layers, the proposed capacitor cell generation method can produce a compact MOM capacitor cell with minimized area and matched capacitance. Compared with MOM capacitor cells with non-adaptive capacitance in the previous work, the experimental results show that the proposed adaptive MOM capacitor cell generation method can reduce 25% chip area and 20% power consumption of the capacitor network in successive-approximation-register analog-to-digital converters (SAR ADC).","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117066493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Late Breaking Results: Building an On-Chip Deep Learning Memory Hierarchy Brick by Brick
2020 57th ACM/IEEE Design Automation Conference (DAC) | Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218728
Isak Edo Vivancos, Sayeh Sharify, M. Nikolic, Ciaran Bannon, M. Mahmoud, Alberto Delmas Lascorz, Andreas Moshovos
Data accesses between on- and off-chip memories account for a large fraction of the overall energy consumed during inference with deep learning networks. We present Boveda, a lossless on-chip memory compression technique for neural networks operating on fixed-point values. Boveda reduces the datawidth used per block of values to be only as long as necessary: since most values are of small magnitude, Boveda drastically reduces their footprint. Boveda can be used to increase the effective on-chip capacity, to reduce off-chip traffic, or to reduce the on-chip memory capacity needed to achieve a performance/energy target. Boveda reduces the total model footprint to 53%.
{"title":"Late Breaking Results: Building an On-Chip Deep Learning Memory Hierarchy Brick by Brick","authors":"Isak Edo Vivancos, Sayeh Sharify, M. Nikolic, Ciaran Bannon, M. Mahmoud, Alberto Delmas Lascorz, Andreas Moshovos","doi":"10.1109/DAC18072.2020.9218728","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218728","url":null,"abstract":"Data accesses between on- and off-chip memories account for a large fraction of overall energy consumption during inference with deep learning networks. We present Boveda, a lossless on-chip memory compression technique for neural networks operating on fixed-point values. Boveda reduces the datawidth used per block of values to be only as long as necessary: since most values are of small magnitude Boveda drastically reduces their footprint. Boveda can be used to increase the effective on-chip capacity, to reduce off-chip traffic, or to reduce the on-chip memory capacity needed to achieve a performance/energy target. Boveda reduces total model footprint to 53%.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116255501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accurate Inference with Inaccurate RRAM Devices: Statistical Data, Model Transfer, and On-line Adaptation
2020 57th ACM/IEEE Design Automation Conference (DAC) | Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218605
G. Charan, Jubin Hazra, K. Beckmann, Xiaocong Du, Gokul Krishnan, R. Joshi, N. Cady, Yu Cao
Resistive random-access memory (RRAM) is a promising technology for in-memory computing with high storage density, fast inference, and good compatibility with CMOS. However, mapping a pre-trained deep neural network (DNN) model onto RRAM suffers from realistic device issues, especially variation and quantization error, resulting in a significant reduction in inference accuracy. In this work, we first extract these statistical properties from 65 nm RRAM data measured on 300 mm wafers. The RRAM data exhibit 10 quantization levels and 50% variance, resulting in an accuracy drop to 31.76% and 10.49% on the MNIST and CIFAR-10 datasets, respectively. Based on the experimental data, we propose a combination of machine learning algorithms and on-line adaptation to recover the accuracy with minimal overhead. The recipe first applies Knowledge Distillation (KD) to transfer an ideal model into a student model that accounts for the statistical variations and the 10 levels. Furthermore, an on-line sparse adaptation (OSA) method is applied to the DNN model mapped onto the RRAM array. Using importance sampling, OSA adds a small SRAM array that is sparsely connected to the main RRAM array; only this SRAM array is updated to recover the accuracy. As demonstrated on the MNIST and CIFAR-10 datasets, a 7.86% area cost is sufficient to achieve baseline accuracy on the 65 nm RRAM devices.
{"title":"Accurate Inference with Inaccurate RRAM Devices: Statistical Data, Model Transfer, and On-line Adaptation","authors":"G. Charan, Jubin Hazra, K. Beckmann, Xiaocong Du, Gokul Krishnan, R. Joshi, N. Cady, Yu Cao","doi":"10.1109/DAC18072.2020.9218605","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218605","url":null,"abstract":"Resistive random-access memory (RRAM) is a promising technology for in-memory computing with high storage density, fast inference, and good compatibility with CMOS. However, the mapping of a pre-trained deep neural network (DNN) model on RRAM suffers from realistic device issues, especially the variation and quantization error, resulting in a significant reduction in inference accuracy. In this work, we first extract these statistical properties from 65 nm RRAM data on 300mm wafers. The RRAM data present 10-levels in quantization and 50% variance, resulting in an accuracy drop to 31.76% and 10.49% for MNIST and CIFAR-10 datasets, respectively. Based on the experimental data, we propose a combination of machine learning algorithms and on-line adaptation to recover the accuracy with the minimum overhead. The recipe first applies Knowledge Distillation (KD) to transfer an ideal model into a student model with statistical variations and 10 levels. Furthermore, an on-line sparse adaptation (OSA) method is applied to the DNN model mapped on to the RRAM array. Using importance sampling, OSA adds a small SRAM array that is sparsely connected to the main RRAM array; only this SRAM array is updated to recover the accuracy. As demonstrated on MNIST and CIFAR-10 datasets, a 7.86% area cost is sufficient to achieve baseline accuracy for the 65 nm RRAM devices.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123032498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PIM-Prune: Fine-Grain DCNN Pruning for Crossbar-Based Process-In-Memory Architecture
2020 57th ACM/IEEE Design Automation Conference (DAC) | Pub Date: 2020-07-01 | DOI: 10.1109/DAC18072.2020.9218523
Chaoqun Chu, Yanzhi Wang, Yilong Zhao, Xiaolong Ma, Shaokai Ye, Yunyan Hong, Xiaoyao Liang, Yinhe Han, Li Jiang
Deep Convolutional Neural Network (DCNN) pruning is an efficient way to reduce the resource and power consumption of a DCNN accelerator. Exploiting the sparsity in the weight matrices of DCNNs, however, is nontrivial when these DCNNs are deployed on a crossbar-based Process-In-Memory (PIM) architecture, because of the crossbar structure. Structural pruning, which exploits coarse-grained sparsity such as filter/channel-level pruning, can produce a compressed weight matrix that fits the crossbar structure, but this pruning method inevitably degrades the model accuracy. To solve this problem, we propose PIM-Prune to exploit finer-grained sparsity in the PIM architecture; the resulting compressed weight matrices significantly reduce the number of crossbars required with negligible accuracy loss. Further, we explore the design space of the crossbar, such as its size and aspect ratio, from the new perspective of resource-oriented pruning. We find a trade-off between the pruning algorithm and the hardware overhead: a PIM with smaller crossbars is friendlier to pruning methods, but the resulting peripheral circuits cause higher power consumption. Given a specific DCNN, we can therefore suggest a sweet-spot crossbar design for optimal overall energy efficiency. Experimental results show that the proposed pruning method applied to ResNet-18 achieves up to 24.85× and 3.56× higher compression rates of occupied crossbars on CIFAR-10 and ImageNet, respectively, with negligible accuracy loss, which is 4.56× and 1.99× better than state-of-the-art methods.
{"title":"PIM-Prune: Fine-Grain DCNN Pruning for Crossbar-Based Process-In-Memory Architecture","authors":"Chaoqun Chu, Yanzhi Wang, Yilong Zhao, Xiaolong Ma, Shaokai Ye, Yunyan Hong, Xiaoyao Liang, Yinhe Han, Li Jiang","doi":"10.1109/DAC18072.2020.9218523","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218523","url":null,"abstract":"Deep Convolution Neural network (DCNN) pruning is an efficient way to reduce the resource and power consumption in a DCNN accelerator. Exploiting the sparsity in the weight matrices of DCNNs, however, is nontrivial if we deploy these DC-NNs in a crossbar-based Process-In-Memory (PIM) architecture, because of the crossbar structure. Structural pruning-exploiting a coarse-grained sparsity, such as filter/channel-level pruning-can result in a compressed weight matrix that fits the crossbar structure. However, this pruning method inevitably degrades the model accuracy. To solve this problem, in this paper, we propose PIM-PRUNE to exploit the finer-grained sparsity in PIM-architecture, and the resulting compressed weight matrices can significantly reduce the demand of crossbars with negligible accuracy loss.Further, we explore the design space of the crossbar, such as the crossbar size and aspect-ratio, from a new point-of-view of resource-oriented pruning. We find a trade-off existing between the pruning algorithm and the hardware overhead: a PIM with smaller crossbars is more friendly for pruning methods; however, the resulting peripheral circuit cause higher power consumption. Given a specific DCNN, we can suggest a sweet-spot of crossbar design to the optimal overall energy efficiency. Experimental results show that the proposed pruning method applied on Resnet18 can achieve up to 24.85× and 3.56× higher compression rate of occupied crossbars on CifarlO and Imagenet, respectively; while the accuracy loss is negligible, which is 4.56× and 1.99× better than the state-of-art methods.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121853730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}