Fan Zhang, Li Yang, Jian Meng, Jae-sun Seo, Yu Cao, Deliang Fan
ReRAM crossbar array as a high-parallel fast and energy-efficient structure attracts much attention, especially on the acceleration of Deep Neural Network (DNN) inference on one specific task. However, due to the high energy consumption of weight re-programming and the ReRAM cells' low endurance problem, adapting the crossbar array for multiple tasks has not been well explored. In this paper, we propose XMA, a novel crossbar-aware shift-based mask learning method for multiple task adaption in the ReRAM crossbar DNN accelerator for the first time. XMA leverages the popular mask-based learning algorithm's benefit to mitigate catastrophic forgetting and learn a task-specific, crossbar column-wise, and shift-based multi-level mask, rather than the most commonly used element-wise binary mask, for each new task based on a frozen backbone model. With our crossbar-aware design innovation, the required masking operation to adapt for a new task could be implemented in an existing crossbar-based convolution engine with minimal hardware/memory overhead and, more importantly, no need for power-hungry cell re-programming, unlike prior works. The extensive experimental results show that, compared with state-of-the-art multiple task adaption Piggyback method [1], XMA achieves 3.19% higher accuracy on average, while saving 96.6% memory overhead. Moreover, by eliminating cell re-programming, XMA achieves ~4.3x higher energy efficiency than Piggyback.
{"title":"XMA: a crossbar-aware multi-task adaption framework via shift-based mask learning method","authors":"Fan Zhang, Li Yang, Jian Meng, Jae-sun Seo, Yu Cao, Deliang Fan","doi":"10.1145/3489517.3530458","DOIUrl":"https://doi.org/10.1145/3489517.3530458","url":null,"abstract":"ReRAM crossbar array as a high-parallel fast and energy-efficient structure attracts much attention, especially on the acceleration of Deep Neural Network (DNN) inference on one specific task. However, due to the high energy consumption of weight re-programming and the ReRAM cells' low endurance problem, adapting the crossbar array for multiple tasks has not been well explored. In this paper, we propose XMA, a novel crossbar-aware shift-based mask learning method for multiple task adaption in the ReRAM crossbar DNN accelerator for the first time. XMA leverages the popular mask-based learning algorithm's benefit to mitigate catastrophic forgetting and learn a task-specific, crossbar column-wise, and shift-based multi-level mask, rather than the most commonly used element-wise binary mask, for each new task based on a frozen backbone model. With our crossbar-aware design innovation, the required masking operation to adapt for a new task could be implemented in an existing crossbar-based convolution engine with minimal hardware/memory overhead and, more importantly, no need for power-hungry cell re-programming, unlike prior works. The extensive experimental results show that, compared with state-of-the-art multiple task adaption Piggyback method [1], XMA achieves 3.19% higher accuracy on average, while saving 96.6% memory overhead. Moreover, by eliminating cell re-programming, XMA achieves ~4.3x higher energy efficiency than Piggyback.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133595620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chen Chen, Rahul Kande, Pouya Mahmoody, A. Sadeghi, J. Rajendran
The rise in the development of complex and application-specific commercial and open-source hardware and the shrinking verification time are causing numerous hardware-security vulnerabilities. Traditional verification techniques are limited in both scalability and completeness. Research in this direction is hindered due to the lack of robust testing benchmarks. In this paper, in collaboration with our industry partners, we built an ecosystem mimicking the hardware-development cycle where we inject bugs inspired by real-world vulnerabilities into RISC-V SoC design and organized an open-to-all bug-hunting competition. We equipped the participating researchers with industry-standard static and dynamic verification tools in a ready-to-use environment. The findings from our competition shed light on the strengths and weaknesses of the existing verification tools and highlight the potential for future research in developing new vulnerability detection techniques.
{"title":"Trusting the trust anchor: towards detecting cross-layer vulnerabilities with hardware fuzzing","authors":"Chen Chen, Rahul Kande, Pouya Mahmoody, A. Sadeghi, J. Rajendran","doi":"10.1145/3489517.3530638","DOIUrl":"https://doi.org/10.1145/3489517.3530638","url":null,"abstract":"The rise in the development of complex and application-specific commercial and open-source hardware and the shrinking verification time are causing numerous hardware-security vulnerabilities. Traditional verification techniques are limited in both scalability and completeness. Research in this direction is hindered due to the lack of robust testing benchmarks. In this paper, in collaboration with our industry partners, we built an ecosystem mimicking the hardware-development cycle where we inject bugs inspired by real-world vulnerabilities into RISC-V SoC design and organized an open-to-all bug-hunting competition. We equipped the participating researchers with industry-standard static and dynamic verification tools in a ready-to-use environment. The findings from our competition shed light on the strengths and weaknesses of the existing verification tools and highlight the potential for future research in developing new vulnerability detection techniques.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134373394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Resistive random access memory (ReRAM) has demonstrated great promises of in-situ matrix-vector multiplications to accelerate deep neural networks. However, subject to the intrinsic properties of analog processing, most of the proposed ReRAM-based accelerators require excessive costly ADC/DAC to avoid distortion of electronic analog signals during inter-tile transmission. Moreover, due to bit-shifting before addition, prior works require longer cycles to serially calculate partial sum compared to multiplications, which dramatically restricts the throughput and is more likely to stall the pipeline between layers of deep neural networks. In this paper, we present a novel ReRAM-based photonic accelerator (PHANES) architecture, which calculates multiplications in ReRAM and parallel weighted accumulations during optical transmission. Such photonic paradigm also serves as high-fidelity analog-analog links to further reduce ADC/DAC. To circumvent the memory wall problem, we further propose a progressive bit-depth technique. Evaluations show that PHANES improves the energy efficiency by 6.09x and throughput density by 14.7x compared to state-of-the-art designs. Our photonic architecture also has great potentials for scalability towards very-large-scale accelerators.
{"title":"PHANES: ReRAM-based photonic accelerator for deep neural networks","authors":"Yinyi Liu, Jiaqi Liu, Yuxiang Fu, Shixi Chen, Jiaxu Zhang, Jiang Xu","doi":"10.1145/3489517.3530397","DOIUrl":"https://doi.org/10.1145/3489517.3530397","url":null,"abstract":"Resistive random access memory (ReRAM) has demonstrated great promises of in-situ matrix-vector multiplications to accelerate deep neural networks. However, subject to the intrinsic properties of analog processing, most of the proposed ReRAM-based accelerators require excessive costly ADC/DAC to avoid distortion of electronic analog signals during inter-tile transmission. Moreover, due to bit-shifting before addition, prior works require longer cycles to serially calculate partial sum compared to multiplications, which dramatically restricts the throughput and is more likely to stall the pipeline between layers of deep neural networks. In this paper, we present a novel ReRAM-based photonic accelerator (PHANES) architecture, which calculates multiplications in ReRAM and parallel weighted accumulations during optical transmission. Such photonic paradigm also serves as high-fidelity analog-analog links to further reduce ADC/DAC. To circumvent the memory wall problem, we further propose a progressive bit-depth technique. Evaluations show that PHANES improves the energy efficiency by 6.09x and throughput density by 14.7x compared to state-of-the-art designs. Our photonic architecture also has great potentials for scalability towards very-large-scale accelerators.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133378070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most of existing Neural Network (NN)-based barrier certificate synthesis methods cannot deal with high-dimensional continuous systems, since a large quantity of sampled data may easily result in inaccurate initial models coupled with slow convergence rate. To accelerate the synthesis of NN-based barrier certificates, this paper presents an effective two-stage approach named CL-BC, which fully exploits the parallel processing capability of underlying hardware to enable quick search for a barrier certificate. Unlike existing NN-based methods that adopt a random initial model for barrier certificate synthesis, in the first stage CL-BC pre-trains an initial model based on a small subset of sampling data. In this way, an approximate barrier certificate in an NN form can be quickly achieved with little overhead. Based on our proposed collaborative learning scheme, in the second stage CL-BC conducts the parallel learning on partitioned domains, where the learned knowledge from different partitions can be aggregated to accelerate the convergence of a global NN model for barrier certificate synthesis. In this way, the overall synthesis time of an NN-based barrier certificate can be drastically reduced. Experimental results show that our approach can not only drastically reduce barrier synthesis time, but also can synthesize barrier certificates for complex systems that cannot be handled by state-of-the-art.
{"title":"Accelerated synthesis of neural network-based barrier certificates using collaborative learning","authors":"Jun Xia, Ming Hu, Xin Chen, Mingsong Chen","doi":"10.1145/3489517.3530608","DOIUrl":"https://doi.org/10.1145/3489517.3530608","url":null,"abstract":"Most of existing Neural Network (NN)-based barrier certificate synthesis methods cannot deal with high-dimensional continuous systems, since a large quantity of sampled data may easily result in inaccurate initial models coupled with slow convergence rate. To accelerate the synthesis of NN-based barrier certificates, this paper presents an effective two-stage approach named CL-BC, which fully exploits the parallel processing capability of underlying hardware to enable quick search for a barrier certificate. Unlike existing NN-based methods that adopt a random initial model for barrier certificate synthesis, in the first stage CL-BC pre-trains an initial model based on a small subset of sampling data. In this way, an approximate barrier certificate in an NN form can be quickly achieved with little overhead. Based on our proposed collaborative learning scheme, in the second stage CL-BC conducts the parallel learning on partitioned domains, where the learned knowledge from different partitions can be aggregated to accelerate the convergence of a global NN model for barrier certificate synthesis. In this way, the overall synthesis time of an NN-based barrier certificate can be drastically reduced. Experimental results show that our approach can not only drastically reduce barrier synthesis time, but also can synthesize barrier certificates for complex systems that cannot be handled by state-of-the-art.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134475535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Pal, Yi-Hsiang Lai, Shaojie Xiang, Niansong Zhang, Hongzheng Chen, Jeremy Casas, P. Cocchini, Zhenkun Yang, Jin Yang, L. Pouchet, Zhiru Zhang
The past decade has witnessed increasing adoption of high-level synthesis (HLS) to implement specialized hardware accelerators targeting either FPGAs or ASICs. However, current HLS programming models entangle algorithm specifications with hardware customization techniques, which lowers both the productivity and portability of the accelerator design. To tackle this problem, recent efforts such as HeteroCL propose to decouple algorithm definition from essential hardware customization techniques in compute, data type, and memory, increasing productivity, portability, and performance. While the decoupling of the algorithm and customizations provides benefits to the compilation/synthesis process, they also create new hurdles for the programmers to productively debug and validate the correctness of the optimized design. In this work, using HeteroCL and realistic machine learning applications as case studies, we first explain the key advantages of the decoupled programming model brought to a programmer to rapidly develop high-performance accelerators. Using the same case studies, we will further show how seemingly benign usage of the customization primitives can lead to new challenges to verification. We will then outline the research opportunities and discuss some of our recent efforts as the first step to enable a robust and viable verification solution in the future.
{"title":"Accelerator design with decoupled hardware customizations: benefits and challenges: invited","authors":"D. Pal, Yi-Hsiang Lai, Shaojie Xiang, Niansong Zhang, Hongzheng Chen, Jeremy Casas, P. Cocchini, Zhenkun Yang, Jin Yang, L. Pouchet, Zhiru Zhang","doi":"10.1145/3489517.3530681","DOIUrl":"https://doi.org/10.1145/3489517.3530681","url":null,"abstract":"The past decade has witnessed increasing adoption of high-level synthesis (HLS) to implement specialized hardware accelerators targeting either FPGAs or ASICs. However, current HLS programming models entangle algorithm specifications with hardware customization techniques, which lowers both the productivity and portability of the accelerator design. To tackle this problem, recent efforts such as HeteroCL propose to decouple algorithm definition from essential hardware customization techniques in compute, data type, and memory, increasing productivity, portability, and performance. While the decoupling of the algorithm and customizations provides benefits to the compilation/synthesis process, they also create new hurdles for the programmers to productively debug and validate the correctness of the optimized design. In this work, using HeteroCL and realistic machine learning applications as case studies, we first explain the key advantages of the decoupled programming model brought to a programmer to rapidly develop high-performance accelerators. Using the same case studies, we will further show how seemingly benign usage of the customization primitives can lead to new challenges to verification. We will then outline the research opportunities and discuss some of our recent efforts as the first step to enable a robust and viable verification solution in the future.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133366447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Timing closure is a critical but effort-taking task in VLSI designs. In placement stage, a fast and accurate net delay estimator is highly desirable to guide the timing optimization prior to routing, and thus reduce the timing pessimism and shorten the design turn-around time. To handle the timing uncertainty at the placement stage, we propose a fast net delay timing predictor based on machine learning, which extract the fully timing features using a look-ahead RC network. Experimental results show that the proposed timing predictor has achieved average correlation over 0.99 with the post-routing sign-off timing results obtained in Synopsys PrimeTime.
{"title":"Accurate timing prediction at placement stage with look-ahead RC network","authors":"Xu He, Zhiyong Fu, Yao Wang, Chang Liu, Yang Guo","doi":"10.1145/3489517.3530598","DOIUrl":"https://doi.org/10.1145/3489517.3530598","url":null,"abstract":"Timing closure is a critical but effort-taking task in VLSI designs. In placement stage, a fast and accurate net delay estimator is highly desirable to guide the timing optimization prior to routing, and thus reduce the timing pessimism and shorten the design turn-around time. To handle the timing uncertainty at the placement stage, we propose a fast net delay timing predictor based on machine learning, which extract the fully timing features using a look-ahead RC network. Experimental results show that the proposed timing predictor has achieved average correlation over 0.99 with the post-routing sign-off timing results obtained in Synopsys PrimeTime.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133849391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The optical interconnection is a promising solution for on-chip signal communication in modern system-on-chip (SoC) and heterogeneous integration designs, providing large bandwidth and high-speed transmission with low power consumption. Previous works do not handle two main issues for on-chip optical-electrical (O-E) co-design: the thermal impact during O-E routing and the trade-offs among power consumption, wirelength, and congestion. As a result, the thermal-induced band shift might incur transmission malfunction; the power consumption estimation is inaccurate; thus, only suboptimal results are obtained. To remedy these disadvantages, we present a thermal-aware optical-electrical routing co-design flow to minimize power consumption, thermal impact, and wirelength. Experimental results based on the ISPD 2019 contest benchmarks show that our co-design flow significantly outperforms state-of-the-art works in power consumption, thermal impact, and wire-length.
{"title":"Thermal-aware optical-electrical routing codesign for on-chip signal communications","authors":"Yu-Sheng Lu, Kuan-Cheng Chen, Yu-Ling Hsu, Yao-Wen Chang","doi":"10.1145/3489517.3530404","DOIUrl":"https://doi.org/10.1145/3489517.3530404","url":null,"abstract":"The optical interconnection is a promising solution for on-chip signal communication in modern system-on-chip (SoC) and heterogeneous integration designs, providing large bandwidth and high-speed transmission with low power consumption. Previous works do not handle two main issues for on-chip optical-electrical (O-E) co-design: the thermal impact during O-E routing and the trade-offs among power consumption, wirelength, and congestion. As a result, the thermal-induced band shift might incur transmission malfunction; the power consumption estimation is inaccurate; thus, only suboptimal results are obtained. To remedy these disadvantages, we present a thermal-aware optical-electrical routing co-design flow to minimize power consumption, thermal impact, and wirelength. Experimental results based on the ISPD 2019 contest benchmarks show that our co-design flow significantly outperforms state-of-the-art works in power consumption, thermal impact, and wire-length.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"50 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114032203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep Neural Network (DNN) accelerators are increasingly developed to pursue high efficiency in DNN computing. However, the IP protection of the DNNs deployed on such accelerators is an important topic that has been less addressed. Although there are previous works that targeted this problem for CMOS-based designs, there is still no solution for ReRAM-based accelerators which pose new security challenges due to their crossbar structure and non-volatility. ReRAM's non-volatility retains data even after the system is powered off, making the stored DNN model vulnerable to attacks by simply reading out the ReRAM content. Because the crossbar structure can only compute on plaintext data, encrypting the ReRAM content is no longer a feasible solution in this scenario. In this paper, we propose SRA - a secure ReRAM-based DNN accelerator that stores DNN weights on crossbars in an encrypted format while still maintaining ReRAM's in-memory computing capability. The proposed encryption scheme also supports sharing bits among multiple weights, significantly reducing the storage overhead. In addition, SRA uses a novel high-bandwidth SC conversion scheme to protect each layer's intermediate results, which also contain sensitive information of the model. Our experimental results show that SRA can effectively prevent pirating the deployed DNN weights as well as the intermediate results with negligible accuracy loss, and achieves 1.14X performance speedup and 9% energy reduction compared to ISAAC - a non-secure ReRAM-based baseline.
{"title":"SRA: a secure ReRAM-based DNN accelerator","authors":"Lei Zhao, Youtao Zhang, Jun Yang","doi":"10.1145/3489517.3530440","DOIUrl":"https://doi.org/10.1145/3489517.3530440","url":null,"abstract":"Deep Neural Network (DNN) accelerators are increasingly developed to pursue high efficiency in DNN computing. However, the IP protection of the DNNs deployed on such accelerators is an important topic that has been less addressed. Although there are previous works that targeted this problem for CMOS-based designs, there is still no solution for ReRAM-based accelerators which pose new security challenges due to their crossbar structure and non-volatility. ReRAM's non-volatility retains data even after the system is powered off, making the stored DNN model vulnerable to attacks by simply reading out the ReRAM content. Because the crossbar structure can only compute on plaintext data, encrypting the ReRAM content is no longer a feasible solution in this scenario. In this paper, we propose SRA - a secure ReRAM-based DNN accelerator that stores DNN weights on crossbars in an encrypted format while still maintaining ReRAM's in-memory computing capability. The proposed encryption scheme also supports sharing bits among multiple weights, significantly reducing the storage overhead. In addition, SRA uses a novel high-bandwidth SC conversion scheme to protect each layer's intermediate results, which also contain sensitive information of the model. Our experimental results show that SRA can effectively prevent pirating the deployed DNN weights as well as the intermediate results with negligible accuracy loss, and achieves 1.14X performance speedup and 9% energy reduction compared to ISAAC - a non-secure ReRAM-based baseline.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116964780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
One main fault-tolerant method for a neural network accelerator based on resistive random access memory crossbars is the programming-based method, which is also known as write-and-verify (W-V). In the basic W-V scheme, all devices in crossbars are programmed repeatedly until they are close enough to their targets, which costs huge overhead. To reduce the cost, we optimize the W-V scheme by proposing a probabilistic termination criterion on a single device and a systematic optimization method on multiple devices. Furthermore, we propose a joint algorithm that assists the novel W-V scheme by incremental retraining, which further reduces the W-V cost. Compared to the basic W-V scheme, our proposed method improves the accuracy by 0.23% for ResNet18 on CIFAR10 with only 9.7% W-V cost under variation with σ = 1.2.
{"title":"Write or not: programming scheme optimization for RRAM-based neuromorphic computing","authors":"Ziqi Meng, Yanan Sun, Weikang Qian","doi":"10.1145/3489517.3530558","DOIUrl":"https://doi.org/10.1145/3489517.3530558","url":null,"abstract":"One main fault-tolerant method for a neural network accelerator based on resistive random access memory crossbars is the programming-based method, which is also known as write-and-verify (W-V). In the basic W-V scheme, all devices in crossbars are programmed repeatedly until they are close enough to their targets, which costs huge overhead. To reduce the cost, we optimize the W-V scheme by proposing a probabilistic termination criterion on a single device and a systematic optimization method on multiple devices. Furthermore, we propose a joint algorithm that assists the novel W-V scheme by incremental retraining, which further reduces the W-V cost. Compared to the basic W-V scheme, our proposed method improves the accuracy by 0.23% for ResNet18 on CIFAR10 with only 9.7% W-V cost under variation with σ = 1.2.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114768247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nicolas Bohm Agostini, S. Curzel, Ankur Limaye, Vinay C. Amatya, Marco Minutoli, Vito Giovanni Castellana, J. Manzano, Antonino Tumeo, Fabrizio Ferrandi
Novel "converged" applications combine phases of scientific simulation with data analysis and machine learning. Each computational phase can benefit from specialized accelerators. However, algorithms evolve so quickly that mapping them on existing accelerators is suboptimal or even impossible. This paper presents the SODA (Software Defined Accelerators) framework, a modular, multi-level, open-source, no-human-in-the-loop, hardware synthesizer that enables end-to-end generation of specialized accelerators. SODA is composed of SODA-Opt, a high-level frontend developed in MLIR that interfaces with domain-specific programming frameworks and allows performing system level design, and Bambu, a state-of-the-art high-level synthesis engine that can target different device technologies. The framework implements design space exploration as compiler optimization passes. We show how the modular, yet tight, integration of the high-level optimizer and lower-level HLS tools enables the generation of accelerators optimized for the computational patterns of converged applications. We then discuss some of the research opportunities that such a framework allows, including system-level design, profile driven optimization, and supporting new optimization metrics.
{"title":"The SODA approach: leveraging high-level synthesis for hardware/software co-design and hardware specialization: invited","authors":"Nicolas Bohm Agostini, S. Curzel, Ankur Limaye, Vinay C. Amatya, Marco Minutoli, Vito Giovanni Castellana, J. Manzano, Antonino Tumeo, Fabrizio Ferrandi","doi":"10.1145/3489517.3530628","DOIUrl":"https://doi.org/10.1145/3489517.3530628","url":null,"abstract":"Novel \"converged\" applications combine phases of scientific simulation with data analysis and machine learning. Each computational phase can benefit from specialized accelerators. However, algorithms evolve so quickly that mapping them on existing accelerators is suboptimal or even impossible. This paper presents the SODA (Software Defined Accelerators) framework, a modular, multi-level, open-source, no-human-in-the-loop, hardware synthesizer that enables end-to-end generation of specialized accelerators. SODA is composed of SODA-Opt, a high-level frontend developed in MLIR that interfaces with domain-specific programming frameworks and allows performing system level design, and Bambu, a state-of-the-art high-level synthesis engine that can target different device technologies. The framework implements design space exploration as compiler optimization passes. We show how the modular, yet tight, integration of the high-level optimizer and lower-level HLS tools enables the generation of accelerators optimized for the computational patterns of converged applications. We then discuss some of the research opportunities that such a framework allows, including system-level design, profile driven optimization, and supporting new optimization metrics.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115087169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}