
Latest publications: Proceedings of the 59th ACM/IEEE Design Automation Conference

XMA: a crossbar-aware multi-task adaption framework via shift-based mask learning method
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530458
Fan Zhang, Li Yang, Jian Meng, Jae-sun Seo, Yu Cao, Deliang Fan
The ReRAM crossbar array, as a highly parallel, fast, and energy-efficient structure, has attracted much attention, especially for accelerating Deep Neural Network (DNN) inference on a single task. However, due to the high energy consumption of weight re-programming and the low endurance of ReRAM cells, adapting the crossbar array to multiple tasks has not been well explored. In this paper, we propose XMA, the first crossbar-aware, shift-based mask learning method for multi-task adaption in ReRAM crossbar DNN accelerators. XMA leverages the benefits of popular mask-based learning algorithms to mitigate catastrophic forgetting, learning for each new task a task-specific, crossbar column-wise, shift-based multi-level mask on top of a frozen backbone model, rather than the commonly used element-wise binary mask. With our crossbar-aware design innovation, the masking operation required to adapt to a new task can be implemented in an existing crossbar-based convolution engine with minimal hardware/memory overhead and, more importantly, without the power-hungry cell re-programming that prior works require. Extensive experimental results show that, compared with the state-of-the-art multi-task adaption method Piggyback [1], XMA achieves 3.19% higher accuracy on average while saving 96.6% memory overhead. Moreover, by eliminating cell re-programming, XMA achieves ~4.3x higher energy efficiency than Piggyback.
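To make the masking scheme concrete, below is a minimal NumPy sketch contrasting an element-wise binary mask (Piggyback-style) with a column-wise, shift-based multi-level mask of the kind the abstract describes. The crossbar layout (one output channel per column), the 2-bit shift range, and the noise-free arithmetic are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen backbone weights; assume each crossbar column holds one output
# channel (an illustrative layout, not necessarily the paper's mapping).
W = rng.standard_normal((64, 32)).astype(np.float32)   # inputs x columns
x = rng.standard_normal(64).astype(np.float32)

# Element-wise binary mask (Piggyback-style): one stored bit per weight.
binary_mask = rng.integers(0, 2, size=W.shape).astype(np.float32)
y_binary = x @ (W * binary_mask)

# Column-wise, shift-based multi-level mask (XMA-style): one small integer
# per crossbar column, applied as a power-of-two scale -- i.e. a bit-shift
# in the digital periphery, so the ReRAM cells are never re-programmed.
shifts = rng.integers(-2, 2, size=W.shape[1])           # assumed 2-bit levels
y_shift = (x @ W) * np.exp2(shifts).astype(np.float32)  # analog MVM, then shift

# Per-task mask storage: one bit per weight vs. a few bits per column.
print(f"binary mask: {W.size} bits, column mask: {2 * W.shape[1]} bits")
```

The storage comparison at the end is where the memory-overhead savings come from: the per-task state shrinks from one bit per weight to a few bits per crossbar column.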
Citations: 1
Trusting the trust anchor: towards detecting cross-layer vulnerabilities with hardware fuzzing
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530638
Chen Chen, Rahul Kande, Pouya Mahmoody, A. Sadeghi, J. Rajendran
The growing development of complex, application-specific commercial and open-source hardware, combined with shrinking verification time, is causing numerous hardware-security vulnerabilities. Traditional verification techniques are limited in both scalability and completeness, and research in this direction is hindered by the lack of robust testing benchmarks. In this paper, in collaboration with our industry partners, we built an ecosystem mimicking the hardware-development cycle, injected bugs inspired by real-world vulnerabilities into a RISC-V SoC design, and organized an open-to-all bug-hunting competition. We equipped the participating researchers with industry-standard static and dynamic verification tools in a ready-to-use environment. The findings from our competition shed light on the strengths and weaknesses of existing verification tools and highlight the potential for future research into new vulnerability detection techniques.
Citations: 1
PHANES: ReRAM-based photonic accelerator for deep neural networks
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530397
Yinyi Liu, Jiaqi Liu, Yuxiang Fu, Shixi Chen, Jiaxu Zhang, Jiang Xu
Resistive random access memory (ReRAM) has demonstrated great promise for in-situ matrix-vector multiplication to accelerate deep neural networks. However, owing to the intrinsic properties of analog processing, most proposed ReRAM-based accelerators require excessive, costly ADCs/DACs to avoid distortion of electronic analog signals during inter-tile transmission. Moreover, because of bit-shifting before addition, prior works require longer cycles to serially calculate partial sums than to perform the multiplications, which dramatically restricts throughput and is likely to stall the pipeline between layers of deep neural networks. In this paper, we present PHANES, a novel ReRAM-based photonic accelerator architecture that calculates multiplications in ReRAM and performs parallel weighted accumulations during optical transmission. This photonic paradigm also serves as a high-fidelity analog-analog link, further reducing ADC/DAC requirements. To circumvent the memory wall problem, we further propose a progressive bit-depth technique. Evaluations show that PHANES improves energy efficiency by 6.09x and throughput density by 14.7x compared to state-of-the-art designs. Our photonic architecture also has great potential to scale toward very-large-scale accelerators.
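To see why serial partial-sum reduction stretches the pipeline, the sketch below splits 8-bit weights into 2-bit crossbar slices and compares the cycle-by-cycle shift-and-add reduction with the single weighted accumulation that, per the abstract, PHANES folds into the optical transmission. The slicing granularity and integer ranges are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# An 8-bit weight matrix split into four 2-bit slices, each slice mapped
# to its own crossbar (the slicing scheme is an assumption for illustration).
W = rng.integers(0, 256, size=(16, 8))
slices = [(W >> (2 * s)) & 0b11 for s in range(4)]
x = rng.integers(0, 16, size=16)

# Per-slice analog MVMs: the cheap, fully parallel part.
partials = [x @ s for s in slices]

# Conventional digital reduction: shift-and-add the partial sums serially,
# one slice per cycle -- this is what limits throughput between layers.
acc = np.zeros(8, dtype=np.int64)
for s, p in enumerate(partials):
    acc += p.astype(np.int64) << (2 * s)

# The same reduction expressed as one weighted accumulation, the operation
# PHANES performs in parallel on the optical link instead of serial cycles.
scales = np.array([1 << (2 * s) for s in range(4)], dtype=np.int64)
acc_parallel = np.tensordot(scales, np.stack(partials), axes=1)

assert np.array_equal(acc, acc_parallel)
assert np.array_equal(acc, x @ W)   # both recover the full-precision MVM
print(acc)
```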
Citations: 0
Thermal-aware optical-electrical routing codesign for on-chip signal communications
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530404
Yu-Sheng Lu, Kuan-Cheng Chen, Yu-Ling Hsu, Yao-Wen Chang
The optical interconnection is a promising solution for on-chip signal communication in modern system-on-chip (SoC) and heterogeneous integration designs, providing large bandwidth and high-speed transmission with low power consumption. Previous works do not handle two main issues for on-chip optical-electrical (O-E) co-design: the thermal impact during O-E routing and the trade-offs among power consumption, wirelength, and congestion. As a result, the thermal-induced band shift might cause transmission malfunctions; the power consumption estimation is inaccurate; thus, only suboptimal results are obtained. To remedy these disadvantages, we present a thermal-aware optical-electrical routing co-design flow to minimize power consumption, thermal impact, and wirelength. Experimental results based on the ISPD 2019 contest benchmarks show that our co-design flow significantly outperforms state-of-the-art works in power consumption, thermal impact, and wirelength.
Citations: 1
The SODA approach: leveraging high-level synthesis for hardware/software co-design and hardware specialization: invited
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530628
Nicolas Bohm Agostini, S. Curzel, Ankur Limaye, Vinay C. Amatya, Marco Minutoli, Vito Giovanni Castellana, J. Manzano, Antonino Tumeo, Fabrizio Ferrandi
Novel "converged" applications combine phases of scientific simulation with data analysis and machine learning. Each computational phase can benefit from specialized accelerators. However, algorithms evolve so quickly that mapping them on existing accelerators is suboptimal or even impossible. This paper presents the SODA (Software Defined Accelerators) framework, a modular, multi-level, open-source, no-human-in-the-loop, hardware synthesizer that enables end-to-end generation of specialized accelerators. SODA is composed of SODA-Opt, a high-level frontend developed in MLIR that interfaces with domain-specific programming frameworks and allows performing system level design, and Bambu, a state-of-the-art high-level synthesis engine that can target different device technologies. The framework implements design space exploration as compiler optimization passes. We show how the modular, yet tight, integration of the high-level optimizer and lower-level HLS tools enables the generation of accelerators optimized for the computational patterns of converged applications. We then discuss some of the research opportunities that such a framework allows, including system-level design, profile driven optimization, and supporting new optimization metrics.
Citations: 0
Write or not: programming scheme optimization for RRAM-based neuromorphic computing
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530558
Ziqi Meng, Yanan Sun, Weikang Qian
One main fault-tolerance method for neural network accelerators based on resistive random access memory crossbars is the programming-based method, also known as write-and-verify (W-V). In the basic W-V scheme, all devices in the crossbars are programmed repeatedly until they are close enough to their targets, which incurs huge overhead. To reduce this cost, we optimize the W-V scheme by proposing a probabilistic termination criterion for a single device and a systematic optimization method across multiple devices. Furthermore, we propose a joint algorithm that assists the novel W-V scheme with incremental retraining, which further reduces the W-V cost. Compared to the basic W-V scheme, our proposed method improves accuracy by 0.23% for ResNet18 on CIFAR10 with only 9.7% of the W-V cost under variation with σ = 1.2.
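The basic W-V loop and a confidence-based termination are easy to sketch. In the toy model below, a pulse moves the device conductance toward its target with Gaussian programming noise; the basic scheme keeps pulsing until a hard tolerance check passes, while the probabilistic variant stops as soon as a noisy read would land within tolerance with probability at least p_ok. The abstract does not give the paper's exact criterion or device model, so the pulse dynamics, noise levels, and stopping rule here are all assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def pulse(g, target, s_prog=0.08):
    """One programming pulse: moves conductance toward the target with
    Gaussian programming noise (a simple assumed device model)."""
    return g + 0.5 * (target - g) + rng.normal(0.0, s_prog)

def wv_basic(target, tol=0.05, max_pulses=50):
    """Basic write-and-verify: pulse until the read-back error is in tolerance."""
    g, n = 0.0, 0
    while abs(g - target) > tol and n < max_pulses:
        g, n = pulse(g, target), n + 1
    return g, n

def wv_probabilistic(target, tol=0.05, s_read=0.03, p_ok=0.95, max_pulses=50):
    """Sketch of a probabilistic termination criterion: stop once the
    probability that a noisy read falls within tolerance reaches p_ok."""
    g, n = 0.0, 0
    while n < max_pulses:
        g, n = pulse(g, target), n + 1
        err = g - target
        p_within = norm.cdf((tol - err) / s_read) - norm.cdf((-tol - err) / s_read)
        if p_within >= p_ok:
            break
    return g, n

print(wv_basic(0.7), wv_probabilistic(0.7))
```

Relaxing the hard verify condition into a confidence threshold is what lets the pulse count, and hence the W-V energy, drop on devices whose residual error no longer matters.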
Citations: 3
SRA: a secure ReRAM-based DNN accelerator
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530440
Lei Zhao, Youtao Zhang, Jun Yang
Deep Neural Network (DNN) accelerators are increasingly developed to pursue high efficiency in DNN computing. However, IP protection of the DNNs deployed on such accelerators is an important topic that has received little attention. Although previous works have targeted this problem for CMOS-based designs, there is still no solution for ReRAM-based accelerators, which pose new security challenges due to their crossbar structure and non-volatility. ReRAM's non-volatility retains data even after the system is powered off, making the stored DNN model vulnerable to attacks that simply read out the ReRAM content. Because the crossbar structure can only compute on plaintext data, encrypting the ReRAM content is not a feasible solution in this scenario. In this paper, we propose SRA, a secure ReRAM-based DNN accelerator that stores DNN weights on crossbars in an encrypted format while still maintaining ReRAM's in-memory computing capability. The proposed encryption scheme also supports sharing bits among multiple weights, significantly reducing the storage overhead. In addition, SRA uses a novel high-bandwidth SC conversion scheme to protect each layer's intermediate results, which also contain sensitive information about the model. Our experimental results show that SRA effectively prevents pirating of the deployed DNN weights as well as the intermediate results with negligible accuracy loss, and achieves a 1.14x speedup and 9% energy reduction compared to ISAAC, a non-secure ReRAM-based baseline.
Citations: 1
Pref-X: a framework to reveal data prefetching in commercial in-order cores
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530569
Quentin Huppert, F. Catthoor, L. Torres, D. Novo
Computer system simulators are major tools used by architecture researchers to develop and evaluate new ideas. Such evaluations are more conclusive when they can be compared against state-of-the-art commercial architectures. However, the behavior of key components in existing processors is often not disclosed, complicating the construction of faithful reference models. The data prefetching engine is one such obscured component, and it can have a significant impact on key metrics such as performance and energy. In this paper, we propose Pref-X, a framework to analyze the functional characteristics of data prefetching in commercial in-order cores. Our framework reveals data prefetches by X-raying into the cache memory at request granularity, which allows linking memory access patterns with changes in cache content. To demonstrate the power and accuracy of our methodology, we use Pref-X to replicate the data prefetching mechanisms of two representative processors, the Arm Cortex-A7 and the Arm Cortex-A53, with 99.8% and 96.9% average accuracy, respectively.
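A toy version of the request-granularity "X-ray" helps fix the idea: issue one demand access at a time, diff the cache contents before and after, and attribute any newly present non-demand line to the prefetcher. The stand-in stride prefetcher below replaces the undisclosed hardware engine, and the fully associative, never-evicting cache is a deliberate simplification.

```python
LINE = 64  # cache-line size in bytes (assumed)

class ToyCache:
    """Fully associative cache with a stand-in stride prefetcher; the real
    Pref-X probes actual hardware rather than a model like this."""
    def __init__(self):
        self.lines = set()
        self.last = None

    def access(self, addr):
        self.lines.add(addr // LINE)
        if self.last is not None:            # next-address stride prefetch
            stride = addr - self.last
            self.lines.add((addr + stride) // LINE)
        self.last = addr

def reveal_prefetches(pattern):
    """Diff cache contents around each demand access: every newly present
    line that is not the demand line must have been prefetched."""
    cache, revealed = ToyCache(), []
    for addr in pattern:
        before = set(cache.lines)
        cache.access(addr)
        revealed += sorted(cache.lines - before - {addr // LINE})
    return revealed

print(reveal_prefetches([0, 64, 128, 192]))  # -> [2, 3, 4]: prefetched lines
```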
Citations: 0
iMARS
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530478
Mengyuan Li, Ann Franchesca Laguna, D. Reis, Xunzhao Yin, M. Niemier, X. S. Hu
Recommendation systems (RecSys) suggest items to users by predicting their preferences from historical data. Typical RecSys handle large embedding tables and many embedding-table-related operations, and the memory size and bandwidth of conventional computer architectures restrict their performance. This work proposes iMARS, an in-memory-computing (IMC) architecture for accelerating the filtering and ranking stages of deep-neural-network-based RecSys. iMARS leverages IMC-friendly embedding tables implemented inside a ferroelectric-FET-based IMC fabric. Circuit-level and system-level evaluations show that iMARS achieves a 16.8x (713x) end-to-end latency (energy) improvement over the GPU counterpart on the MovieLens dataset.
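The kernel iMARS keeps inside the IMC fabric is the embedding gather-and-reduce; a plain NumPy sketch of that memory-bound operation is below. Table size, embedding dimension, and sum pooling are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy embedding table: in iMARS this lives inside the FeFET-based IMC
# fabric; here it is ordinary host memory (sizes are assumptions).
NUM_ITEMS, DIM = 10_000, 16
table = rng.standard_normal((NUM_ITEMS, DIM)).astype(np.float32)

def embedding_bag(indices):
    """Gather rows for one multi-hot sparse feature and reduce them --
    the bandwidth-bound operation that limits RecSys on conventional memory."""
    return table[indices].sum(axis=0)

# One ranking request touches a handful of sparse features.
features = [rng.integers(0, NUM_ITEMS, size=rng.integers(2, 8)) for _ in range(4)]
pooled = np.stack([embedding_bag(idx) for idx in features])
print(pooled.shape)  # (4, 16): dense vectors fed to the ranking network
```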
{"title":"iMARS","authors":"Mengyuan Li, Ann Franchesca Laguna, D. Reis, Xunzhao Yin, M. Niemier, X. S. Hu","doi":"10.1145/3489517.3530478","DOIUrl":"https://doi.org/10.1145/3489517.3530478","url":null,"abstract":"Recommendation systems (RecSys) suggest items to users by predicting their preferences based on historical data. Typical RecSys handle large embedding tables and many embedding table related operations. The memory size and bandwidth of the conventional computer architecture restrict the performance of RecSys. This work proposes an in-memory-computing (IMC) architecture (iMARS) for accelerating the filtering and ranking stages of deep neural network-based RecSys. iMARS leverages IMC-friendly embedding tables implemented inside a ferroelectric FET based IMC fabric. Circuit-level and system-level evaluation show that iMARS achieves 16.8x (713x) end-to-end latency (energy) improvement compared to the GPU counterpart for the MovieLens dataset.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"506 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115336761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1