
Latest publications from the 2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Design Space and Memory Technology Co-exploration for In-Memory Computing Based Machine Learning Accelerators
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549453
Kang He, I. Chakraborty, Cheng Wang, K. Roy
In-Memory Computing (IMC) has become a promising paradigm for accelerating machine learning (ML) inference. While IMC architectures built on various memory technologies have demonstrated higher throughput and energy efficiency compared to conventional digital architectures, little research has been done from a system-level perspective to provide comprehensive and fair comparisons of different memory technologies under the same hardware budget (area). Since large-scale analog IMC hardware relies on costly analog-to-digital converters (ADCs) for robust digital communication, optimizing IMC architecture performance requires synergistic co-design of memory arrays and peripheral ADCs, wherein the trade-offs may depend on the underlying memory technologies. To that effect, we co-explore the IMC macro design space and memory technology to identify the best design point for each memory type under iso-area budgets, aiming to make fair comparisons among different technologies, including SRAM, phase-change memory, resistive RAM, ferroelectrics, and spintronics. First, an extended simulation framework employing a spatial architecture with off-chip DRAM is developed, capable of integrating both CMOS and nonvolatile memory technologies. Subsequently, we propose different modes of ADC operation with distinctive weight-mapping schemes to cope with different on-chip area budgets. Our results show that under an iso-area budget, the memory technologies being evaluated need to adopt different IMC macro-level designs to deliver the optimal energy-delay product (EDP) at the system level. We demonstrate that under small area budgets, the choice of the best memory technology is determined by its cell area and write energy; as area budgets grow larger, cell area becomes the dominant factor in technology selection.
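To make the co-exploration loop concrete, the sketch below sweeps a small set of macro configurations per technology and keeps the minimum-EDP design that fits a fixed area budget. It is only an illustration of iso-area design-space search: the per-technology numbers and the `macro_area`/`macro_edp` models are invented placeholders, not the paper's calibrated simulator.

```python
from itertools import product

# Hypothetical per-technology parameters (illustrative numbers, not from the paper).
TECHS = {
    "SRAM": {"cell_area_um2": 0.160, "write_energy_pj": 0.05},
    "RRAM": {"cell_area_um2": 0.040, "write_energy_pj": 2.00},
    "PCM":  {"cell_area_um2": 0.050, "write_energy_pj": 10.0},
}
ARRAY_SIZES = [128, 256, 512]   # rows = cols of one IMC macro
ADC_BITS = [4, 6, 8]            # peripheral ADC resolution

def macro_area(tech, rows, adc_bits):
    """Toy area model: memory array plus column ADCs that grow with resolution."""
    array = rows * rows * TECHS[tech]["cell_area_um2"]
    adcs = rows * (50.0 * 2 ** adc_bits) / 64.0   # placeholder ADC area scaling
    return array + adcs

def macro_edp(tech, rows, adc_bits):
    """Toy energy-delay-product model; a real flow would query the simulator.
    Write energy is charged per cell to reflect weight (re)loading cost."""
    energy = (rows * rows * (0.01 + TECHS[tech]["write_energy_pj"])
              + rows * 0.5 * 2 ** adc_bits)
    delay = rows / 256.0 + 2 ** adc_bits * 0.1
    return energy * delay

def best_design(tech, area_budget_um2):
    """Minimum-EDP macro configuration that fits the iso-area budget."""
    feasible = [
        (macro_edp(tech, r, b), r, b)
        for r, b in product(ARRAY_SIZES, ADC_BITS)
        if macro_area(tech, r, b) <= area_budget_um2
    ]
    return min(feasible) if feasible else None

for tech in TECHS:
    print(tech, "->", best_design(tech, area_budget_um2=50_000))
```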
Citations: 4
AntiSIFA-CAD: A Framework to Thwart SIFA at the Layout Level
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549480
Rajat Sadhukhan, Sayandeep Saha, Debdeep Mukhopadhyay
Fault Attacks (FA) have gained a lot of attention from both industry and academia due to their practicality and wide applicability to different domains of computing. In the context of symmetric-key cryptography, designing countermeasures against FA is still an open problem. Recently proposed attacks such as Statistical Ineffective Fault Analysis (SIFA) have shown that merely adding redundancy or an infection-based countermeasure to detect the fault does not work, and that a proper combination of masking and error correction/detection is required. In this work, we show that masking, which is mathematically established as a good countermeasure against a certain class of SIFA faults, may fall short in practice if low-level details of physical design layout development are not taken care of. We initiate this study by demonstrating a successful SIFA attack on a placed-and-routed masked crypto design for an ASIC platform. We then propose a fully automated approach, along with a proper choice of placement constraints that can be realized easily in any commercial CAD tool, to eliminate this vulnerability during physical layout development. Experimental validation of our tool flow on a masked implementation of the PRESENT cipher establishes this claim.
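For readers unfamiliar with the attack being defended against, the toy below illustrates the statistical core of SIFA: keep only encryptions where an injected fault was ineffective (the output is still correct, so detection and infection countermeasures stay silent) and test the retained intermediate values for bias. The stuck-at fault model, the nibble-level target, and the sample counts are all synthetic; this is background illustration, not the AntiSIFA-CAD flow.

```python
import random
from collections import Counter
from scipy.stats import chisquare

random.seed(0)

def inject_fault(nibble):
    """Toy biased fault: stuck-at-0 on the most significant bit.
    Returns the faulted value and whether the fault actually changed it."""
    faulted = nibble & 0b0111
    return faulted, faulted != nibble

# SIFA-style filtering: keep only the *ineffective* cases, i.e. the fault left
# the value unchanged, so a redundancy-based countermeasure never triggers.
retained = []
for _ in range(20000):
    v = random.randrange(16)
    faulted, effective = inject_fault(v)
    if not effective:
        retained.append(v)

# Without leakage the retained values would look uniform over 0..15; the
# stuck-at fault makes MSB=1 values impossible, and a chi-squared test exposes it.
counter = Counter(retained)
counts = [counter.get(x, 0) for x in range(16)]
stat, pval = chisquare(counts)
print(f"retained {len(retained)} samples, chi2 = {stat:.1f}, p = {pval:.2e}")
```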
Citations: 0
Combining BMC and Complementary Approximate Reachability to Accelerate Bug-Finding
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549393
Xiaoyu Zhang, Shengping Xiao, Jianwen Li, G. Pu, O. Strichman
Bounded Model Checking (BMC) is so far considered the best engine for bug-finding in hardware model checking. Given a bound K, BMC can detect whether there is a counterexample to a given temporal property within K steps from the initial state, thus performing a global-style search. Recently, a SAT-based model-checking technique called Complementary Approximate Reachability (CAR) was shown to be complementary to BMC, in the sense that each can frequently solve instances that the other cannot within the same time limit. CAR detects a counterexample gradually, with the guidance of an over-approximating state sequence, and performs a local-style search. In this paper, we consider three different ways to combine BMC and CAR. Our experiments show that all three combinations outperform BMC and CAR on their own and solve instances that cannot be solved by either technique. Our findings are based on a comprehensive experimental evaluation using the benchmarks of two hardware model-checking competitions.
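The "global-style search" that BMC performs can be stated precisely. For a transition system with initial-state predicate I, transition relation T, and property P, BMC at bound K hands a SAT solver the standard unrolling below; any satisfying assignment is a counterexample of length at most K. This is the textbook formulation, shown only to make the bounded search explicit; the paper's specific BMC/CAR combinations are not reproduced here.

```latex
\exists\, s_0,\dots,s_K .\;
I(s_0)\;\wedge\;\bigwedge_{i=0}^{K-1} T(s_i,s_{i+1})\;\wedge\;\bigvee_{i=0}^{K}\neg P(s_i)
```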
Citations: 0
Qubit Mapping for Reconfigurable Atom Arrays
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549331
Bochen Tan, D. Bluvstein, M. Lukin, J. Cong
Because they offer the largest number of qubits available and massively parallel execution of entangling two-qubit gates, atom arrays are a promising platform for quantum computing. The qubits are selectively loaded into arrays of optical traps, some of which can be moved during the computation itself. By adjusting the locations of the traps and shining a specific global laser, different pairs of qubits, even those initially far away, can be entangled at different stages of the quantum program execution. In comparison, previous QC architectures only generate entanglement on a fixed set of quantum register pairs. Thus, reconfigurable atom arrays (RAA) present a new challenge for QC compilation, especially the qubit mapping/layout synthesis stage, which decides qubit placement and gate scheduling. In this paper, we consider an RAA QC architecture that contains multiple arrays, supports 2D array movements, represents cutting-edge experimental platforms, and is much more general than those of previous works. We start by systematically examining the fundamental constraints that physics imposes on RAA. Built upon this understanding, we discretize the state space of the architecture and formulate layout synthesis for such an architecture as a satisfiability modulo theories (SMT) problem. Finally, we demonstrate our work by compiling the quantum approximate optimization algorithm (QAOA), one of the promising near-term quantum computing applications. Our layout synthesizer reduces the number of required native two-qubit gates in 22-qubit QAOA by 5.72x (geomean) compared to leading experiments on a superconducting architecture. Combined with a better coherence time, this yields an order-of-magnitude increase in circuit fidelity.
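To give a flavor of what an SMT formulation of qubit layout synthesis looks like, the toy encoding below (using the z3 solver) introduces an integer trap-site variable per qubit and per stage, forbids two qubits from sharing a site, requires the qubits of each scheduled gate to sit on adjacent sites, and bounds how far a qubit may move between stages. It is a deliberately simplified 1D sketch with invented gate lists and bounds, not the paper's RAA model, whose array-level row/column movement constraints are far richer.

```python
from z3 import Int, Solver, If, sat

N_QUBITS, N_SITES, N_STAGES = 4, 6, 3
GATES = {0: [(0, 1), (2, 3)], 1: [(1, 2)], 2: [(0, 3)]}   # hypothetical circuit

def zabs(e):
    return If(e >= 0, e, -e)

# pos[q][t]: trap-site index of logical qubit q at stage t (1D for simplicity).
pos = [[Int(f"pos_{q}_{t}") for t in range(N_STAGES)] for q in range(N_QUBITS)]
s = Solver()

for t in range(N_STAGES):
    for q in range(N_QUBITS):
        s.add(pos[q][t] >= 0, pos[q][t] < N_SITES)
    # No two qubits may occupy the same trap site at the same stage.
    for a in range(N_QUBITS):
        for b in range(a + 1, N_QUBITS):
            s.add(pos[a][t] != pos[b][t])
    # Each scheduled two-qubit gate needs its qubits on adjacent sites,
    # standing in for the Rydberg interaction radius.
    for a, b in GATES[t]:
        s.add(zabs(pos[a][t] - pos[b][t]) == 1)

# Bound per-stage movement (placeholder for the real array-move constraints).
for t in range(N_STAGES - 1):
    for q in range(N_QUBITS):
        s.add(zabs(pos[q][t + 1] - pos[q][t]) <= 2)

if s.check() == sat:
    m = s.model()
    for t in range(N_STAGES):
        print(f"stage {t}:", [m[pos[q][t]].as_long() for q in range(N_QUBITS)])
```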
Citations: 8
Fine-Granular Computation and Data Layout Reorganization for Improving Locality
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549386
M. Kandemir, Xulong Tang, Jagadish B. Kotra, Mustafa Karaköy
While data locality and cache performance have been investigated in great depth by prior research (in the context of both high-end systems and embedded/mobile systems), an important characteristic of prior approaches is that they transform the loop and/or data space (e.g., array layout) as a whole. Unfortunately, such coarse-grain approaches bring three critical issues. First, they implicitly assume that all parts of a given array would equally benefit from the identified data layout transformation. Second, they also assume that a given loop transformation would have the same locality impact on an entire data array. Third, and more importantly, such coarse-grain approaches are local by nature and have difficulty achieving globally optimal executions. Motivated by these drawbacks of existing code and data space reorganization/optimization techniques, this paper proposes to determine multiple loop transformation matrices for each loop nest in the program and multiple data layout transformations for each array accessed by the program, in an attempt to exploit data locality at a finer granularity. It leverages bipartite graph matching and extends the proposed fine-granular integrated loop-layout strategy to a multicore setting as well. Our experimental results show that the proposed approach significantly improves data locality and outperforms existing schemes, with a 9.1% average performance improvement in single-threaded executions and an 11.5% average improvement in multi-threaded executions over the state-of-the-art.
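To make "loop transformation matrix" concrete for readers outside the compiler community, the sketch below applies a unimodular permutation matrix (classic loop interchange) to a 2-deep iteration space and shows how an A[j][i] access pattern turns from a column walk into a row walk, the kind of locality improvement such matrices encode. The nest, matrix, and array access are textbook examples, not the paper's fine-granular strategy or its bipartite-matching step.

```python
import numpy as np

# A 2-deep loop nest accessing A[j][i] (poor spatial locality in row-major C).
# A unimodular loop-transformation matrix T permutes the iteration vector
# (i, j) -> (j, i), i.e., classic loop interchange.
T = np.array([[0, 1],
              [1, 0]])

N = 4
original_order = [(i, j) for i in range(N) for j in range(N)]

# Transformed iteration order: iterate the new indices, map back via T^-1.
T_inv = np.linalg.inv(T).astype(int)
transformed_order = []
for ip in range(N):
    for jp in range(N):
        i, j = T_inv @ np.array([ip, jp])
        transformed_order.append((i, j))

# The access A[j][i] now walks consecutive elements of each row of A.
print("original accesses  :", [(j, i) for i, j in original_order[:6]])
print("transformed accesses:", [(j, i) for i, j in transformed_order[:6]])
```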
Citations: 0
Fault-tolerant Deep Learning using Regularization
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3561120
Biresh Kumar Joardar, Aqeeb Iqbal Arka, J. Doppa, P. Pande
Resistive random-access memory (ReRAM) has become one of the most popular choices of hardware implementation for machine learning application workloads. However, these devices exhibit non-ideal behavior, which presents a challenge to widespread adoption. Training/inferencing on these faulty devices can lead to poor prediction accuracy, and existing fault-tolerant methods are associated with high implementation overheads. In this paper, we present some new directions for solving reliability issues using software solutions. These software-based methods are inherent in deep learning training/inferencing, and they can also be used to address hardware reliability issues. They prevent accuracy drops during training/inferencing caused by unreliable ReRAMs and incur lower area and power overheads.
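One common software-level recipe in this space is to train with faults present and lean on regularization so that no single weight becomes critical. The PyTorch sketch below temporarily zeroes a random subset of weights each step (mimicking stuck-at-0 ReRAM cells), computes gradients under those faults, restores the clean weights before the optimizer step, and uses weight decay as the regularizer. The network, fault rate, and training data are invented; this is a generic fault-injection-plus-regularization sketch, not the specific regularization proposed in the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(16, 32), nn.Linear(32, 4)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

def apply_faults(model, stuck_rate=0.01):
    """Temporarily zero a random subset of weights (stuck-at-0 cells) and
    return the saved originals so they can be restored after backward."""
    saved = []
    with torch.no_grad():
        for p in model.parameters():
            saved.append(p.detach().clone())
            p.masked_fill_(torch.rand_like(p) < stuck_rate, 0.0)
    return saved

def restore(model, saved):
    with torch.no_grad():
        for p, s in zip(model.parameters(), saved):
            p.copy_(s)

model = TinyNet()
# weight_decay is the L2 regularization term discouraging reliance on any
# single, possibly faulty, weight.
opt = torch.optim.SGD(model.parameters(), lr=0.05, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(64, 16), torch.randint(0, 4, (64,))

for step in range(100):
    saved = apply_faults(model, stuck_rate=0.01)  # faults seen by forward/backward
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    restore(model, saved)                         # update the fault-free weights
    opt.step()
print("final training loss:", round(float(loss), 4))
```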
Citations: 0
Reinforcement Learning and DEAR Framework for Solving the Qubit Mapping Problem
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549472
Ching-Yao Huang, C. Lien, Wai-Kei Mak
Quantum computing is gaining more and more attention due to its huge potential and the constant progress in quantum computer development. IBM and Google have released quantum architectures with more than 50 qubits. However, in these machines the physical qubits are not fully connected, so two-qubit interactions can only be performed between specific pairs of physical qubits. To execute a quantum circuit, it is necessary to transform it into a functionally equivalent one that respects the constraints imposed by the target architecture. Quantum circuit transformation inevitably introduces additional gates, which reduces the fidelity of the circuit. It is therefore important that the transformation method completes the transformation with minimal overhead. The transformation consists of two steps: initial mapping and qubit routing. Here we propose a reinforcement learning-based model to solve the initial mapping problem. Initial mapping is formulated as sequence-to-sequence learning, and a self-attention network is used to extract features from a circuit. For qubit routing, a DEAR (Dynamically-Extract-and-Route) framework is proposed. The framework iteratively extracts a subcircuit and uses A* search to determine when and where to insert additional gates. It helps to preserve lookahead ability dynamically and to provide more accurate cost estimation efficiently during A* search. The experimental results show that our RL model generates better initial mappings than the best-known algorithms, with 12% fewer additional gates in the qubit routing stage. Furthermore, our DEAR framework outperforms the state-of-the-art qubit routing approach, with 8.4% and 36.3% average reductions in the number of additional gates and in execution time, starting from the same initial mapping.
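As background on the kind of cost estimate an A*-based router works with, the sketch below scores every candidate SWAP on a coupling-graph edge with f = g + h, where g counts inserted SWAPs and h sums the remaining physical distances of pending two-qubit gates. The 5-qubit line coupling graph, the front layer, and the identity initial mapping are invented; this is a generic distance heuristic in the SABRE/A* tradition, not the DEAR framework's actual cost model.

```python
import networkx as nx

# Hypothetical 5-qubit line coupling graph and a front layer of pending CNOTs
# expressed on logical qubits.
coupling = nx.path_graph(5)                     # physical qubits 0-1-2-3-4
dist = dict(nx.all_pairs_shortest_path_length(coupling))
front_layer = [(0, 3), (1, 4)]                  # pending two-qubit gates
mapping = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}        # logical -> physical

def heuristic(mapping):
    """h: total physical distance still separating the qubits of pending gates."""
    return sum(dist[mapping[a]][mapping[b]] for a, b in front_layer)

def apply_swap(mapping, p, q):
    """Return the mapping after swapping the logical qubits on physical p and q."""
    inv = {phys: log for log, phys in mapping.items()}
    new = dict(mapping)
    new[inv[p]], new[inv[q]] = q, p
    return new

# Evaluate every candidate SWAP on a coupling edge with an A*-style score
# f = g + h, where g is the number of SWAPs inserted so far (1 here).
candidates = []
for p, q in coupling.edges:
    new_map = apply_swap(mapping, p, q)
    candidates.append((1 + heuristic(new_map), (p, q)))

best_f, best_swap = min(candidates)
print("initial h =", heuristic(mapping))
print("best swap =", best_swap, "with f =", best_f)
```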
Citations: 1
Accelerating Cache Coherence in Manycore Processor through Silicon Photonic Chiplet
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549338
Chengeng Li, Fan Jiang, Shixi Chen, Jiaxu Zhang, Yinyi Liu, Yuxiang Fu, Jiang Xu
Cache coherence overhead in manycore systems is becoming prominent as system scale increases. However, traditional electrical networks restrict the efficiency of cache coherence transactions due to limited bandwidth and long latency. Optical networks promise high bandwidth and low latency, and support both efficient unicast and multicast transmission, which can potentially accelerate cache coherence in manycore systems. This work proposes a novel photonic cache coherence network, called PCCN, with a physically centralized, logically distributed directory for chiplet-based manycore systems. PCCN adopts a channel-sharing method with a contention-resolution mechanism for efficient long-distance transmission of coherence-related packets. Experimental results show that, compared to state-of-the-art proposals, PCCN speeds up application execution by 1.32x, reduces memory access latency by 26%, and improves energy efficiency by 1.26x, on average, in a 128-core system.
Citations: 1
Towards High Performance and Accurate BNN Inference on FPGA with Structured Fine-grained Pruning
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549368
Keqi Fu, Zhi Qi, Jiaxuan Cai, Xulong Shi
As the extreme case of quantized networks, Binary Neural Networks (BNNs) have received tremendous attention due to their many hardware-friendly properties in terms of storage and computation. To reach the limit of compact models, we attempt to combine binarization with pruning techniques, further exploring the redundancy of BNNs. However, coarse-grained pruning methods may cause severe accuracy drops, while traditional fine-grained ones induce irregular sparsity that is hard for hardware to exploit. In this paper, we propose two advanced fine-grained BNN pruning modules, namely structured channel-wise kernel pruning and dynamic spatial pruning, from a joint perspective of algorithm and hardware. The pruned BNN models are trained from scratch and offer not only higher precision but also a high degree of parallelism. We then develop an accelerator architecture that can effectively exploit the sparsity produced by our algorithm. Finally, we implement the pruned BNN models on an embedded FPGA (Ultra96v2). The results show that our software and hardware co-design achieves a 5.4x inference speedup over the baseline BNN, with higher resource and energy efficiency than prior FPGA-based BNN implementations.
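To illustrate what "structured channel-wise kernel pruning" can mean in practice, the PyTorch sketch below ranks the per-input-channel kernels of each output channel by L1 norm and zeroes the weakest ones, so every output channel keeps the same number of kernels and the sparsity pattern stays regular for hardware. The layer shape, keep ratio, and ranking criterion are illustrative assumptions, not the paper's exact pruning modules.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy conv layer standing in for one binarized convolution (weights would be
# +/-1 after binarization; pruning decisions are made on the latent weights).
conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, bias=False)

def channelwise_kernel_prune(conv, keep_ratio=0.5):
    """Structured pruning: for each output channel, rank its 3x3 kernels
    (one per input channel) by L1 norm and zero the weakest ones, so every
    output channel keeps the same number of kernels, a regular pattern
    that hardware can exploit."""
    w = conv.weight.data                      # [out_c, in_c, k, k]
    out_c, in_c = w.shape[:2]
    n_keep = max(1, int(in_c * keep_ratio))
    mask = torch.zeros_like(w)
    scores = w.abs().sum(dim=(2, 3))          # L1 norm of each kernel
    for oc in range(out_c):
        keep = torch.topk(scores[oc], n_keep).indices
        mask[oc, keep] = 1.0
    conv.weight.data.mul_(mask)
    return mask

mask = channelwise_kernel_prune(conv, keep_ratio=0.5)
kept = int(mask[:, :, 0, 0].sum())
total = mask.shape[0] * mask.shape[1]
print(f"kept {kept} of {total} kernels ({kept / total:.0%})")
```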
Citations: 1
Routability-driven Analytical Placement with Precise Penalty Models for Large-Scale 3D ICs
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549339
Jai-Ming Lin, Hao-Yuan Hsieh, Hsuan Kung, H. Lin
The quality of a true 3D placement approach relies greatly on the correctness of the models used in its formulation. However, the models used by previous approaches are not precise enough. Moreover, they do not actually place TSVs, which prevents them from obtaining accurate wirelength estimates and constructing a correct congestion map. Besides, they rarely discuss routability, which is the most important issue considered in 2D placement. To resolve these shortcomings, this paper proposes more accurate models that estimate placement utilization and TSV count using the softmax function, which can align cells to exact tiers. Moreover, we propose a fast parallel algorithm to update the locations of TSVs when cells are moved during optimization. Finally, we present a novel penalty model to estimate the routing overflow of regions covered by cells, and we inflate cells in congested regions according to this model. Experimental results show that our methodology obtains better results than previous works.
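The softmax-based tier alignment mentioned in the abstract can be illustrated with a few lines of numpy: a cell's continuous z coordinate is turned into a probability distribution over discrete tiers, and the expected tier distance of a two-pin net gives a differentiable TSV-count estimate; lowering the temperature drives cells toward exact tiers. The tier count, temperature values, and cost form below are illustrative assumptions, not the paper's precise penalty models.

```python
import numpy as np

N_TIERS = 4
TIER_Z = np.arange(N_TIERS)            # nominal z coordinate of each tier

def soft_tier(z, temperature=0.5):
    """Softmax weights of a cell's continuous z coordinate over discrete tiers.
    Lower temperature pushes the distribution toward a single tier."""
    logits = -np.abs(z - TIER_Z) / temperature
    e = np.exp(logits - logits.max())
    return e / e.sum()

def expected_tsvs(z_u, z_v, temperature=0.5):
    """Expected TSV count of a two-pin net: E[|tier_u - tier_v|] under the
    independent soft tier assignments of its two cells."""
    pu, pv = soft_tier(z_u, temperature), soft_tier(z_v, temperature)
    tiers = np.arange(N_TIERS)
    gap = np.abs(tiers[:, None] - tiers[None, :])
    return float(np.sum(pu[:, None] * pv[None, :] * gap))

for temp in (1.0, 0.5, 0.1):
    probs = soft_tier(z=1.3, temperature=temp)
    print(f"T={temp}: tier probs =", np.round(probs, 3),
          " E[TSVs] for net (z=1.3, z=2.8) =", round(expected_tsvs(1.3, 2.8, temp), 3))
```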
Citations: 1