2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD): Latest Papers, Page 9

A Combined Logical and Physical Attack on Logic Obfuscation

2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549349

Michael Zuzak, Yuntao Liu, Isaac McDaniel, A. Srivastava

Logic obfuscation protects integrated circuits from an untrusted foundry attacker during manufacturing. To counter obfuscation, a number of logical (e.g., Boolean satisfiability) and physical (e.g., electro-optical probing) attacks have been proposed. By definition, each of these attacks uses only a subset of the information leaked by a circuit to unlock it. Countermeasures often exploit the resulting blind spots to thwart these attacks, limiting their scalability and generalizability. To overcome this, we propose a combined logical and physical attack against obfuscation called the CLAP attack. The CLAP attack leverages both the logical and physical properties of a locked circuit to prune the keyspace in a unified and theoretically rigorous fashion, resulting in a more versatile and potent attack. To formulate the physical portion of the CLAP attack, we derive a logical formulation that provably identifies input sequences capable of sensitizing logically expressive regions in a circuit. We prove that electro-optically probing these regions reveals portions of the key. For the logical portion of the attack, we integrate the physical attack results into a Boolean satisfiability attack to find the correct key. We evaluate the CLAP attack by launching it against four obfuscation schemes in benchmark circuits. The physical portion of the attack fully specified 60.6% of key bits and partially specified another 10.3%. The logical portion of the attack found the correct key in the physical-attack-limited keyspace in under 30 minutes. Thus, the CLAP attack unlocked each circuit despite obfuscation.
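The keyspace-pruning idea in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the four-bit locked circuit, the secret key, and the set of "physically recovered" key bits are hypothetical, and a brute-force search over the remaining key bits stands in for the Boolean satisfiability attack that the paper actually uses.

```python
from itertools import product

def locked_circuit(x, key):
    """Hypothetical 3-input circuit locked with four XOR key gates."""
    a = (x[0] & x[1]) ^ key[0]
    b = (x[1] | x[2]) ^ key[1]
    c = (a & b) ^ key[2]
    d = (x[0] ^ x[2]) ^ key[3]
    return c | d

SECRET_KEY = (0, 1, 0, 1)  # assumed correct key, known only to the oracle

def oracle(x):
    """Stands in for an activated chip that the attacker can query."""
    return locked_circuit(x, SECRET_KEY)

def prune_and_search(n_key_bits, known_bits, inputs):
    """Fix the key bits in `known_bits` (index -> value), then search the rest.

    Returns (key, candidates_tried). The key is functionally correct on
    `inputs`; it is not guaranteed to equal SECRET_KEY bit for bit.
    """
    free = [i for i in range(n_key_bits) if i not in known_bits]
    tried = 0
    for assignment in product((0, 1), repeat=len(free)):
        key = [0] * n_key_bits
        for i, v in known_bits.items():
            key[i] = v
        for i, v in zip(free, assignment):
            key[i] = v
        tried += 1
        if all(locked_circuit(x, key) == oracle(x) for x in inputs):
            return tuple(key), tried
    return None, tried

ALL_INPUTS = list(product((0, 1), repeat=3))

if __name__ == "__main__":
    # Pretend the physical attack fully specified key bits 0 and 2.
    key, tried = prune_and_search(4, {0: 0, 2: 0}, ALL_INPUTS)
    print("key:", key, "candidates tried:", tried)
```

With two of the four key bits fixed, the search space shrinks from 16 candidates to at most 4, which is the effect the physical portion of the attack is meant to have on the logical portion.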


Citations: 1

Sub-Resolution Assist Feature Generation with Reinforcement Learning and Transfer Learning

2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549388

Guanhui. Liu, Wei-Chen Tai, Yi-Ting Lin, I. Jiang, J. Shiely, Pu-Jen Cheng

As modern photolithography feature sizes continue to shrink, sub-resolution assist feature (SRAF) generation has become a key resolution enhancement technique to improve the manufacturing process window. State-of-the-art works resort to machine learning to overcome the deficiencies of model-based and rule-based approaches. Nevertheless, these machine learning-based methods either ignore or only implicitly consider the optical interference between SRAFs, and rely heavily on post-processing to satisfy SRAF mask manufacturing rules. In this paper, we are the first to generate SRAFs using reinforcement learning to address SRAF interference and produce mask-rule-compliant results directly. In this way, our two-phase learning enables us to emulate the style of model-based SRAFs while further improving the process variation (PV) band. A state alignment and action transformation mechanism is proposed to achieve orientation equivariance while expediting the training process. We also propose a transfer learning framework, allowing SRAF generation under different light sources without retraining the model. Compared with state-of-the-art works, our method improves the solution quality in terms of PV band and edge placement error (EPE) while reducing the overall runtime.


Citations: 0

Numerically-Stable and Highly-Scalable Parallel LU Factorization for Circuit Simulation

2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549337

Xiaoming Chen

A number of sparse linear systems are solved by sparse LU factorization in a circuit simulation process. The coefficient matrices of these linear systems have identical structure but different values. Pivoting is usually needed in sparse LU factorization to ensure numerical stability, which makes it difficult to predict the exact dependencies for scheduling parallel LU factorization. However, the matrix values usually change smoothly across circuit simulation iterations, which provides the potential to "guess" the dependencies. This work proposes a novel parallel LU factorization algorithm with reduced pivoting whose numerical stability is equivalent to that of LU factorization with pivoting. The basic idea is to reuse the previous structural and pivoting information as much as possible to perform highly-scalable parallel factorization without pivoting, which is scheduled by the "guessed" dependencies. Once a pivot is found to be too small, the remaining matrix is factorized with pivoting in a pipelined way. Comprehensive experiments including comparisons with state-of-the-art CPU- and GPU-based parallel sparse direct solvers on 66 circuit matrices and real SPICE DC simulations on 4 circuit netlists reveal the superior performance and scalability of the proposed algorithm. The proposed solver is available at https://github.com/chenxm1986/cktso.
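The "reuse the previous pivot order, fall back to pivoting when a pivot is too small" idea can be sketched on a small dense matrix. This is only an illustration under stated assumptions: the function names, the relative threshold `tol`, and the dense sequential loops are hypothetical and are not taken from the cktso solver, which is sparse and parallel.

```python
def lu_reuse_pivots(A, prev_perm, tol=1e-3):
    """In-place LU of the rows of A taken in the previous pivot order.

    Factorizes without pivoting while pivots stay acceptable. Once a pivot
    is smaller than `tol` times the largest entry in its column (an assumed
    test), the remaining steps use partial pivoting.
    Returns (M, perm, fell_back), with L below and U on/above the diagonal.
    """
    n = len(A)
    perm = list(prev_perm)
    M = [list(A[p]) for p in perm]
    fell_back = False
    for k in range(n):
        if not fell_back:
            col_max = max(abs(M[i][k]) for i in range(k, n))
            if abs(M[k][k]) < tol * col_max:
                fell_back = True  # pivot too small: pivot from here on
        if fell_back:
            r = max(range(k, n), key=lambda i: abs(M[i][k]))
            if r != k:
                M[k], M[r] = M[r], M[k]
                perm[k], perm[r] = perm[r], perm[k]
        if M[k][k] == 0:
            raise ZeroDivisionError("matrix is singular")
        for i in range(k + 1, n):
            M[i][k] /= M[k][k]
            for j in range(k + 1, n):
                M[i][j] -= M[i][k] * M[k][j]
    return M, perm, fell_back

def lu_solve(M, perm, b):
    """Solve A x = b using the factors returned by lu_reuse_pivots."""
    n = len(M)
    x = [b[p] for p in perm]
    for i in range(n):                 # forward substitution (unit L)
        for j in range(i):
            x[i] -= M[i][j] * x[j]
    for i in reversed(range(n)):       # back substitution (U)
        for j in range(i + 1, n):
            x[i] -= M[i][j] * x[j]
        x[i] /= M[i][i]
    return x

if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
    M, perm, fell_back = lu_reuse_pivots(A, [0, 1, 2])
    print(lu_solve(M, perm, [6.0, 10.0, 8.0]), perm, fell_back)
```

The permutation returned by one call can be passed as `prev_perm` to the next, which mirrors the assumption in the abstract that matrix values change smoothly between simulation iterations.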


Citations: 2

False Data Injection Attacks on Sensor Systems

2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Pub Date : 2022-10-29 DOI: 10.1145/3508352.3561098

D. Serpanos

False data injection attacks on sensor systems are an emerging threat to cyberphysical systems, creating significant risks to all application domains and, importantly, to critical infrastructures. Cyberphysical systems are process-dependent, leading to differing false data injection attacks that target disruption of the specific processes (plants). We present a taxonomy of false data injection attacks, using a general model for cyberphysical systems, showing that global and continuous attacks are extremely powerful. We then describe three methods that enable effective monitoring and detection of false data injection attacks during plant operation. Because sensor failures have effects equivalent to those of the corresponding false data injection attacks, the methods are effective for sensor fault detection as well.


Citations: 0

Why are Graph Neural Networks Effective for EDA Problems?

2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Pub Date : 2022-10-29 DOI: 10.1145/3508352.3561093

Haoxing Ren, S. Nath, Yanqing Zhang, Hao Chen, Mingjie Liu

In this paper, we discuss the source of effectiveness of Graph Neural Networks (GNNs) in EDA, particularly in the VLSI design automation domain. We argue that the effectiveness comes from the fact that GNNs implicitly embed the prior knowledge and inductive biases associated with given VLSI tasks, which is one of the three approaches to make a learning algorithm physics-informed. These inductive biases are different from those commonly used in GNNs designed for other structured data, such as social networks and citation networks. We will illustrate this principle with several recent GNN examples in the VLSI domain, including predictive tasks such as switching activity prediction, timing prediction, parasitics prediction, and layout symmetry prediction, as well as optimization tasks such as gate sizing and macro and cell transistor placement. We will also discuss the challenges of applying GNNs and the opportunity of applying self-supervised learning techniques with GNNs for VLSI optimization.


Citations: 7

Qilin: Enabling Performance Analysis and Optimization of Shared-Virtual Memory Systems with FPGA Accelerators

2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549431

Edward Richter, Deming Chen

While the tight integration of components in heterogeneous systems has increased the popularity of the Shared-Virtual Memory (SVM) system programming model, the overhead of SVM can significantly impact end-to-end application performance. However, studying SVM implementations is difficult, as there is no open and flexible system to explore trade-offs between different SVM implementations and the SVM design space is not clearly defined. To this end, we present Qilin, the first open-source system that enables thorough study of SVM in heterogeneous computing environments for discrete accelerators. Qilin is a transparent and flexible system built on top of an open-source FPGA shell, which allows researchers to alter components of the underlying SVM implementation to understand how SVM design decisions impact performance. Using Qilin, we perform an extensive quantitative analysis on the overheads of three SVM architectures, and generate several insights which highlight the costs and benefits of each architecture. From these insights, we propose a flowchart of how to choose the best SVM implementation given the application characteristics and the SVM capabilities of the system. Qilin also provides application developers a flexible SVM shell for high-performance virtualized applications. Optimizations enabled by Qilin can reduce the latency of translations by 6.86x compared to an open-source FPGA shell.


Citations: 0

A Robust Quantum Layout Synthesis Algorithm with a Qubit Mapping Checker*

2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549394

Tsou-An Wu, Yun-Jhe Jiang, Shao-Yun Fang

Layout synthesis in quantum circuits maps the logical qubits of a synthesized circuit onto the physical qubits of a hardware device (coupling graph) and complies with the hardware limitations. Existing studies on the problem usually suffer from intractable formulation complexity and thus prohibitively long runtimes. In this paper, we propose an efficient layout synthesizer by developing a satisfiability modulo theories (SMT)-based qubit mapping checker. The proposed qubit mapping checker can efficiently derive a SWAP-free solution if one exists. If no SWAP-free solution exists for a circuit, we propose a divide-and-conquer scheme that utilizes the checker to find SWAP-free sub-solutions for sub-circuits, and the overall solution is found by merging sub-solutions with SWAP insertion. Experimental results show that the proposed optimization flow can achieve more than 3000X runtime speedup over a state-of-the-art work to derive optimal solutions for a set of SWAP-free circuits. Moreover, for the other set of benchmark circuits requiring SWAP gates, our flow achieves more than 800X speedup and obtains near-optimal solutions with only 3% SWAP overhead.
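What a SWAP-free mapping check decides can be shown with a small sketch. The paper's checker is SMT-based; the version below swaps in a plain backtracking search purely for illustration, and the function name, argument layout, and example coupling graphs are all hypothetical. A mapping is SWAP-free when every two-qubit gate acts on logical qubits that land on adjacent physical qubits.

```python
def swap_free_mapping(gates, n_logical, coupling_edges, n_physical):
    """Return {logical: physical} such that every gate lies on a coupling
    edge, or None if no SWAP-free mapping exists.

    gates: iterable of (q1, q2) logical-qubit pairs (two-qubit gates).
    coupling_edges: iterable of (p1, p2) physical-qubit pairs.
    """
    adjacent = {frozenset(e) for e in coupling_edges}
    needed = {frozenset(g) for g in gates}
    mapping, used = {}, set()

    def place(q):
        if q == n_logical:
            return True
        for p in range(n_physical):
            if p in used:
                continue
            # q must sit next to every already-placed qubit it interacts with.
            ok = all(
                frozenset((p, mapping[o])) in adjacent
                for o in range(q)
                if frozenset((q, o)) in needed
            )
            if ok:
                mapping[q] = p
                used.add(p)
                if place(q + 1):
                    return True
                del mapping[q]
                used.discard(p)
        return False

    return dict(mapping) if place(0) else None

if __name__ == "__main__":
    line = [(0, 1), (1, 2), (2, 3)]  # 4 physical qubits in a line
    print(swap_free_mapping([(0, 1), (1, 2), (2, 3)], 4, line, 4))
    print(swap_free_mapping([(0, 1), (1, 2), (0, 2)], 3, line, 4))
```

The second call asks for a triangle of interactions on a line-shaped device, so no SWAP-free mapping exists; in the flow the abstract describes, such a circuit would be split into sub-circuits and joined with SWAP insertion.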


Citations: 1

WSQ-AdderNet: Efficient Weight Standardization based Quantized AdderNet FPGA Accelerator Design with High-Density INT8 DSP-LUT Co-Packing Optimization

2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549439

Yunxiang Zhang, Biao Sun, Weixiong Jiang, Y. Ha, Miao Hu, Wenfeng Zhao

Convolutional neural networks (CNNs) have been widely adopted for various machine intelligence tasks. Nevertheless, CNNs are still known to be computationally demanding due to the convolutional kernels involving expensive Multiply-ACcumulate (MAC) operations. Recent proposals on hardware-optimal neural network architectures suggest that AdderNet with a lightweight ℓ1-norm based feature extraction kernel can be an efficient alternative to the CNN counterpart, where the expensive MAC operations are substituted with efficient Sum-of-Absolute-Difference (SAD) operations. Nevertheless, AdderNet still lacks an efficient hardware implementation methodology comparable to the existing methodologies for CNNs, including efficient quantization, full-integer accelerator implementation, and judicious resource utilization of DSP slices of FPGA devices. In this paper, we present WSQ-AdderNet, a generic framework to quantize and optimize AdderNet-based accelerator designs on embedded FPGA devices. First, we propose a weight standardization technique to facilitate weight quantization in AdderNet. Second, we demonstrate a full-integer quantization hardware implementation strategy, including weight and activation quantization methodologies. Third, we apply DSP packing optimization to maximize the DSP utilization efficiency, where Octo-INT8 can be achieved via DSP-LUT co-packing. Finally, we implement the design on a Xilinx Kria KV-260 FPGA using Xilinx Vitis HLS (high-level synthesis) and Vivado. Our experimental results of ResNet-20 using WSQ-AdderNet demonstrate that the INT8 implementation achieves 89.9% inference accuracy, which shows little performance loss as compared to the FP32 and INT8 CNN designs. At the hardware level, WSQ-AdderNet achieves up to 3.39× DSP density improvement with nearly the same throughput as compared to INT8 CNN design. The reduction in DSP utilization makes it possible to deploy large network models on resource-constrained devices. When further scaling up the PE sizes by 39.8%, WSQ-AdderNet can achieve 1.48× throughput improvement while still achieving 2.42× DSP density improvement.
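The difference between a MAC-based kernel and the SAD-based kernel mentioned above can be sketched in a few lines. This is a minimal one-dimensional illustration on plain integers; the function names are hypothetical, and the negated sum of absolute differences follows the usual AdderNet formulation rather than any detail of the WSQ-AdderNet hardware.

```python
def mac_response(patch, kernel):
    """CNN-style response: multiply-accumulate over the window."""
    return sum(x * w for x, w in zip(patch, kernel))

def sad_response(patch, kernel):
    """AdderNet-style response: negated sum of absolute differences.

    Uses only subtraction, absolute value, and addition, so no multiplier
    is needed. A perfect match between patch and kernel gives 0, the
    largest possible value.
    """
    return -sum(abs(x - w) for x, w in zip(patch, kernel))

def conv1d(signal, kernel, response):
    """Slide the kernel over the signal (stride 1, no padding)."""
    k = len(kernel)
    return [response(signal[i:i + k], kernel)
            for i in range(len(signal) - k + 1)]

if __name__ == "__main__":
    signal, kernel = [1, 2, 3, 4], [1, 2]
    print(conv1d(signal, kernel, mac_response))
    print(conv1d(signal, kernel, sad_response))
```

Because the SAD kernel needs no multiplications, its arithmetic is a natural fit for adders and LUT logic on an FPGA, which is the resource trade-off the abstract's DSP density figures refer to.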


Citations: 3

Accelerating N-bit Operations over TFHE on Commodity CPU-FPGA

2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549413

Kevin Nam, Hyunyoung Oh, Hyungon Moon, Y. Paek

TFHE is a fully homomorphic encryption (FHE) scheme that evaluates Boolean gates, which we will hereafter call Tgates, over encrypted data. TFHE is considered to have higher expressive power than many existing schemes in that it is able to compute not only N-bit Arithmetic operations but also Logical/Relational ones, as arbitrary ALR operations can be represented by Tgate circuits. Despite such strength, TFHE shares a weakness with all other schemes: it suffers from colossal computational overhead. Incessant efforts to reduce the overhead have been made by exploiting the inherent parallelism of FHE operations on ciphertexts. Unlike other FHE schemes, the parallelism of TFHE can be decomposed into multiple layers: one inside each FHE operation (equivalent to a single Tgate) and the other between Tgates. Unfortunately, previous works focused only on exploiting the parallelism inside a Tgate. However, as each N-bit operation over TFHE corresponds to a Tgate circuit constructed from multiple Tgates, it is also necessary to utilize the parallelism between Tgates for optimizing an entire operation. This paper proposes an acceleration technique to maximize the performance of a TFHE N-bit operation by simultaneously utilizing both layers of parallelism within the operation. To fully profit from both layers of parallelism, we have implemented our technique on a commodity CPU-FPGA hybrid machine with parallel execution capabilities in hardware. Our implementation outperforms prior ones by 2.43× in throughput and 12.19× in throughput per watt when performing N-bit operations under the 128-bit quantum security parameters.
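The parallelism between Tgates can be made concrete by grouping the gates of an N-bit adder circuit into dependency levels. The sketch below is an assumption-laden illustration: it works on plaintext bits rather than TFHE ciphertexts, uses a simple ripple-carry adder, and all function and wire names are hypothetical. Gates that share a level do not depend on each other, so in an encrypted setting they could be evaluated concurrently.

```python
OPS = {
    "XOR": lambda x, y: x ^ y,
    "AND": lambda x, y: x & y,
    "OR": lambda x, y: x | y,
}

def ripple_carry_adder(n):
    """Build an n-bit adder as a list of gates (out, op, in1, in2).

    Returns (gates, sum_wires, carry_out_wire); gates are listed in
    topological order.
    """
    gates, sums, carry = [], [], None
    for i in range(n):
        gates.append((f"p{i}", "XOR", f"a{i}", f"b{i}"))
        gates.append((f"g{i}", "AND", f"a{i}", f"b{i}"))
        if carry is None:
            sums.append(f"p{i}")
            carry = f"g{i}"
        else:
            gates.append((f"s{i}", "XOR", f"p{i}", carry))
            gates.append((f"t{i}", "AND", f"p{i}", carry))
            gates.append((f"c{i + 1}", "OR", f"g{i}", f"t{i}"))
            sums.append(f"s{i}")
            carry = f"c{i + 1}"
    return gates, sums, carry

def levelize(gates, inputs):
    """Group gates into layers; a gate's layer is 1 + its deepest input."""
    level = {w: 0 for w in inputs}
    layers = {}
    for gate in gates:
        out, _, x, y = gate
        level[out] = 1 + max(level[x], level[y])
        layers.setdefault(level[out], []).append(gate)
    return [layers[k] for k in sorted(layers)]

def add_by_levels(a, b, n):
    """Add two n-bit integers by evaluating the circuit layer by layer."""
    gates, sums, carry = ripple_carry_adder(n)
    values = {}
    for i in range(n):
        values[f"a{i}"] = (a >> i) & 1
        values[f"b{i}"] = (b >> i) & 1
    for layer in levelize(gates, set(values)):
        # Gates in one layer read only wires from earlier layers, so this
        # loop is where independent Tgates could be dispatched in parallel.
        results = [(out, OPS[op](values[x], values[y]))
                   for out, op, x, y in layer]
        values.update(results)
    bits = [values[s] for s in sums] + [values[carry]]
    return sum(bit << i for i, bit in enumerate(bits))

if __name__ == "__main__":
    print(add_by_levels(11, 6, 4))
```

The number of layers bounds how many sequential rounds of gate evaluation are needed when enough parallel units are available, while the parallelism inside each gate is a separate layer that this sketch does not model.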



Citations: 2

2022 CAD Contest Problem A: Learning Arithmetic Operations from Gate-Level Circuit

2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

Pub Date : 2022-10-29 DOI: 10.1145/3508352.3561107

Chung-Han Chou, Chih-Jen Hsu, Chi-An Wu, Kuan-Hua Tu

Extracting circuit functionality from a gate-level netlist is critical in CAD tools. For security, it helps designers detect hardware Trojans or malicious design changes in netlists that involve third-party resources such as fabrication services and soft/hard IP cores. For verification, it can reduce the complexity and effort of preserving design information under the aggressive optimization strategies adopted by synthesis tools. For Engineering Change Order (ECO), it can spare the designer from having to locate the ECO gate in a sea of bit-level gates. In this contest, we formulated a datapath learning and extraction problem. With a set of benchmarks and an evaluation metric, we expect contestants to develop a tool that learns the arithmetic equations from a synthesized gate-level netlist.
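As a toy illustration of what recovering an arithmetic equation from a gate-level netlist can mean, the sketch below simulates a small flattened netlist exhaustively and reports which candidate equation it matches. Everything here is an assumption for illustration: the netlist format, the wire names, the candidate list, and the function names are hypothetical and are not the contest's input format or evaluation metric. Exhaustive simulation also only works at toy bit widths.

```python
from itertools import product

# Hypothetical flattened netlist: (output wire, gate type, input wires).
# It implements a 2-bit adder with a 3-bit result y2 y1 y0.
NETLIST = [
    ("n1", "XOR", ("a0", "b0")),
    ("n2", "AND", ("a0", "b0")),
    ("n3", "XOR", ("a1", "b1")),
    ("n4", "AND", ("a1", "b1")),
    ("n5", "AND", ("n3", "n2")),
    ("y0", "BUF", ("n1",)),
    ("y1", "XOR", ("n3", "n2")),
    ("y2", "OR", ("n4", "n5")),
]

OPS = {
    "XOR": lambda v: v[0] ^ v[1],
    "AND": lambda v: v[0] & v[1],
    "OR": lambda v: v[0] | v[1],
    "BUF": lambda v: v[0],
}

# Candidate word-level equations to test against the netlist (illustrative only).
CANDIDATES = {
    "y = a + b": lambda a, b: a + b,
    "y = a * b": lambda a, b: a * b,
    "y = a - b": lambda a, b: a - b,
}


def simulate(netlist, a, b, width=2, out_width=3):
    """Drive the input bits, evaluate gates in listed order, return y as an integer."""
    wires = {}
    for i in range(width):
        wires[f"a{i}"] = (a >> i) & 1
        wires[f"b{i}"] = (b >> i) & 1
    for out, op, ins in netlist:
        wires[out] = OPS[op]([wires[w] for w in ins])
    return sum(wires[f"y{i}"] << i for i in range(out_width))


def matching_equations(netlist, width=2, out_width=3):
    """Return the candidates that agree with the netlist on every input pair."""
    mask = (1 << out_width) - 1  # compare modulo the output width
    matches = []
    for name, fn in CANDIDATES.items():
        if all(
            simulate(netlist, a, b, width, out_width) == (fn(a, b) & mask)
            for a, b in product(range(1 << width), repeat=2)
        ):
            matches.append(name)
    return matches
```

Calling `matching_equations(NETLIST)` on this netlist returns only the addition candidate.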


Citations: 2