Michael Zuzak, Yuntao Liu, Isaac McDaniel, A. Srivastava
Logic obfuscation protects integrated circuits from an untrusted foundry attacker during manufacturing. To counter obfuscation, a number of logical (e.g. Boolean satisfiability) and physical (e.g. electro-optical probing) attacks have been proposed. By definition, these attacks use only a subset of the information leaked by a circuit to unlock it. Countermeasures often exploit the resulting blind-spots to thwart these attacks, limiting their scalability and generalizability. To overcome this, we propose a combined logical and physical attack against obfuscation called the CLAP attack. The CLAP attack leverages both the logical and physical properties of a locked circuit to prune the keyspace in a unified and theoretically-rigorous fashion, resulting in a more versatile and potent attack. To formulate the physical portion of the CLAP attack, we derive a logical formulation that provably identifies input sequences capable of sensitizing logically expressive regions in a circuit. We prove that electro-optically probing these regions infers portions of the key. For the logical portion of the attack, we integrate the physical attack results into a Boolean satisfiability attack to find the correct key. We evaluate the CLAP attack by launching it against four obfuscation schemes in benchmark circuits. The physical portion of the attack fully specified 60.6% of key bits and partially specified another 10.3%. The logical portion of the attack found the correct key in the physical-attack-limited keyspace in under 30 minutes. Thus, the CLAP attack unlocked each circuit despite obfuscation.
{"title":"A Combined Logical and Physical Attack on Logic Obfuscation","authors":"Michael Zuzak, Yuntao Liu, Isaac McDaniel, A. Srivastava","doi":"10.1145/3508352.3549349","DOIUrl":"https://doi.org/10.1145/3508352.3549349","url":null,"abstract":"Logic obfuscation protects integrated circuits from an untrusted foundry attacker during manufacturing. To counter obfuscation, a number of logical (e.g. Boolean satisfiability) and physical (e.g. electro-optical probing) attacks have been proposed. By definition, these attacks use only a subset of the information leaked by a circuit to unlock it. Countermeasures often exploit the resulting blind-spots to thwart these attacks, limiting their scalability and generalizability. To overcome this, we propose a combined logical and physical attack against obfuscation called the CLAP attack. The CLAP attack leverages both the logical and physical properties of a locked circuit to prune the keyspace in a unified and theoretically-rigorous fashion, resulting in a more versatile and potent attack. To formulate the physical portion of the CLAP attack, we derive a logical formulation that provably identifies input sequences capable of sensitizing logically expressive regions in a circuit. We prove that electro-optically probing these regions infers portions of the key. For the logical portion of the attack, we integrate the physical attack results into a Boolean satisfiability attack to find the correct key. We evaluate the CLAP attack by launching it against four obfuscation schemes in benchmark circuits. The physical portion of the attack fully specified 60.6% of key bits and partially specified another 10.3%. The logical portion of the attack found the correct key in the physical-attack-limited keyspace in under 30 minutes. Thus, the CLAP attack unlocked each circuit despite obfuscation.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133221772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Guanhui. Liu, Wei-Chen Tai, Yi-Ting Lin, I. Jiang, J. Shiely, Pu-Jen Cheng
As modern photolithography feature sizes continue to shrink, sub-resolution assist feature (SRAF) generation has become a key resolution enhancement technique to improve the manufacturing process window. State-of-the-art works resort to machine learning to overcome the deficiencies of model-based and rule-based approaches. Nevertheless, these machine learning-based methods do not consider or implicitly consider the optical interference between SRAFs, and highly rely on post-processing to satisfy SRAF mask manufacturing rules. In this paper, we are the first to generate SRAFs using reinforcement learning to address SRAF interference and produce mask-rule-compliant results directly. In this way, our two-phase learning enables us to emulate the style of model-based SRAFs while further improving the process variation (PV) band. A state alignment and action transformation mechanism is proposed to achieve orientation equivariance while expediting the training process. We also propose a transfer learning framework, allowing SRAF generation under different light sources without retraining the model. Compared with state-of-the-art works, our method improves the solution quality in terms of PV band and edge placement error (EPE) while reducing the overall runtime.
{"title":"Sub-Resolution Assist Feature Generation with Reinforcement Learning and Transfer Learning","authors":"Guanhui. Liu, Wei-Chen Tai, Yi-Ting Lin, I. Jiang, J. Shiely, Pu-Jen Cheng","doi":"10.1145/3508352.3549388","DOIUrl":"https://doi.org/10.1145/3508352.3549388","url":null,"abstract":"As modern photolithography feature sizes continue to shrink, sub-resolution assist feature (SRAF) generation has become a key resolution enhancement technique to improve the manufacturing process window. State-of-the-art works resort to machine learning to overcome the deficiencies of model-based and rule-based approaches. Nevertheless, these machine learning-based methods do not consider or implicitly consider the optical interference between SRAFs, and highly rely on post-processing to satisfy SRAF mask manufacturing rules. In this paper, we are the first to generate SRAFs using reinforcement learning to address SRAF interference and produce mask-rule-compliant results directly. In this way, our two-phase learning enables us to emulate the style of model-based SRAFs while further improving the process variation (PV) band. A state alignment and action transformation mechanism is proposed to achieve orientation equivariance while expediting the training process. We also propose a transfer learning framework, allowing SRAF generation under different light sources without retraining the model. Compared with state-of-the-art works, our method improves the solution quality in terms of PV band and edge placement error (EPE) while reducing the overall runtime.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"34 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124777595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A number of sparse linear systems are solved by sparse LU factorization in a circuit simulation process. The coefficient matrices of these linear systems have the identical structure but different values. Pivoting is usually needed in sparse LU factorization to ensure the numerical stability, which leads to the difficulty of predicting the exact dependencies for scheduling parallel LU factorization. However, the matrix values usually change smoothly in circuit simulation iterations, which provides the potential to "guess" the dependencies. This work proposes a novel parallel LU factorization algorithm with pivoting reduction, but the numerical stability is equivalent to LU factorization with pivoting. The basic idea is to reuse the previous structural and pivoting information as much as possible to perform highly-scalable parallel factorization without pivoting, which is scheduled by the "guessed" dependencies. Once a pivot is found to be too small, the remaining matrix is factorized with pivoting in a pipelined way. Comprehensive experiments including comparisons with state-of-the-art CPU- and GPU-based parallel sparse direct solvers on 66 circuit matrices and real SPICE DC simulations on 4 circuit netlists reveal the superior performance and scalability of the proposed algorithm. The proposed solver is available at https://github.com/chenxm1986/cktso.
{"title":"Numerically-Stable and Highly-Scalable Parallel LU Factorization for Circuit Simulation","authors":"Xiaoming Chen","doi":"10.1145/3508352.3549337","DOIUrl":"https://doi.org/10.1145/3508352.3549337","url":null,"abstract":"A number of sparse linear systems are solved by sparse LU factorization in a circuit simulation process. The coefficient matrices of these linear systems have the identical structure but different values. Pivoting is usually needed in sparse LU factorization to ensure the numerical stability, which leads to the difficulty of predicting the exact dependencies for scheduling parallel LU factorization. However, the matrix values usually change smoothly in circuit simulation iterations, which provides the potential to \"guess\" the dependencies. This work proposes a novel parallel LU factorization algorithm with pivoting reduction, but the numerical stability is equivalent to LU factorization with pivoting. The basic idea is to reuse the previous structural and pivoting information as much as possible to perform highly-scalable parallel factorization without pivoting, which is scheduled by the \"guessed\" dependencies. Once a pivot is found to be too small, the remaining matrix is factorized with pivoting in a pipelined way. Comprehensive experiments including comparisons with state-of-the-art CPU- and GPU-based parallel sparse direct solvers on 66 circuit matrices and real SPICE DC simulations on 4 circuit netlists reveal the superior performance and scalability of the proposed algorithm. The proposed solver is available at https://github.com/chenxm1986/cktso.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"55 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114059677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
False data injection attacks on sensor systems are an emerging threat to cyberphysical systems, creating significant risks to all application domains and, importantly, to critical infrastructures. Cyberphysical systems are process-dependent leading to differing false data injection attacks that target disruption of the specific processes (plants). We present a taxonomy of false data injection attacks, using a general model for cyberphysical systems, showing that global and continuous attacks are extremely powerful. In order to detect false data injection attacks, we describe three methods that can be employed to enable effective monitoring and detection of false data injection attacks during plant operation. Considering that sensor failures have equivalent effects to relative false data injection attacks, the methods are effective for sensor fault detection as well.
{"title":"False Data Injection Attacks on Sensor Systems","authors":"D. Serpanos","doi":"10.1145/3508352.3561098","DOIUrl":"https://doi.org/10.1145/3508352.3561098","url":null,"abstract":"False data injection attacks on sensor systems are an emerging threat to cyberphysical systems, creating significant risks to all application domains and, importantly, to critical infrastructures. Cyberphysical systems are process-dependent leading to differing false data injection attacks that target disruption of the specific processes (plants). We present a taxonomy of false data injection attacks, using a general model for cyberphysical systems, showing that global and continuous attacks are extremely powerful. In order to detect false data injection attacks, we describe three methods that can be employed to enable effective monitoring and detection of false data injection attacks during plant operation. Considering that sensor failures have equivalent effects to relative false data injection attacks, the methods are effective for sensor fault detection as well.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115624769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haoxing Ren, S. Nath, Yanqing Zhang, Hao Chen, Mingjie Liu
In this paper, we discuss the source of effectiveness of Graph Neural Networks (GNNs) in EDA, particularly in the VLSI design automation domain. We argue that the effectiveness comes from the fact that GNNs implicitly embed the prior knowledge and inductive biases associated with given VLSI tasks, which is one of the three approaches to make a learning algorithm physics-informed. These inductive biases are different to those common used in GNNs designed for other structured data, such as social networks and citation networks. We will illustrate this principle with several recent GNN examples in the VLSI domain, including predictive tasks such as switching activity prediction, timing prediction, parasitics prediction, layout symmetry prediction, as well as optimization tasks such as gate sizing and macro and cell transistor placement. We will also discuss the challenges of applications of GNN and the opportunity of applying self-supervised learning techniques with GNN for VLSI optimization.
{"title":"Why are Graph Neural Networks Effective for EDA Problems?","authors":"Haoxing Ren, S. Nath, Yanqing Zhang, Hao Chen, Mingjie Liu","doi":"10.1145/3508352.3561093","DOIUrl":"https://doi.org/10.1145/3508352.3561093","url":null,"abstract":"In this paper, we discuss the source of effectiveness of Graph Neural Networks (GNNs) in EDA, particularly in the VLSI design automation domain. We argue that the effectiveness comes from the fact that GNNs implicitly embed the prior knowledge and inductive biases associated with given VLSI tasks, which is one of the three approaches to make a learning algorithm physics-informed. These inductive biases are different to those common used in GNNs designed for other structured data, such as social networks and citation networks. We will illustrate this principle with several recent GNN examples in the VLSI domain, including predictive tasks such as switching activity prediction, timing prediction, parasitics prediction, layout symmetry prediction, as well as optimization tasks such as gate sizing and macro and cell transistor placement. We will also discuss the challenges of applications of GNN and the opportunity of applying self-supervised learning techniques with GNN for VLSI optimization.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117190081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
While the tight integration of components in heterogeneous systems has increased the popularity of the Shared-Virtual Memory (SVM) system programming model, the overhead of SVM can significantly impact end-to-end application performance. However, studying SVM implementations is difficult, as there is no open and flexible system to explore trade-offs between different SVM implementations and the SVM design space is not clearly defined. To this end, we present Qilin, the first open-source system which enables thorough study of SVM in heterogeneous computing environments for discrete accelerators. Qilin is a transparent and flexible system built on top of an open-source FPGA shell, which allows researchers to alter components of the underlying SVM implementation to understand how SVM design decisions impact performance. Using Qilin, we perform an extensive quantitative analysis on the over-heads of three SVM architectures, and generate several insights which highlight the cost and benefits of each architecture. From these insights, we propose a flowchart of how to choose the best SVM implementation given the application characteristics and the SVM capabilities of the system. Qilin also provides application developers a flexible SVM shell for high-performance virtualized applications. Optimizations enabled by Qilin can reduce the latency of translations by 6.86x compared to an open-source FPGA shell.
{"title":"Qilin: Enabling Performance Analysis and Optimization of Shared-Virtual Memory Systems with FPGA Accelerators","authors":"Edward Richter, Deming Chen","doi":"10.1145/3508352.3549431","DOIUrl":"https://doi.org/10.1145/3508352.3549431","url":null,"abstract":"While the tight integration of components in heterogeneous systems has increased the popularity of the Shared-Virtual Memory (SVM) system programming model, the overhead of SVM can significantly impact end-to-end application performance. However, studying SVM implementations is difficult, as there is no open and flexible system to explore trade-offs between different SVM implementations and the SVM design space is not clearly defined. To this end, we present Qilin, the first open-source system which enables thorough study of SVM in heterogeneous computing environments for discrete accelerators. Qilin is a transparent and flexible system built on top of an open-source FPGA shell, which allows researchers to alter components of the underlying SVM implementation to understand how SVM design decisions impact performance. Using Qilin, we perform an extensive quantitative analysis on the over-heads of three SVM architectures, and generate several insights which highlight the cost and benefits of each architecture. From these insights, we propose a flowchart of how to choose the best SVM implementation given the application characteristics and the SVM capabilities of the system. Qilin also provides application developers a flexible SVM shell for high-performance virtualized applications. Optimizations enabled by Qilin can reduce the latency of translations by 6.86x compared to an open-source FPGA shell.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126668122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Layout synthesis in quantum circuits maps the logical qubits of a synthesized circuit onto the physical qubits of a hardware device (coupling graph) and complies with the hardware limitations. Existing studies on the problem usually suffer from intractable formulation complexity and thus prohibitively long runtimes. In this paper, we propose an efficient layout synthesizer by developing a satisfiability modulo theories (SMT)-based qubit mapping checker. The proposed qubit mapping checker can efficiently derive a SWAP- free solution if one exists. If no SWAP-free solution exists for a circuit, we propose a divide-and-conquer scheme that utilizes the checker to find SWAP-free sub-solutions for sub-circuits, and the overall solution is found by merging sub-solutions with SWAP insertion. Experimental results show that the proposed optimization flow can achieve more than 3000X runtime speedup over a state- of-the-art work to derive optimal solutions for a set of SWAP-free circuits. Moreover, for the other set of benchmark circuits requiring SWAP gates, our flow achieves more than 800X speedup and obtains near-optimal solutions with only 3% SWAP overhead.
{"title":"A Robust Quantum Layout Synthesis Algorithm with a Qubit Mapping Checker*","authors":"Tsou-An Wu, Yun-Jhe Jiang, Shao-Yun Fang","doi":"10.1145/3508352.3549394","DOIUrl":"https://doi.org/10.1145/3508352.3549394","url":null,"abstract":"Layout synthesis in quantum circuits maps the logical qubits of a synthesized circuit onto the physical qubits of a hardware device (coupling graph) and complies with the hardware limitations. Existing studies on the problem usually suffer from intractable formulation complexity and thus prohibitively long runtimes. In this paper, we propose an efficient layout synthesizer by developing a satisfiability modulo theories (SMT)-based qubit mapping checker. The proposed qubit mapping checker can efficiently derive a SWAP- free solution if one exists. If no SWAP-free solution exists for a circuit, we propose a divide-and-conquer scheme that utilizes the checker to find SWAP-free sub-solutions for sub-circuits, and the overall solution is found by merging sub-solutions with SWAP insertion. Experimental results show that the proposed optimization flow can achieve more than 3000X runtime speedup over a state- of-the-art work to derive optimal solutions for a set of SWAP-free circuits. Moreover, for the other set of benchmark circuits requiring SWAP gates, our flow achieves more than 800X speedup and obtains near-optimal solutions with only 3% SWAP overhead.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"16 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126233841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yunxiang Zhang, Biao Sun, Weixiong Jiang, Y. Ha, Miao Hu, Wenfeng Zhao
Convolutional neural networks (CNNs) have been widely adopted for various machine intelligence tasks. Nevertheless, CNNs are still known to be computational demanding due to the convolutional kernels involving expensive Multiply-ACcumulate (MAC) operations. Recent proposals on hardware-optimal neural network architectures suggest that AdderNet with a lightweight ℓ1-norm based feature extraction kernel can be an efficient alternative to the CNN counterpart, where the expensive MAC operations are substituted with efficient Sum-of-Absolute-Difference (SAD) operations. Nevertheless, it lacks an efficient hardware implementation methodology for AdderNet as compared to the existing methodologies for CNNs, including efficient quantization, full-integer accelerator implementation, and judicious resource utilization of DSP slices of FPGA devices. In this paper, we present WSQ-AdderNet, a generic framework to quantize and optimize AdderNet-based accelerator designs on embedded FPGA devices. First, we propose a weight standardization technique to facilitate weight quantization in AdderNet. Second, we demonstrate a full-integer quantization hardware implementation strategy, including weight and activation quantization methodologies. Third, we apply DSP packing optimization to maximize the DSP utilization efficiency, where Octo-INT8 can be achieved via DSP-LUT co-packing. Finally, we implement the design using Xilinx Vitis HLS (high-level synthesis) and Vivado to Xilinx Kria KV-260 FPGA. Our experimental results of ResNet-20 using WSQ-AdderNet demonstrate that the implementations achieve 89.9% inference accuracy with INT8 implementation, which shows little performance loss as compared to the FP32 and INT8 CNN designs. At the hardware level, WSQ-AdderNet achieves up to 3.39× DSP density improvement with nearly the same throughput as compared to INT8 CNN design. The reduction in DSP utilization makes it possible to deploy large network models on resource-constrained devices. When further scaling up the PE sizes by 39.8%, WSQ-AdderNet can achieve 1.48× throughput improvement while still achieving 2.42× DSP density improvement.
{"title":"WSQ-AdderNet: Efficient Weight Standardization based Quantized AdderNet FPGA Accelerator Design with High-Density INT8 DSP-LUT Co-Packing Optimization","authors":"Yunxiang Zhang, Biao Sun, Weixiong Jiang, Y. Ha, Miao Hu, Wenfeng Zhao","doi":"10.1145/3508352.3549439","DOIUrl":"https://doi.org/10.1145/3508352.3549439","url":null,"abstract":"Convolutional neural networks (CNNs) have been widely adopted for various machine intelligence tasks. Nevertheless, CNNs are still known to be computational demanding due to the convolutional kernels involving expensive Multiply-ACcumulate (MAC) operations. Recent proposals on hardware-optimal neural network architectures suggest that AdderNet with a lightweight ℓ1-norm based feature extraction kernel can be an efficient alternative to the CNN counterpart, where the expensive MAC operations are substituted with efficient Sum-of-Absolute-Difference (SAD) operations. Nevertheless, it lacks an efficient hardware implementation methodology for AdderNet as compared to the existing methodologies for CNNs, including efficient quantization, full-integer accelerator implementation, and judicious resource utilization of DSP slices of FPGA devices. In this paper, we present WSQ-AdderNet, a generic framework to quantize and optimize AdderNet-based accelerator designs on embedded FPGA devices. First, we propose a weight standardization technique to facilitate weight quantization in AdderNet. Second, we demonstrate a full-integer quantization hardware implementation strategy, including weight and activation quantization methodologies. Third, we apply DSP packing optimization to maximize the DSP utilization efficiency, where Octo-INT8 can be achieved via DSP-LUT co-packing. Finally, we implement the design using Xilinx Vitis HLS (high-level synthesis) and Vivado to Xilinx Kria KV-260 FPGA. Our experimental results of ResNet-20 using WSQ-AdderNet demonstrate that the implementations achieve 89.9% inference accuracy with INT8 implementation, which shows little performance loss as compared to the FP32 and INT8 CNN designs. At the hardware level, WSQ-AdderNet achieves up to 3.39× DSP density improvement with nearly the same throughput as compared to INT8 CNN design. The reduction in DSP utilization makes it possible to deploy large network models on resource-constrained devices. When further scaling up the PE sizes by 39.8%, WSQ-AdderNet can achieve 1.48× throughput improvement while still achieving 2.42× DSP density improvement.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122044892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TFHE is a fully homomorphic encryption (FHE) scheme that evaluates Boolean gates, which we will hereafter call Tgates, over encrypted data. TFHE is considered to have higher expressive power than many existing schemes in that it is able to compute not only N-bit Arithmetic operations but also Logical/Relational ones as arbitrary ALR operations can be represented by Tgate circuits. Despite such strength, TFHE has a weakness that like all other schemes, it suffers from colossal computational overhead. Incessant efforts to reduce the overhead have been made by exploiting the inherent parallelism of FHE operations on ciphertexts. Unlike other FHE schemes, the parallelism of TFHE can be decomposed into multilayers: one inside each FHE operation (equivalent to a single Tgate) and the other between Tgates. Unfortunately, previous works focused only on exploiting the parallelism inside Tgate. However, as each N-bit operation over TFHE corresponds to a Tgate circuit constructed from multiple Tgates, it is also necessary to utilize the parallelism between Tgates for optimizing an entire operation. This paper proposes an acceleration technique to maximize performance of a TFHE N-bit operation by simultaneously utilizing both parallelism comprising the operation. To fully profit from both layers of parallelism, we have implemented our technique on a commodity CPU-FPGA hybrid machine with parallel execution capabilities in hardware. Our implementation outperforms prior ones by 2.43× in throughput and 12.19× in throughput per watt when performing N-bit operations under the 128-bit quantum security parameters.
{"title":"Accelerating N-bit Operations over TFHE on Commodity CPU-FPGA","authors":"Kevin Nam, Hyunyoung Oh, Hyungon Moon, Y. Paek","doi":"10.1145/3508352.3549413","DOIUrl":"https://doi.org/10.1145/3508352.3549413","url":null,"abstract":"TFHE is a fully homomorphic encryption (FHE) scheme that evaluates Boolean gates, which we will hereafter call Tgates, over encrypted data. TFHE is considered to have higher expressive power than many existing schemes in that it is able to compute not only N-bit Arithmetic operations but also Logical/Relational ones as arbitrary ALR operations can be represented by Tgate circuits. Despite such strength, TFHE has a weakness that like all other schemes, it suffers from colossal computational overhead. Incessant efforts to reduce the overhead have been made by exploiting the inherent parallelism of FHE operations on ciphertexts. Unlike other FHE schemes, the parallelism of TFHE can be decomposed into multilayers: one inside each FHE operation (equivalent to a single Tgate) and the other between Tgates. Unfortunately, previous works focused only on exploiting the parallelism inside Tgate. However, as each N-bit operation over TFHE corresponds to a Tgate circuit constructed from multiple Tgates, it is also necessary to utilize the parallelism between Tgates for optimizing an entire operation. This paper proposes an acceleration technique to maximize performance of a TFHE N-bit operation by simultaneously utilizing both parallelism comprising the operation. To fully profit from both layers of parallelism, we have implemented our technique on a commodity CPU-FPGA hybrid machine with parallel execution capabilities in hardware. Our implementation outperforms prior ones by 2.43× in throughput and 12.19× in throughput per watt when performing N-bit operations under the 128-bit quantum security parameters.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115045801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chung-Han Chou, Chih-Jen Hsu, Chi-An Wu, Kuan-Hua Tu
Extracting circuit functionality from a gate-level netlist is critical in CAD tools. For security, it helps designers to detect hardware Trojans or malicious design changes in the netlist with third-party resources such as fabrication services and soft/hard IP cores. For verification, it can reduce the complexity and effort of keeping design information in aggressive optimization strategies adopted by synthesis tools. For Engineering Change Order (ECO), it can keep the designer from locating the ECO gate in a sea of bit-level gates.In this contest, we formulated a datapath learning and extraction problem. With a set of benchmarks and an evaluation metric, we expect contestants to develop a tool to learn the arithmetic equations from a synthesized gate-level netlist.
{"title":"2022 CAD Contest Problem A: Learning Arithmetic Operations from Gate-Level Circuit","authors":"Chung-Han Chou, Chih-Jen Hsu, Chi-An Wu, Kuan-Hua Tu","doi":"10.1145/3508352.3561107","DOIUrl":"https://doi.org/10.1145/3508352.3561107","url":null,"abstract":"Extracting circuit functionality from a gate-level netlist is critical in CAD tools. For security, it helps designers to detect hardware Trojans or malicious design changes in the netlist with third-party resources such as fabrication services and soft/hard IP cores. For verification, it can reduce the complexity and effort of keeping design information in aggressive optimization strategies adopted by synthesis tools. For Engineering Change Order (ECO), it can keep the designer from locating the ECO gate in a sea of bit-level gates.In this contest, we formulated a datapath learning and extraction problem. With a set of benchmarks and an evaluation metric, we expect contestants to develop a tool to learn the arithmetic equations from a synthesized gate-level netlist.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122464941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}