Split Compilation for Security of Quantum Circuits
Pub Date: 2021-11-01 | DOI: 10.1109/ICCAD51958.2021.9643478
Abdullah Ash-Saki, A. Suresh, R. Topaloglu, Swaroop Ghosh
An efficient quantum circuit (program) compiler aims to minimize the gate count, through efficient instruction translation, routing, and gate cancellation, to improve run-time and noise. Therefore, a high-efficiency compiler is paramount to enable the game-changing promises of quantum computers. To date, quantum computing hardware providers offer software stacks supporting their own hardware. However, several third-party software toolchains, including compilers, are emerging. They support hardware from different vendors and potentially offer better efficiency. As the quantum computing ecosystem becomes more popular and practical, it is only prudent to assume that more companies will start offering software-as-a-service for quantum computers, including high-performance compilers. With the emergence of third-party compilers, security and privacy issues for quantum intellectual property (IP) will follow. A quantum circuit can encode sensitive information such as critical financial analyses and proprietary algorithms. Therefore, submitting quantum circuits to untrusted compilers creates opportunities for adversaries to steal IP. In this paper, we present a split compilation methodology to secure IP from untrusted compilers while still taking advantage of their optimizations. In this methodology, a quantum circuit is split into multiple parts that are sent to a single compiler at different times or to multiple compilers, so that any single adversary has access to only partial information. Analyzing over 152 circuits on three IBM hardware architectures, we demonstrate that split compilation can completely secure IP (when multiple compilers are used) or can force factorial-time reconstruction complexity, while incurring a modest overhead (~3% to ~6% on average).
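To make the splitting idea concrete, here is a minimal sketch in Python, assuming a plain gate-list circuit representation; `untrusted_compile` is a hypothetical stand-in for a third-party compiler service, and the contiguous-slice policy is a simplification, not the paper's exact splitting scheme.

```python
# Split a circuit's gate sequence and send each slice to a separate
# (untrusted) compiler, so no single compiler sees the whole circuit.
from typing import List, Tuple

Gate = Tuple[str, Tuple[int, ...]]  # (gate name, qubit indices)

def split_circuit(gates: List[Gate], n_parts: int) -> List[List[Gate]]:
    """Cut the gate sequence into contiguous slices, one per compiler."""
    size = -(-len(gates) // n_parts)  # ceiling division
    return [gates[i:i + size] for i in range(0, len(gates), size)]

def untrusted_compile(part: List[Gate]) -> List[Gate]:
    # Placeholder: a real service would optimize/route the sub-circuit.
    return part

def split_compile(gates: List[Gate], n_parts: int) -> List[Gate]:
    parts = split_circuit(gates, n_parts)
    compiled = [untrusted_compile(p) for p in parts]  # each sees one slice only
    return [g for part in compiled for g in part]     # trusted client stitches

toy_circuit = [("h", (0,)), ("cx", (0, 1)), ("rz", (2,)), ("cx", (1, 2))]
print(split_compile(toy_circuit, n_parts=2))
```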
{"title":"Split Compilation for Security of Quantum Circuits","authors":"Abdullah Ash-Saki, A. Suresh, R. Topaloglu, Swaroop Ghosh","doi":"10.1109/ICCAD51958.2021.9643478","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643478","url":null,"abstract":"An efficient quantum circuit (program) compiler aims to minimize the gate-count - through efficient instruction translation, routing, gate, and cancellation - to improve run-time and noise. Therefore, a high-efficiency compiler is paramount to enable the game-changing promises of quantum computers. To date, the quantum computing hardware providers are offering a software stack supporting their hardware. However, several third-party software toolchains, including compilers, are emerging. They support hardware from different vendors and potentially offer better efficiency. As the quantum computing ecosystem becomes more popular and practical, it is only prudent to assume that more companies will start offering software-as-a-service for quantum computers, including high-performance compilers. With the emergence of third-party compilers, the security and privacy issues of quantum intellectual properties (IPs) will follow. A quantum circuit can include sensitive information such as critical financial analysis and proprietary algorithms. Therefore, submitting quantum circuits to untrusted compilers creates opportunities for adversaries to steal IPs. In this paper, we present a split compilation methodology to secure IPs from untrusted compilers while taking advantage of their optimizations. In this methodology, a quantum circuit is split into multiple parts that are sent to a single compiler at different times or to multiple compilers. In this way, the adversary has access to partial information. With analysis of over 152 circuits on three IBM hardware architectures, we demonstrate the split compilation methodology can completely secure IPs (when multiple compilers are used) or can introduce factorial time reconstruction complexity while incurring a modest overhead (~ 3% to ~ 6% on average).","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126820779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Binarized SNNs: Efficient and Error-Resilient Spiking Neural Networks through Binarization
Pub Date: 2021-11-01 | DOI: 10.1109/ICCAD51958.2021.9643463
M. Wei, Mikail Yayla, S. Ho, Jian-Jia Chen, Chia-Lin Yang, H. Amrouch
Spiking Neural Networks (SNNs) are considered the third generation of NNs and can reach accuracy similar to conventional deep NNs with considerably better efficiency. However, to achieve high accuracy, state-of-the-art SNNs employ stochastic spike coding of the inputs, requiring multiple cycles of computation. Because of this, and due to the nature of analog computing, the charges of multiple cycles must be accumulated and held, necessitating a large membrane capacitor. This results in high energy, long latency, and expensive area costs, constituting one of the major bottlenecks in analog SNN implementations. The membrane capacitor size determines the precision of the firing time; hence, reducing the capacitor size considerably degrades inference accuracy. To alleviate this, we focus on bridging the gap between binarized NNs (BNNs) and SNNs. BNNs are rapidly emerging as an attractive alternative to conventional NNs due to their high efficiency and error tolerance. In this work, we evaluate the impact of deploying error-resilient BNNs, i.e., BNNs proactively trained in the presence of errors, on analog implementations of SNNs. We show that with BNNs, the capacitor size and latency can be reduced significantly compared to state-of-the-art SNNs, which employ multi-bit models. Our experiments demonstrate that when error-resilient BNNs are deployed on an analog SNN accelerator, the membrane capacitor size is reduced by 50%, inference latency decreases by two orders of magnitude, and energy is reduced by 57% compared to a baseline 4-bit SNN implementation, at minimal accuracy cost.
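As a side note on why binarization is so hardware-friendly, the sketch below (our own illustration, not the paper's training flow) shows the standard identity that a dot product over ±1 values reduces to XNOR plus popcount on bit-packed operands.

```python
# For w, x in {-1, +1}^N:  w . x = N - 2 * popcount(bits(w) XOR bits(x)),
# since each equal bit pair contributes +1 and each differing pair -1.
import numpy as np

rng = np.random.default_rng(0)
w = np.where(rng.standard_normal(64) > 0, 1, -1)  # binarized weights
x = np.where(rng.standard_normal(64) > 0, 1, -1)  # binarized activations

ref = int(w @ x)                     # reference dot product on +/-1 values

wb = (w > 0).astype(np.uint8)        # map +1 -> 1, -1 -> 0
xb = (x > 0).astype(np.uint8)
xnor_dot = len(w) - 2 * int(np.count_nonzero(wb ^ xb))

assert xnor_dot == ref
print(ref, xnor_dot)
```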
{"title":"Binarized SNNs: Efficient and Error-Resilient Spiking Neural Networks through Binarization","authors":"M. Wei, Mikail Yayla, S. Ho, Jian-Jia Chen, Chia-Lin Yang, H. Amrouch","doi":"10.1109/ICCAD51958.2021.9643463","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643463","url":null,"abstract":"Spiking Neural Networks (SNNs) are considered the third generation of NNs and can reach similar accuracy as conventional deep NNs, but with a considerable improvement in efficiency. However, to achieve high accuracy, state-of-the-art SNNs employ stochastic spike coding of the inputs, requiring multiple cycles of computation. Because of this and due to the nature of analog computing, it is required to accumulate and hold the charges of multiple cycles, necessitating a large membrane capacitor. This results in high energy, long latency, and expensive area costs, constituting one of the major bottlenecks in analog SNN implementations. Membrane capacitor size determines the precision of the firing time. Hence reducing the capacitor size considerably degrades the inference accuracy. To alleviate this, we focus on bridging the gap between binarized NNs (BNNs) and SNNs. BNNs are rapidly emerging as an attractive alternative for NNs due to their high efficiency and error tolerance. In this work, we evaluate the impact of deploying error-resilient BNNs, i.e. BNNs that have been proactively trained in the presence of errors, on analog implementation of SNNs. We show that for BNNs, the capacitor size and latency can be reduced significantly compared to state-of-the-art SNNs, which employ multi-bit models. Our experiments demonstrate that when error-resilient BNNs are deployed on analog-based SNN accelerator, the size of the membrane capacitor is reduced by 50%, the inference latency is decreased by two orders of magnitude, and energy is reduced by 57% compared to the baseline 4-bit SNN implementation, under minimal accuracy cost.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128199812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HASHTAG: Hash Signatures for Online Detection of Fault-Injection Attacks on Deep Neural Networks
Pub Date: 2021-11-01 | DOI: 10.1109/ICCAD51958.2021.9643556
Mojan Javaheripi, F. Koushanfar
We propose Hashtag, the first framework that enables high-accuracy detection of fault-injection attacks on Deep Neural Networks (DNNs) with provable bounds on detection performance. Recent literature on fault-injection attacks shows severe DNN accuracy degradation caused by bit flips. In this scenario, the attacker changes a few weight bits during DNN execution by tampering with the program's DRAM memory. To detect runtime bit flips, Hashtag extracts a unique signature from the benign DNN prior to deployment. The signature is later used to validate the integrity of the DNN and verify the inference output on the fly. We propose a novel sensitivity analysis scheme that accurately identifies the DNN layers most vulnerable to fault injection. The DNN signature is then constructed by encoding the weights in the vulnerable layers using a low-collision hash function. When the DNN is deployed, new hashes are extracted from the target layers during inference and compared against the ground-truth signatures. Hashtag incorporates a lightweight methodology that ensures low-overhead, real-time fault detection on embedded platforms. Extensive evaluations with the state-of-the-art bit-flip attack on various DNNs demonstrate the competitive advantage of Hashtag in terms of both attack detection and execution overhead.
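A minimal sketch of the signature-check mechanism described above, assuming the vulnerable layers have already been identified by the sensitivity analysis; SHA-256 stands in for the paper's low-collision hash, and all layer names and shapes are illustrative.

```python
# Offline: hash the weights of the vulnerable layers of the benign model.
# Online: re-hash the same layers and compare before trusting the output.
import hashlib
import numpy as np

def layer_signature(weights: np.ndarray) -> str:
    return hashlib.sha256(weights.tobytes()).hexdigest()

rng = np.random.default_rng(1)
vulnerable_layers = {"conv3": rng.standard_normal((16, 16)).astype(np.float32)}

# Ground-truth signatures recorded prior to deployment.
signatures = {name: layer_signature(w) for name, w in vulnerable_layers.items()}

# Simulate a fault injection: flip one bit of one weight in memory.
attacked = vulnerable_layers["conv3"].copy()
flat = attacked.reshape(-1).view(np.uint32)
flat[7] ^= 1 << 30  # single bit flip in one float32 word

ok = layer_signature(attacked) == signatures["conv3"]
print("integrity check passed" if ok else "bit-flip detected")
```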
{"title":"HASHTAG: Hash Signatures for Online Detection of Fault-Injection Attacks on Deep Neural Networks","authors":"Mojan Javaheripi, F. Koushanfar","doi":"10.1109/ICCAD51958.2021.9643556","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643556","url":null,"abstract":"We propose Hashtag, the first framework that enables high-accuracy detection of fault-injection attacks on Deep Neural Networks (DNNs) with provable bounds on detection performance. Recent literature in fault-injection attacks shows the severe DNN accuracy degradation caused by bit flips. In this scenario, the attacker changes a few weight bits during DNN execution by tampering with the program's DRAM memory. To detect runtime bit flips, Hashtag extracts a unique signature from the benign DNN prior to deployment. The signature is later used to validate the integrity of the DNN and verify the inference output on the fly. We propose a novel sensitivity analysis scheme that accurately identifies the most vulnerable DNN layers to the fault-injection attack. The DNN signature is then constructed by encoding the underlying weights in the vulnerable layers using a low-collision hash function. When the DNN is deployed, new hashes are extracted from the target layers during inference and compared against the ground-truth signatures. Hashtag incorporates a lightweight methodology that ensures a low-overhead and real-time fault detection on embedded platforms. Extensive evaluations with the state-of-the-art bit-flip attack on various DNNs demonstrate the competitive advantage of Hashtag in terms of both attack detection and execution overhead.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"130 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123862305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DARe: DropLayer-Aware Manycore ReRAM architecture for Training Graph Neural Networks
Pub Date: 2021-11-01 | DOI: 10.1109/ICCAD51958.2021.9643511
Aqeeb Iqbal Arka, Biresh Kumar Joardar, J. Doppa, P. Pande, K. Chakrabarty
Graph Neural Networks (GNNs) are a variant of Deep Neural Networks (DNNs) that operate on graphs, combining attributes of both DNNs and graph computation. However, training GNNs on manycore architectures is challenging because it involves heavy communication that bottlenecks performance. DropEdge and Dropout, which we collectively refer to as DropLayer, are regularization techniques that can improve the predictive accuracy of GNNs; moreover, when implemented on a manycore architecture, they are also capable of reducing on-chip traffic. In this paper, we present DARe, a ReRAM-based 3D manycore architecture tailored for accelerating on-chip training of GNNs. The key component of the DARe architecture is a Network-on-Chip (NoC) that reduces the amount of communication using DropLayer; the reduced traffic prevents communication hotspots and leads to better performance. We demonstrate that DARe outperforms conventional GPUs by up to 6.7X (5.6X on average) in execution time, while being up to 30X (23X on average) more energy efficient for GNN training.
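For concreteness, here is a small sketch of DropEdge, one of the two DropLayer techniques: each training iteration keeps only a random subset of edges, which is precisely what cuts message traffic when the computation is mapped onto a NoC. The graph and drop rate below are invented.

```python
# DropEdge: sample a random edge subset per training iteration.
import numpy as np

def drop_edge(edge_index: np.ndarray, p_drop: float, rng) -> np.ndarray:
    """edge_index: (2, E) array of (src, dst) pairs; returns a sampled subset."""
    keep = rng.random(edge_index.shape[1]) >= p_drop
    return edge_index[:, keep]

rng = np.random.default_rng(0)
edges = np.array([[0, 0, 1, 2, 3],
                  [1, 2, 2, 3, 0]])  # toy 4-node graph, 5 directed edges
for epoch in range(3):
    sampled = drop_edge(edges, p_drop=0.4, rng=rng)
    print(f"epoch {epoch}: {sampled.shape[1]}/{edges.shape[1]} edges kept")
```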
{"title":"DARe: DropLayer-Aware Manycore ReRAM architecture for Training Graph Neural Networks","authors":"Aqeeb Iqbal Arka, Biresh Kumar Joardar, J. Doppa, P. Pande, K. Chakrabarty","doi":"10.1109/ICCAD51958.2021.9643511","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643511","url":null,"abstract":"Graph Neural Networks (GNNs) are a variant of Deep Neural Networks (DNNs) operating on graphs. GNNs have attributes of both DNNs and graph computation. However, training GNNs on manycore architectures is a challenging task because it involves heavy communication that bottlenecks performance. DropEdge and Dropout, which we collectively refer to as DropLayer, are regularization techniques that can improve the predictive accuracy of GNNs. Moreover, when implemented on a manycore architecture, DropEdge and Dropout are capable of reducing the on-chip traffic. In this paper, we present a ReRAM-based 3D manycore architecture called DARe, tailored for accelerating on-chip training of GNNs. The key component of the DARe architecture is a Network-on-Chip (NoC) that reduces the amount of communication using DropLayer. The reduced traffic prevents communication hotspots and leads to better performance. We demonstrate that DARe outperforms conventional GPUs by up to 6.7X (5.6X on average) in terms of execution time, while being up to 30X (23X on average) more energy efficient for GNN training.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116133461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
2021 ICCAD CAD Contest Problem C: GPU Accelerated Logic Rewriting
Pub Date: 2021-11-01 | DOI: 10.1109/ICCAD51958.2021.9643521
G. Pasandi, Sreedhar Pratty, David Brown, Yanqing Zhang, Haoxing Ren, Brucek Khailany
Logic rewriting is an important optimization that can improve Quality of Results (QoR) in modern VLSI circuits. It is usually a greedy procedure involving steps such as graph traversal, cut computation and ranking, and functional matching. For logic rewriting to be effective in improving QoR, many local rewriting iterations are needed, which can be very slow for industrial-scale benchmark circuits. One effective way to speed up logic rewriting is to offload its time-consuming steps to Graphics Processing Units (GPUs) to benefit from the massively parallel computation available there. In this regard, the present contest problem studies the possibility of using GPUs to accelerate a classical logic rewriting function. State-of-the-art large-scale open-source benchmark circuits as well as industrial-level designs will be used to test the GPU-accelerated logic rewriting function.
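To convey the per-node work a rewriting pass repeats millions of times, here is a sketch of classic k-feasible cut enumeration over an AND-graph; ranking and functional matching are omitted, and this is a generic textbook formulation rather than the contest's reference code.

```python
# Bottom-up cut enumeration: cuts(n) = trivial cut {n} plus every union of a
# cut of each fanin that stays within k leaves and is not dominated (a cut is
# dominated if another cut is a subset of it).
from itertools import product

def enumerate_cuts(nodes, k=4):
    """nodes: dict node -> (fanin0, fanin1), or None for primary inputs.
    Assumes the dict is in topological order (inputs first)."""
    cuts = {}
    for n, fanins in nodes.items():
        if fanins is None:
            cuts[n] = [{n}]
            continue
        a, b = fanins
        merged = [{n}]  # the trivial cut
        for c0, c1 in product(cuts[a], cuts[b]):
            c = c0 | c1
            if len(c) <= k and not any(d <= c for d in merged):
                merged.append(c)
        cuts[n] = merged
    return cuts

# Tiny AIG: i1, i2, i3 are inputs; n4 = i1 & i2, n5 = n4 & i3.
aig = {"i1": None, "i2": None, "i3": None,
       "n4": ("i1", "i2"), "n5": ("n4", "i3")}
for node, cs in enumerate_cuts(aig).items():
    print(node, [sorted(c) for c in cs])
```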
{"title":"2021 ICCAD CAD Contest Problem C: GPU Accelerated Logic Rewriting","authors":"G. Pasandi, Sreedhar Pratty, David Brown, Yanqing Zhang, Haoxing Ren, Brucek Khailany","doi":"10.1109/ICCAD51958.2021.9643521","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643521","url":null,"abstract":"Logic rewriting is an important optimization function that can improve Quality of Results (QoR) in modern VLSI circuits. This optimization function usually has a greedy approach and involves steps such as graph traversal, cut computation and ranking, and functional matching. For logic rewriting to be effective in improving the QoR, there should be many local rewriting iterations which can be very slow for industrial level benchmark circuits. One effective solution to speed up the logic rewriting operation is to upload its time consuming steps to Graphics Processing Units (GPUs) to benefit from massively parallel computations that is available there. In this regard, the present contest problem studies the possibility of using GPUs in accelerating a classical logic rewriting function. State-of-the-art large-scale open-source benchmark circuits as well as industrial-level designs will be used to test the GPU accelerated logic rewriting function.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121496279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Circuit Deobfuscation from Power Side-Channels using Pseudo-Boolean SAT
Pub Date: 2021-11-01 | DOI: 10.1109/ICCAD51958.2021.9643495
Kaveh Shamsi, Yier Jin
The problem of inferring the values of internal nets in a circuit from its power side-channels has been the topic of extensive research over the past two decades, with several frameworks developed, mostly focusing on cryptographic hardware. In this paper, we focus on breaking logic locking, a technique in which an original circuit is made ambiguous by inserting unknown “key” bits into it, via power side-channels. We present a pair of attack algorithms we term PowerSAT attacks, which take in arbitrary keyed circuits and resolve key information by interacting adaptively with a side-channel “oracle”. They are based on the query-by-disagreement scheme used in functional SAT attacks against locking, but utilize pseudo-Boolean constraints to allow reasoning about Hamming-weight power models. We present a software implementation of the attacks along with techniques for speeding them up, as well as simulation- and FPGA-based experiments. Notably, we demonstrate the extraction of a 32-bit key from a comparator circuit with a $2^{31}$ functional query complexity in ~64 chosen power side-channel queries using the PowerSAT attack, where traditional CPA fails given 1000 random traces. We release a binary of our implementation along with the FPGA + scope HDL/setup used for the experiments.
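A toy version of the query model behind the attack, assuming a leakage oracle that reports only the Hamming weight of one key-dependent byte: each chosen query prunes the set of keys consistent with all observations so far. The real PowerSAT attack encodes this pruning as pseudo-Boolean SAT constraints over keyed netlists and picks queries adaptively by disagreement; this brute-force sweep is only illustrative.

```python
# `internal_value` is a hypothetical key-dependent net (a simple XOR here).
def internal_value(key: int, x: int) -> int:
    return (key ^ x) & 0xFF

def hw(v: int) -> int:
    return bin(v).count("1")   # Hamming weight of an 8-bit value

secret = 0xA7                  # unknown to the attacker
candidates = set(range(256))   # all possible 8-bit keys
queries = 0
for x in range(256):           # fixed sweep; the real attack chooses x adaptively
    if len(candidates) == 1:
        break
    leak = hw(internal_value(secret, x))   # observed power-model output
    candidates = {k for k in candidates if hw(internal_value(k, x)) == leak}
    queries += 1
print(f"recovered key(s) {sorted(candidates)} after {queries} queries")
```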
{"title":"Circuit Deobfuscation from Power Side-Channels using Pseudo-Boolean SAT","authors":"Kaveh Shamsi, Yier Jin","doi":"10.1109/ICCAD51958.2021.9643495","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643495","url":null,"abstract":"The problem of inferring the value of internal nets in a circuit from its power side-channels has been the topic of extensive research over the past two decades, with several frameworks developed mostly focusing on cryptographic hardware. In this paper, we focus on the problem of breaking logic locking, a technique in which an original circuit is made ambiguous by inserting unknown “key” bits into it, via power side-channels. We present a pair of attack algorithms we term PowerSAT attacks, which take in arbitrary keyed circuits and resolve key information by interacting adaptively with a side-channel “oracle”. They are based on the query-by-disagreement scheme used in functional SAT attacks against locking but utilize Psuedo-Boolean constraints to allow for reasoning about hamming-weight power models. We present a software implementation of the attacks along with techniques for speeding them up. We present simulation and FPGA-based experiments as well. Notably, we demonstrate the extraction of a 32-bit key from a comparator circuit with a $2^{31}$ functional query complexity, in $sim 64$ chosen power side-channel queries using the PowerSAT attack, where traditional CPA fails given 1000 random traces. We release a binary of our implementation along with the FPGA $+mathbf{scope} mathbf{HDL}/mathbf{setup}$ used for the experiments.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"153 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122688011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A High-Performance Accelerator for Super-Resolution Processing on Embedded GPU
Pub Date: 2021-11-01 | DOI: 10.1109/ICCAD51958.2021.9643472
W. Zhao, Qi Sun, Yang Bai, Wenbo Li, Haisheng Zheng, Bei Yu, Martin D. F. Wong
Recent years have witnessed impressive progress in super-resolution (SR) processing. However, its real-time inference requirement poses a challenge not only for model design but also for on-chip implementation. In this paper, we implement a full-stack SR acceleration framework on embedded GPU devices. We analyze in detail the dictionary learning algorithm used in SR models and accelerate it via a novel dictionary-selective strategy. In addition, we analyze the hardware programming architecture together with the model structure to guide the optimal design of the computation kernels, minimizing inference latency under the resource constraints. With these techniques, the communication and computation bottlenecks in deep dictionary-learning-based SR models are effectively tackled. Experiments on the embedded NVIDIA NX and the 2080Ti show that our method significantly outperforms the state-of-the-art NVIDIA TensorRT and achieves real-time performance.
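The abstract does not spell out the selective strategy, but a generic reading of "dictionary selection" is: code each patch against only its top-k most correlated atoms, shrinking the per-patch linear algebra. The sketch below is our own reconstruction under that assumption, with invented sizes, not the paper's kernels.

```python
# Select the k dictionary atoms best correlated with a patch, then solve a
# much smaller least-squares problem instead of coding against all atoms.
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 512))                 # 512 atoms of dimension 64
D /= np.linalg.norm(D, axis=0, keepdims=True)      # unit-norm atoms
patch = rng.standard_normal(64)

k = 32
scores = np.abs(D.T @ patch)                       # correlation with the patch
top = np.argsort(scores)[-k:]                      # indices of the k best atoms
D_sel = D[:, top]                                  # reduced 64x32 problem
coeffs, *_ = np.linalg.lstsq(D_sel, patch, rcond=None)
recon = D_sel @ coeffs
print("relative error:", np.linalg.norm(patch - recon) / np.linalg.norm(patch))
```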
{"title":"A High-Performance Accelerator for Super-Resolution Processing on Embedded GPU","authors":"W. Zhao, Qi Sun, Yang Bai, Wenbo Li, Haisheng Zheng, Bei Yu, Martin D. F. Wong","doi":"10.1109/ICCAD51958.2021.9643472","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643472","url":null,"abstract":"Recent years have witnessed impressive progress in super-resolution (SR) processing. However, its real-time inference requirement sets a challenge not only for the model design but also for the on-chip implementation. In this paper, we implement a full-stack SR acceleration framework on embedded GPU devices. The special dictionary learning algorithm used in SR models was analyzed in detail and accelerated via a novel dictionary selective strategy. Besides, the hardware programming architecture together with the model structure is analyzed to guide the optimal design of computation kernels to minimize the inference latency under the resource constraints. With these novel techniques, the communication and computation bottlenecks in the deep dictionary learning-based SR models are tackled perfectly. The experiments on the edge embedded NVIDIA NX and 2080Ti show that our method outperforms the state-of-the-art NVIDIA TensorRT significantly and can achieve real-time performance.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122722302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GraphLily: Accelerating Graph Linear Algebra on HBM-Equipped FPGAs
Pub Date: 2021-11-01 | DOI: 10.1109/ICCAD51958.2021.9643582
Yuwei Hu, Yixiao Du, Ecenur Ustun, Zhiru Zhang
Graph processing is typically memory bound due to a low compute-to-memory-access ratio and irregular data access patterns. The emerging high-bandwidth memory (HBM) delivers exceptional bandwidth by providing multiple channels that can service memory requests concurrently, thus bringing the potential to significantly boost the performance of graph processing. This paper proposes GraphLily, a graph linear algebra overlay, to accelerate graph processing on HBM-equipped FPGAs. GraphLily supports a rich set of graph algorithms by adopting the GraphBLAS programming interface, which formulates graph algorithms as sparse linear algebra operations. GraphLily provides efficient, memory-optimized accelerators for the two widely-used kernels in GraphBLAS, namely, sparse-matrix dense-vector multiplication (SpMV) and sparse-matrix sparse-vector multiplication (SpMSpV). The SpMV accelerator uses a sparse matrix storage format tailored to HBM that enables streaming, vectorized accesses to each channel and concurrent accesses to multiple channels. Besides, the SpMV accelerator exploits data reuse in accesses of the dense vector by introducing a scalable on-chip buffer design. The SpMSpV accelerator complements the SpMV accelerator to handle cases where the input vector has high sparsity. GraphLily further builds a middleware to provide runtime support. With this middleware, we can port existing GraphBLAS programs to FPGAs with slight modifications to the original code intended for CPU/GPU execution. Evaluation shows that compared with state-of-the-art graph processing frameworks on CPUs and GPUs, GraphLily achieves up to 2.5x and 1.1x higher throughput, while reducing energy consumption by 8.1x and 2.4x; compared with prior single-purpose graph accelerators on FPGAs, GraphLily achieves 1.2x-1.9x higher throughput.
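The GraphBLAS idea that GraphLily accelerates is that graph traversal becomes repeated sparse-matrix/vector products. As a software reference for what the SpMV kernel computes, here is a BFS level sweep written with scipy; the graph is a toy example.

```python
# BFS as repeated SpMV: one sparse matrix-vector product per frontier level.
import numpy as np
import scipy.sparse as sp

# Adjacency of a small directed graph: A[i, j] = 1 means an edge i -> j.
A = sp.csr_matrix(np.array([
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]))

frontier = np.array([1, 0, 0, 0])       # start BFS at vertex 0
visited = frontier.astype(bool)
level = 0
while frontier.any():
    reached = A.T @ frontier            # one SpMV per BFS level
    nxt = (reached > 0) & ~visited      # mask out already-visited vertices
    visited |= nxt
    frontier = nxt.astype(int)
    level += 1
print("BFS sweeps:", level, "visited:", visited)
```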
{"title":"GraphLily: Accelerating Graph Linear Algebra on HBM-Equipped FPGAs","authors":"Yuwei Hu, Yixiao Du, Ecenur Ustun, Zhiru Zhang","doi":"10.1109/ICCAD51958.2021.9643582","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643582","url":null,"abstract":"Graph processing is typically memory bound due to low compute to memory access ratio and irregular data access pattern. The emerging high-bandwidth memory (HBM) delivers exceptional bandwidth by providing multiple channels that can service memory requests concurrently, thus bringing the potential to significantly boost the performance of graph processing. This paper proposes GraphLily, a graph linear algebra overlay, to accelerate graph processing on HBM-equipped FPGAs. GraphLily supports a rich set of graph algorithms by adopting the GraphBLAS programming interface, which formulates graph algorithms as sparse linear algebra operations. GraphLily provides efficient, memory-optimized accelerators for the two widely-used kernels in GraphBLAS, namely, sparse-matrix dense-vector multiplication (SpMV) and sparse-matrix sparse-vector multiplication (SpMSpV). The SpMV accelerator uses a sparse matrix storage format tailored to HBM that enables streaming, vectorized accesses to each channel and concurrent accesses to multiple channels. Besides, the SpMV accelerator exploits data reuse in accesses of the dense vector by introducing a scalable on-chip buffer design. The SpMSpV accelerator complements the SpMV accelerator to handle cases where the input vector has a high sparsity. GraphLily further builds a middleware to provide runtime support. With this middleware, we can port existing GraphBLAS programs to FPGAs with slight modifications to the original code intended for CPU/GPU execution. Evaluation shows that compared with state-of-the-art graph processing frameworks on CPUs and GPUs, GraphLily achieves up to 2.5 x and 1.1 x higher throughput, while reducing the energy consumption by 8.1 x and 2.4 x; compared with prior single-purpose graph accelerators on FPGAs, GraphLily achieves 1.2 x -1.9 x higher throughput.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124845755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Traffic-Adaptive Power Reconfiguration for Energy-Efficient and Energy-Proportional Optical Interconnects
Pub Date: 2021-11-01 | DOI: 10.1109/ICCAD51958.2021.9643475
Yuyang Wang, K. Cheng
Silicon microring-based optical interconnects offer great potential for high-bandwidth data communication in future datacenters and high-performance computing systems. However, the lack of effective runtime power management strategies for optical links, especially during idle or low-utilization periods, is devastating to the energy efficiency and energy proportionality of the network. In this study, we propose Polestar (POwer LEvel Scaling with Traffic-Adaptive Reconfiguration) for microring-based optical interconnects. Polestar offers a collection of runtime reconfiguration strategies that target the power states of the lasers and the microring tuning circuitry. The power-state reconfiguration mechanism is traffic-adaptive, exploiting the trade-off between energy saving and application execution time. Evaluation of Polestar with production datacenter traces demonstrates up to 87% reduction in pJ/b consumption and significant improvements in energy-proportionality metrics, notably outperforming existing strategies.
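As a condensed illustration of traffic-adaptive power-level scaling, the sketch below chooses among hypothetical link power states based on a sliding window of utilization; the states, thresholds, and window policy are invented for illustration and are not Polestar's actual reconfiguration rules.

```python
# Pick the lowest-power link state that recent traffic can justify.
from collections import deque

POWER_STATES = [            # (name, relative power, wake-up penalty in cycles)
    ("active", 1.00, 0),
    ("laser_dim", 0.55, 40),
    ("deep_sleep", 0.10, 800),
]

def choose_state(window):
    util = sum(window) / len(window)
    if util > 0.30:
        return POWER_STATES[0]
    if util > 0.02:
        return POWER_STATES[1]
    return POWER_STATES[2]

window = deque(maxlen=16)   # sliding window of per-epoch link utilization
for util in [0.8, 0.6, 0.2, 0.05, 0.01, 0.0, 0.0, 0.4]:
    window.append(util)
    name, power, penalty = choose_state(window)
    print(f"util={util:.2f} -> {name} (power={power:.2f}, wake={penalty} cyc)")
```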
{"title":"Traffic-Adaptive Power Reconfiguration for Energy-Efficient and Energy-Proportional Optical Interconnects","authors":"Yuyang Wang, K. Cheng","doi":"10.1109/ICCAD51958.2021.9643475","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643475","url":null,"abstract":"Silicon microring-based optical interconnects offer great potential for high-bandwidth data communication in future datacenters and high-performance computing systems. However, a lack of effective runtime power management strategies for optical links, especially during idle or low-utilization periods, is devastating to the energy efficiency and the energy proportionality of the network. In this study, we propose Polestar, i.e., POwer LEvel Scaling with Traffic-Adaptive Reconfiguration, for microring-based optical interconnects. Polestar offers a collection of runtime reconfiguration strategies that target the power states of the lasers and the microring tuning circuitry. The reconfiguration mechanism of the power states is traffic-adaptive for exploiting the trade-off between energy saving and application execution time. The evaluation of Polestar with production datacenter traces demonstrates up to 87 % reduction in pJ/b consumption and significant improvements in energy proportionality metrics, notably outperforming existing strategies.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122095257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simultaneous Transistor Folding and Placement in Standard Cell Layout Synthesis
Pub Date: 2021-11-01 | DOI: 10.1109/ICCAD51958.2021.9643537
Kyeonghyeon Baek, Taewhan Kim
The three major tasks in standard cell layout synthesis are transistor folding, transistor placement, and in-cell routing. These tasks are tightly inter-related but are generally performed one at a time to reduce the extremely high complexity of the design space. In this paper, we propose an integrated approach to the two problems of transistor folding and placement. Specifically, we propose a globally optimal algorithm for search-tree-based design space exploration, devising a set of effective speed-up techniques as well as dynamic-programming-based fast cost computation. In addition, our algorithm incorporates the minimum OD (oxide diffusion) jog constraint, which depends closely on both transistor folding and placement. To our knowledge, this is the first work to solve the two problems simultaneously. Through experiments with the transistor netlists and design rules of the ASAP 7nm library, we show that our method synthesizes fully routable cell layouts of minimal size within 1 second per netlist, outperforming the cell layout quality of the ASAP 7nm library, whose layouts may otherwise take several hours or days of manual effort to reach comparable quality.
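At its core, the folding sub-problem reduces to arithmetic the joint search must re-evaluate for every candidate: a transistor of width W under a per-finger width limit Wmax becomes ceil(W / Wmax) fingers, and the finger count fixes the diffusion columns the placer has to legalize. A toy version with invented numbers:

```python
# Fold a transistor width into legal fingers under a per-finger width limit.
import math

def fold(width_nm: int, max_finger_nm: int):
    fingers = math.ceil(width_nm / max_finger_nm)
    per_finger = width_nm / fingers          # even split across fingers
    return fingers, per_finger

for w in (350, 700, 1500):
    f, wf = fold(w, max_finger_nm=600)
    print(f"W={w}nm -> {f} finger(s) of {wf:.0f}nm each")
```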
{"title":"Simultaneous Transistor Folding and Placement in Standard Cell Layout Synthesis","authors":"Kyeonghyeon Baek, Taewhan Kim","doi":"10.1109/ICCAD51958.2021.9643537","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643537","url":null,"abstract":"The three major tasks in standard cell layout synthesis are transistor folding, transistor placement, and in-cell routing, which are tightly inter-related, but generally performed one at a time to reduce the extremely high complexity of design space. In this paper, we propose an integrated approach to the two problems of transistor folding and placement. Precisely, we propose a globally optimal algorithm of search tree based design space exploration, devising a set of effective speeding up techniques as well as dynamic programming based fast cost computation. In addition, our algorithm incorporates the minimum OD (oxide diffusion) jog constraint, which closely relies on both of transistor folding and placement. To our knowledge, this is the first work that tries to simultaneously solve the two problems. Through experiments with the transistor netlists and design rules in the ASAP 7nm library, it is shown that our proposed method is able to synthesize fully routable cell layouts of minimal size within 1 second for each netlist, outperforming the cell layout quality in the ASAP 7nm library, which otherwise, may take several hours or days to manually complete layouts of the quality level comparable to ours.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126996354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}