Split Compilation for Security of Quantum Circuits
Pub Date: 2021-11-01 | DOI: 10.1109/ICCAD51958.2021.9643478
Abdullah Ash-Saki, A. Suresh, R. Topaloglu, Swaroop Ghosh
An efficient quantum circuit (program) compiler aims to minimize the gate count, through efficient instruction translation, routing, and gate cancellation, to improve run-time and noise. Therefore, a high-efficiency compiler is paramount to enable the game-changing promises of quantum computers. To date, quantum computing hardware providers offer software stacks supporting their own hardware. However, several third-party software toolchains, including compilers, are emerging. They support hardware from different vendors and potentially offer better efficiency. As the quantum computing ecosystem becomes more popular and practical, it is only prudent to assume that more companies will start offering software-as-a-service for quantum computers, including high-performance compilers. With the emergence of third-party compilers, security and privacy issues for quantum intellectual property (IP) will follow. A quantum circuit can encode sensitive information such as critical financial analyses and proprietary algorithms. Therefore, submitting quantum circuits to untrusted compilers creates opportunities for adversaries to steal IP. In this paper, we present a split compilation methodology to secure IP from untrusted compilers while still taking advantage of their optimizations. In this methodology, a quantum circuit is split into multiple parts that are sent to a single compiler at different times or to multiple compilers, so that any single adversary has access to only partial information. Analyzing over 152 circuits on three IBM hardware architectures, we demonstrate that split compilation can completely secure IP (when multiple compilers are used) or can force factorial-time reconstruction complexity, while incurring a modest overhead (~3% to ~6% on average).
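To make the splitting idea concrete, here is a minimal sketch in Python, assuming a plain gate-list circuit representation; `untrusted_compile` is a hypothetical stand-in for a third-party compiler service, and the contiguous-slice policy is a simplification, not the paper's exact splitting scheme.

```python
# Split a circuit's gate sequence and send each slice to a separate
# (untrusted) compiler, so no single compiler sees the whole circuit.
from typing import List, Tuple

Gate = Tuple[str, Tuple[int, ...]]  # (gate name, qubit indices)

def split_circuit(gates: List[Gate], n_parts: int) -> List[List[Gate]]:
    """Cut the gate sequence into contiguous slices, one per compiler."""
    size = -(-len(gates) // n_parts)  # ceiling division
    return [gates[i:i + size] for i in range(0, len(gates), size)]

def untrusted_compile(part: List[Gate]) -> List[Gate]:
    # Placeholder: a real service would optimize/route the sub-circuit.
    return part

def split_compile(gates: List[Gate], n_parts: int) -> List[Gate]:
    parts = split_circuit(gates, n_parts)
    compiled = [untrusted_compile(p) for p in parts]  # each sees one slice only
    return [g for part in compiled for g in part]     # trusted client stitches

toy_circuit = [("h", (0,)), ("cx", (0, 1)), ("rz", (2,)), ("cx", (1, 2))]
print(split_compile(toy_circuit, n_parts=2))
```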
{"title":"Split Compilation for Security of Quantum Circuits","authors":"Abdullah Ash-Saki, A. Suresh, R. Topaloglu, Swaroop Ghosh","doi":"10.1109/ICCAD51958.2021.9643478","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643478","url":null,"abstract":"An efficient quantum circuit (program) compiler aims to minimize the gate-count - through efficient instruction translation, routing, gate, and cancellation - to improve run-time and noise. Therefore, a high-efficiency compiler is paramount to enable the game-changing promises of quantum computers. To date, the quantum computing hardware providers are offering a software stack supporting their hardware. However, several third-party software toolchains, including compilers, are emerging. They support hardware from different vendors and potentially offer better efficiency. As the quantum computing ecosystem becomes more popular and practical, it is only prudent to assume that more companies will start offering software-as-a-service for quantum computers, including high-performance compilers. With the emergence of third-party compilers, the security and privacy issues of quantum intellectual properties (IPs) will follow. A quantum circuit can include sensitive information such as critical financial analysis and proprietary algorithms. Therefore, submitting quantum circuits to untrusted compilers creates opportunities for adversaries to steal IPs. In this paper, we present a split compilation methodology to secure IPs from untrusted compilers while taking advantage of their optimizations. In this methodology, a quantum circuit is split into multiple parts that are sent to a single compiler at different times or to multiple compilers. In this way, the adversary has access to partial information. With analysis of over 152 circuits on three IBM hardware architectures, we demonstrate the split compilation methodology can completely secure IPs (when multiple compilers are used) or can introduce factorial time reconstruction complexity while incurring a modest overhead (~ 3% to ~ 6% on average).","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126820779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Binarized SNNs: Efficient and Error-Resilient Spiking Neural Networks through Binarization
Pub Date: 2021-11-01 | DOI: 10.1109/ICCAD51958.2021.9643463
M. Wei, Mikail Yayla, S. Ho, Jian-Jia Chen, Chia-Lin Yang, H. Amrouch
Spiking Neural Networks (SNNs) are considered the third generation of NNs and can reach accuracy similar to conventional deep NNs with considerably better efficiency. However, to achieve high accuracy, state-of-the-art SNNs employ stochastic spike coding of the inputs, requiring multiple cycles of computation. Because of this, and due to the nature of analog computing, the charges of multiple cycles must be accumulated and held, necessitating a large membrane capacitor. This results in high energy, long latency, and expensive area costs, constituting one of the major bottlenecks in analog SNN implementations. The membrane capacitor size determines the precision of the firing time; hence, reducing the capacitor size considerably degrades inference accuracy. To alleviate this, we focus on bridging the gap between binarized NNs (BNNs) and SNNs. BNNs are rapidly emerging as an attractive alternative to conventional NNs due to their high efficiency and error tolerance. In this work, we evaluate the impact of deploying error-resilient BNNs, i.e., BNNs proactively trained in the presence of errors, on analog implementations of SNNs. We show that with BNNs, the capacitor size and latency can be reduced significantly compared to state-of-the-art SNNs, which employ multi-bit models. Our experiments demonstrate that when error-resilient BNNs are deployed on an analog SNN accelerator, the membrane capacitor size is reduced by 50%, inference latency decreases by two orders of magnitude, and energy is reduced by 57% compared to a baseline 4-bit SNN implementation, at minimal accuracy cost.
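As a side note on why binarization is so hardware-friendly, the sketch below (our own illustration, not the paper's training flow) shows the standard identity that a dot product over ±1 values reduces to XNOR plus popcount on bit-packed operands.

```python
# For w, x in {-1, +1}^N:  w . x = N - 2 * popcount(bits(w) XOR bits(x)),
# since each equal bit pair contributes +1 and each differing pair -1.
import numpy as np

rng = np.random.default_rng(0)
w = np.where(rng.standard_normal(64) > 0, 1, -1)  # binarized weights
x = np.where(rng.standard_normal(64) > 0, 1, -1)  # binarized activations

ref = int(w @ x)                     # reference dot product on +/-1 values

wb = (w > 0).astype(np.uint8)        # map +1 -> 1, -1 -> 0
xb = (x > 0).astype(np.uint8)
xnor_dot = len(w) - 2 * int(np.count_nonzero(wb ^ xb))

assert xnor_dot == ref
print(ref, xnor_dot)
```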
{"title":"Binarized SNNs: Efficient and Error-Resilient Spiking Neural Networks through Binarization","authors":"M. Wei, Mikail Yayla, S. Ho, Jian-Jia Chen, Chia-Lin Yang, H. Amrouch","doi":"10.1109/ICCAD51958.2021.9643463","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643463","url":null,"abstract":"Spiking Neural Networks (SNNs) are considered the third generation of NNs and can reach similar accuracy as conventional deep NNs, but with a considerable improvement in efficiency. However, to achieve high accuracy, state-of-the-art SNNs employ stochastic spike coding of the inputs, requiring multiple cycles of computation. Because of this and due to the nature of analog computing, it is required to accumulate and hold the charges of multiple cycles, necessitating a large membrane capacitor. This results in high energy, long latency, and expensive area costs, constituting one of the major bottlenecks in analog SNN implementations. Membrane capacitor size determines the precision of the firing time. Hence reducing the capacitor size considerably degrades the inference accuracy. To alleviate this, we focus on bridging the gap between binarized NNs (BNNs) and SNNs. BNNs are rapidly emerging as an attractive alternative for NNs due to their high efficiency and error tolerance. In this work, we evaluate the impact of deploying error-resilient BNNs, i.e. BNNs that have been proactively trained in the presence of errors, on analog implementation of SNNs. We show that for BNNs, the capacitor size and latency can be reduced significantly compared to state-of-the-art SNNs, which employ multi-bit models. Our experiments demonstrate that when error-resilient BNNs are deployed on analog-based SNN accelerator, the size of the membrane capacitor is reduced by 50%, the inference latency is decreased by two orders of magnitude, and energy is reduced by 57% compared to the baseline 4-bit SNN implementation, under minimal accuracy cost.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128199812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HASHTAG: Hash Signatures for Online Detection of Fault-Injection Attacks on Deep Neural Networks
Pub Date: 2021-11-01 | DOI: 10.1109/ICCAD51958.2021.9643556
Mojan Javaheripi, F. Koushanfar
We propose Hashtag, the first framework that enables high-accuracy detection of fault-injection attacks on Deep Neural Networks (DNNs) with provable bounds on detection performance. Recent literature on fault-injection attacks shows severe DNN accuracy degradation caused by bit flips. In this scenario, the attacker changes a few weight bits during DNN execution by tampering with the program's DRAM memory. To detect runtime bit flips, Hashtag extracts a unique signature from the benign DNN prior to deployment. The signature is later used to validate the integrity of the DNN and verify the inference output on the fly. We propose a novel sensitivity analysis scheme that accurately identifies the DNN layers most vulnerable to fault injection. The DNN signature is then constructed by encoding the weights in the vulnerable layers using a low-collision hash function. When the DNN is deployed, new hashes are extracted from the target layers during inference and compared against the ground-truth signatures. Hashtag incorporates a lightweight methodology that ensures low-overhead, real-time fault detection on embedded platforms. Extensive evaluations with the state-of-the-art bit-flip attack on various DNNs demonstrate the competitive advantage of Hashtag in terms of both attack detection and execution overhead.
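A minimal sketch of the signature-check mechanism described above, assuming the vulnerable layers have already been identified by the sensitivity analysis; SHA-256 stands in for the paper's low-collision hash, and all layer names and shapes are illustrative.

```python
# Offline: hash the weights of the vulnerable layers of the benign model.
# Online: re-hash the same layers and compare before trusting the output.
import hashlib
import numpy as np

def layer_signature(weights: np.ndarray) -> str:
    return hashlib.sha256(weights.tobytes()).hexdigest()

rng = np.random.default_rng(1)
vulnerable_layers = {"conv3": rng.standard_normal((16, 16)).astype(np.float32)}

# Ground-truth signatures recorded prior to deployment.
signatures = {name: layer_signature(w) for name, w in vulnerable_layers.items()}

# Simulate a fault injection: flip one bit of one weight in memory.
attacked = vulnerable_layers["conv3"].copy()
flat = attacked.reshape(-1).view(np.uint32)
flat[7] ^= 1 << 30  # single bit flip in one float32 word

ok = layer_signature(attacked) == signatures["conv3"]
print("integrity check passed" if ok else "bit-flip detected")
```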
{"title":"HASHTAG: Hash Signatures for Online Detection of Fault-Injection Attacks on Deep Neural Networks","authors":"Mojan Javaheripi, F. Koushanfar","doi":"10.1109/ICCAD51958.2021.9643556","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643556","url":null,"abstract":"We propose Hashtag, the first framework that enables high-accuracy detection of fault-injection attacks on Deep Neural Networks (DNNs) with provable bounds on detection performance. Recent literature in fault-injection attacks shows the severe DNN accuracy degradation caused by bit flips. In this scenario, the attacker changes a few weight bits during DNN execution by tampering with the program's DRAM memory. To detect runtime bit flips, Hashtag extracts a unique signature from the benign DNN prior to deployment. The signature is later used to validate the integrity of the DNN and verify the inference output on the fly. We propose a novel sensitivity analysis scheme that accurately identifies the most vulnerable DNN layers to the fault-injection attack. The DNN signature is then constructed by encoding the underlying weights in the vulnerable layers using a low-collision hash function. When the DNN is deployed, new hashes are extracted from the target layers during inference and compared against the ground-truth signatures. Hashtag incorporates a lightweight methodology that ensures a low-overhead and real-time fault detection on embedded platforms. Extensive evaluations with the state-of-the-art bit-flip attack on various DNNs demonstrate the competitive advantage of Hashtag in terms of both attack detection and execution overhead.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"130 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123862305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DARe: DropLayer-Aware Manycore ReRAM architecture for Training Graph Neural Networks
Pub Date: 2021-11-01 | DOI: 10.1109/ICCAD51958.2021.9643511
Aqeeb Iqbal Arka, Biresh Kumar Joardar, J. Doppa, P. Pande, K. Chakrabarty
Graph Neural Networks (GNNs) are a variant of Deep Neural Networks (DNNs) that operate on graphs, combining attributes of both DNNs and graph computation. However, training GNNs on manycore architectures is challenging because it involves heavy communication that bottlenecks performance. DropEdge and Dropout, which we collectively refer to as DropLayer, are regularization techniques that can improve the predictive accuracy of GNNs; moreover, when implemented on a manycore architecture, they are also capable of reducing on-chip traffic. In this paper, we present DARe, a ReRAM-based 3D manycore architecture tailored for accelerating on-chip training of GNNs. The key component of the DARe architecture is a Network-on-Chip (NoC) that reduces the amount of communication using DropLayer; the reduced traffic prevents communication hotspots and leads to better performance. We demonstrate that DARe outperforms conventional GPUs by up to 6.7X (5.6X on average) in execution time, while being up to 30X (23X on average) more energy efficient for GNN training.
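For concreteness, here is a small sketch of DropEdge, one of the two DropLayer techniques: each training iteration keeps only a random subset of edges, which is precisely what cuts message traffic when the computation is mapped onto a NoC. The graph and drop rate below are invented.

```python
# DropEdge: sample a random edge subset per training iteration.
import numpy as np

def drop_edge(edge_index: np.ndarray, p_drop: float, rng) -> np.ndarray:
    """edge_index: (2, E) array of (src, dst) pairs; returns a sampled subset."""
    keep = rng.random(edge_index.shape[1]) >= p_drop
    return edge_index[:, keep]

rng = np.random.default_rng(0)
edges = np.array([[0, 0, 1, 2, 3],
                  [1, 2, 2, 3, 0]])  # toy 4-node graph, 5 directed edges
for epoch in range(3):
    sampled = drop_edge(edges, p_drop=0.4, rng=rng)
    print(f"epoch {epoch}: {sampled.shape[1]}/{edges.shape[1]} edges kept")
```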
{"title":"DARe: DropLayer-Aware Manycore ReRAM architecture for Training Graph Neural Networks","authors":"Aqeeb Iqbal Arka, Biresh Kumar Joardar, J. Doppa, P. Pande, K. Chakrabarty","doi":"10.1109/ICCAD51958.2021.9643511","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643511","url":null,"abstract":"Graph Neural Networks (GNNs) are a variant of Deep Neural Networks (DNNs) operating on graphs. GNNs have attributes of both DNNs and graph computation. However, training GNNs on manycore architectures is a challenging task because it involves heavy communication that bottlenecks performance. DropEdge and Dropout, which we collectively refer to as DropLayer, are regularization techniques that can improve the predictive accuracy of GNNs. Moreover, when implemented on a manycore architecture, DropEdge and Dropout are capable of reducing the on-chip traffic. In this paper, we present a ReRAM-based 3D manycore architecture called DARe, tailored for accelerating on-chip training of GNNs. The key component of the DARe architecture is a Network-on-Chip (NoC) that reduces the amount of communication using DropLayer. The reduced traffic prevents communication hotspots and leads to better performance. We demonstrate that DARe outperforms conventional GPUs by up to 6.7X (5.6X on average) in terms of execution time, while being up to 30X (23X on average) more energy efficient for GNN training.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116133461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
2021 ICCAD CAD Contest Problem C: GPU Accelerated Logic Rewriting
Pub Date: 2021-11-01 | DOI: 10.1109/ICCAD51958.2021.9643521
G. Pasandi, Sreedhar Pratty, David Brown, Yanqing Zhang, Haoxing Ren, Brucek Khailany
Logic rewriting is an important optimization that can improve Quality of Results (QoR) in modern VLSI circuits. It is usually a greedy procedure involving steps such as graph traversal, cut computation and ranking, and functional matching. For logic rewriting to be effective in improving QoR, many local rewriting iterations are needed, which can be very slow for industrial-scale benchmark circuits. One effective way to speed up logic rewriting is to offload its time-consuming steps to Graphics Processing Units (GPUs) to benefit from the massively parallel computation available there. In this regard, the present contest problem studies the possibility of using GPUs to accelerate a classical logic rewriting function. State-of-the-art large-scale open-source benchmark circuits as well as industrial-level designs will be used to test the GPU-accelerated logic rewriting function.
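To convey the per-node work a rewriting pass repeats millions of times, here is a sketch of classic k-feasible cut enumeration over an AND-graph; ranking and functional matching are omitted, and this is a generic textbook formulation rather than the contest's reference code.

```python
# Bottom-up cut enumeration: cuts(n) = trivial cut {n} plus every union of a
# cut of each fanin that stays within k leaves and is not dominated (a cut is
# dominated if another cut is a subset of it).
from itertools import product

def enumerate_cuts(nodes, k=4):
    """nodes: dict node -> (fanin0, fanin1), or None for primary inputs.
    Assumes the dict is in topological order (inputs first)."""
    cuts = {}
    for n, fanins in nodes.items():
        if fanins is None:
            cuts[n] = [{n}]
            continue
        a, b = fanins
        merged = [{n}]  # the trivial cut
        for c0, c1 in product(cuts[a], cuts[b]):
            c = c0 | c1
            if len(c) <= k and not any(d <= c for d in merged):
                merged.append(c)
        cuts[n] = merged
    return cuts

# Tiny AIG: i1, i2, i3 are inputs; n4 = i1 & i2, n5 = n4 & i3.
aig = {"i1": None, "i2": None, "i3": None,
       "n4": ("i1", "i2"), "n5": ("n4", "i3")}
for node, cs in enumerate_cuts(aig).items():
    print(node, [sorted(c) for c in cs])
```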
{"title":"2021 ICCAD CAD Contest Problem C: GPU Accelerated Logic Rewriting","authors":"G. Pasandi, Sreedhar Pratty, David Brown, Yanqing Zhang, Haoxing Ren, Brucek Khailany","doi":"10.1109/ICCAD51958.2021.9643521","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643521","url":null,"abstract":"Logic rewriting is an important optimization function that can improve Quality of Results (QoR) in modern VLSI circuits. This optimization function usually has a greedy approach and involves steps such as graph traversal, cut computation and ranking, and functional matching. For logic rewriting to be effective in improving the QoR, there should be many local rewriting iterations which can be very slow for industrial level benchmark circuits. One effective solution to speed up the logic rewriting operation is to upload its time consuming steps to Graphics Processing Units (GPUs) to benefit from massively parallel computations that is available there. In this regard, the present contest problem studies the possibility of using GPUs in accelerating a classical logic rewriting function. State-of-the-art large-scale open-source benchmark circuits as well as industrial-level designs will be used to test the GPU accelerated logic rewriting function.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121496279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Circuit Deobfuscation from Power Side-Channels using Pseudo-Boolean SAT
Pub Date: 2021-11-01 | DOI: 10.1109/ICCAD51958.2021.9643495
Kaveh Shamsi, Yier Jin
The problem of inferring the values of internal nets in a circuit from its power side-channels has been the topic of extensive research over the past two decades, with several frameworks developed, mostly focusing on cryptographic hardware. In this paper, we focus on breaking logic locking, a technique in which an original circuit is made ambiguous by inserting unknown “key” bits into it, via power side-channels. We present a pair of attack algorithms we term PowerSAT attacks, which take in arbitrary keyed circuits and resolve key information by interacting adaptively with a side-channel “oracle”. They are based on the query-by-disagreement scheme used in functional SAT attacks against locking, but utilize pseudo-Boolean constraints to allow reasoning about Hamming-weight power models. We present a software implementation of the attacks along with techniques for speeding them up, as well as simulation- and FPGA-based experiments. Notably, we demonstrate the extraction of a 32-bit key from a comparator circuit with a $2^{31}$ functional query complexity in ~64 chosen power side-channel queries using the PowerSAT attack, where traditional CPA fails given 1000 random traces. We release a binary of our implementation along with the FPGA + scope HDL/setup used for the experiments.
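A toy version of the query model behind the attack, assuming a leakage oracle that reports only the Hamming weight of one key-dependent byte: each chosen query prunes the set of keys consistent with all observations so far. The real PowerSAT attack encodes this pruning as pseudo-Boolean SAT constraints over keyed netlists and picks queries adaptively by disagreement; this brute-force sweep is only illustrative.

```python
# `internal_value` is a hypothetical key-dependent net (a simple XOR here).
def internal_value(key: int, x: int) -> int:
    return (key ^ x) & 0xFF

def hw(v: int) -> int:
    return bin(v).count("1")   # Hamming weight of an 8-bit value

secret = 0xA7                  # unknown to the attacker
candidates = set(range(256))   # all possible 8-bit keys
queries = 0
for x in range(256):           # fixed sweep; the real attack chooses x adaptively
    if len(candidates) == 1:
        break
    leak = hw(internal_value(secret, x))   # observed power-model output
    candidates = {k for k in candidates if hw(internal_value(k, x)) == leak}
    queries += 1
print(f"recovered key(s) {sorted(candidates)} after {queries} queries")
```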
{"title":"Circuit Deobfuscation from Power Side-Channels using Pseudo-Boolean SAT","authors":"Kaveh Shamsi, Yier Jin","doi":"10.1109/ICCAD51958.2021.9643495","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643495","url":null,"abstract":"The problem of inferring the value of internal nets in a circuit from its power side-channels has been the topic of extensive research over the past two decades, with several frameworks developed mostly focusing on cryptographic hardware. In this paper, we focus on the problem of breaking logic locking, a technique in which an original circuit is made ambiguous by inserting unknown “key” bits into it, via power side-channels. We present a pair of attack algorithms we term PowerSAT attacks, which take in arbitrary keyed circuits and resolve key information by interacting adaptively with a side-channel “oracle”. They are based on the query-by-disagreement scheme used in functional SAT attacks against locking but utilize Psuedo-Boolean constraints to allow for reasoning about hamming-weight power models. We present a software implementation of the attacks along with techniques for speeding them up. We present simulation and FPGA-based experiments as well. Notably, we demonstrate the extraction of a 32-bit key from a comparator circuit with a $2^{31}$ functional query complexity, in $sim 64$ chosen power side-channel queries using the PowerSAT attack, where traditional CPA fails given 1000 random traces. We release a binary of our implementation along with the FPGA $+mathbf{scope} mathbf{HDL}/mathbf{setup}$ used for the experiments.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"153 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122688011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A High-Performance Accelerator for Super-Resolution Processing on Embedded GPU
Pub Date: 2021-11-01 | DOI: 10.1109/ICCAD51958.2021.9643472
W. Zhao, Qi Sun, Yang Bai, Wenbo Li, Haisheng Zheng, Bei Yu, Martin D. F. Wong
Recent years have witnessed impressive progress in super-resolution (SR) processing. However, its real-time inference requirement poses a challenge not only for model design but also for on-chip implementation. In this paper, we implement a full-stack SR acceleration framework on embedded GPU devices. We analyze in detail the dictionary learning algorithm used in SR models and accelerate it via a novel dictionary-selective strategy. In addition, we analyze the hardware programming architecture together with the model structure to guide the optimal design of the computation kernels, minimizing inference latency under the resource constraints. With these techniques, the communication and computation bottlenecks in deep dictionary-learning-based SR models are effectively tackled. Experiments on the embedded NVIDIA NX and the 2080Ti show that our method significantly outperforms the state-of-the-art NVIDIA TensorRT and achieves real-time performance.
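The abstract does not spell out the selective strategy, but a generic reading of "dictionary selection" is: code each patch against only its top-k most correlated atoms, shrinking the per-patch linear algebra. The sketch below is our own reconstruction under that assumption, with invented sizes, not the paper's kernels.

```python
# Select the k dictionary atoms best correlated with a patch, then solve a
# much smaller least-squares problem instead of coding against all atoms.
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 512))                 # 512 atoms of dimension 64
D /= np.linalg.norm(D, axis=0, keepdims=True)      # unit-norm atoms
patch = rng.standard_normal(64)

k = 32
scores = np.abs(D.T @ patch)                       # correlation with the patch
top = np.argsort(scores)[-k:]                      # indices of the k best atoms
D_sel = D[:, top]                                  # reduced 64x32 problem
coeffs, *_ = np.linalg.lstsq(D_sel, patch, rcond=None)
recon = D_sel @ coeffs
print("relative error:", np.linalg.norm(patch - recon) / np.linalg.norm(patch))
```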
{"title":"A High-Performance Accelerator for Super-Resolution Processing on Embedded GPU","authors":"W. Zhao, Qi Sun, Yang Bai, Wenbo Li, Haisheng Zheng, Bei Yu, Martin D. F. Wong","doi":"10.1109/ICCAD51958.2021.9643472","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643472","url":null,"abstract":"Recent years have witnessed impressive progress in super-resolution (SR) processing. However, its real-time inference requirement sets a challenge not only for the model design but also for the on-chip implementation. In this paper, we implement a full-stack SR acceleration framework on embedded GPU devices. The special dictionary learning algorithm used in SR models was analyzed in detail and accelerated via a novel dictionary selective strategy. Besides, the hardware programming architecture together with the model structure is analyzed to guide the optimal design of computation kernels to minimize the inference latency under the resource constraints. With these novel techniques, the communication and computation bottlenecks in the deep dictionary learning-based SR models are tackled perfectly. The experiments on the edge embedded NVIDIA NX and 2080Ti show that our method outperforms the state-of-the-art NVIDIA TensorRT significantly and can achieve real-time performance.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122722302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GraphLily: Accelerating Graph Linear Algebra on HBM-Equipped FPGAs
Pub Date: 2021-11-01 | DOI: 10.1109/ICCAD51958.2021.9643582
Yuwei Hu, Yixiao Du, Ecenur Ustun, Zhiru Zhang
Graph processing is typically memory bound due to a low compute-to-memory-access ratio and irregular data access patterns. The emerging high-bandwidth memory (HBM) delivers exceptional bandwidth by providing multiple channels that can service memory requests concurrently, thus bringing the potential to significantly boost the performance of graph processing. This paper proposes GraphLily, a graph linear algebra overlay, to accelerate graph processing on HBM-equipped FPGAs. GraphLily supports a rich set of graph algorithms by adopting the GraphBLAS programming interface, which formulates graph algorithms as sparse linear algebra operations. GraphLily provides efficient, memory-optimized accelerators for the two widely-used kernels in GraphBLAS, namely, sparse-matrix dense-vector multiplication (SpMV) and sparse-matrix sparse-vector multiplication (SpMSpV). The SpMV accelerator uses a sparse matrix storage format tailored to HBM that enables streaming, vectorized accesses to each channel and concurrent accesses to multiple channels. Besides, the SpMV accelerator exploits data reuse in accesses of the dense vector by introducing a scalable on-chip buffer design. The SpMSpV accelerator complements the SpMV accelerator to handle cases where the input vector has high sparsity. GraphLily further builds a middleware to provide runtime support. With this middleware, we can port existing GraphBLAS programs to FPGAs with slight modifications to the original code intended for CPU/GPU execution. Evaluation shows that compared with state-of-the-art graph processing frameworks on CPUs and GPUs, GraphLily achieves up to 2.5x and 1.1x higher throughput, while reducing energy consumption by 8.1x and 2.4x; compared with prior single-purpose graph accelerators on FPGAs, GraphLily achieves 1.2x-1.9x higher throughput.
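The GraphBLAS idea that GraphLily accelerates is that graph traversal becomes repeated sparse-matrix/vector products. As a software reference for what the SpMV kernel computes, here is a BFS level sweep written with scipy; the graph is a toy example.

```python
# BFS as repeated SpMV: one sparse matrix-vector product per frontier level.
import numpy as np
import scipy.sparse as sp

# Adjacency of a small directed graph: A[i, j] = 1 means an edge i -> j.
A = sp.csr_matrix(np.array([
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]))

frontier = np.array([1, 0, 0, 0])       # start BFS at vertex 0
visited = frontier.astype(bool)
level = 0
while frontier.any():
    reached = A.T @ frontier            # one SpMV per BFS level
    nxt = (reached > 0) & ~visited      # mask out already-visited vertices
    visited |= nxt
    frontier = nxt.astype(int)
    level += 1
print("BFS sweeps:", level, "visited:", visited)
```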
{"title":"GraphLily: Accelerating Graph Linear Algebra on HBM-Equipped FPGAs","authors":"Yuwei Hu, Yixiao Du, Ecenur Ustun, Zhiru Zhang","doi":"10.1109/ICCAD51958.2021.9643582","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643582","url":null,"abstract":"Graph processing is typically memory bound due to low compute to memory access ratio and irregular data access pattern. The emerging high-bandwidth memory (HBM) delivers exceptional bandwidth by providing multiple channels that can service memory requests concurrently, thus bringing the potential to significantly boost the performance of graph processing. This paper proposes GraphLily, a graph linear algebra overlay, to accelerate graph processing on HBM-equipped FPGAs. GraphLily supports a rich set of graph algorithms by adopting the GraphBLAS programming interface, which formulates graph algorithms as sparse linear algebra operations. GraphLily provides efficient, memory-optimized accelerators for the two widely-used kernels in GraphBLAS, namely, sparse-matrix dense-vector multiplication (SpMV) and sparse-matrix sparse-vector multiplication (SpMSpV). The SpMV accelerator uses a sparse matrix storage format tailored to HBM that enables streaming, vectorized accesses to each channel and concurrent accesses to multiple channels. Besides, the SpMV accelerator exploits data reuse in accesses of the dense vector by introducing a scalable on-chip buffer design. The SpMSpV accelerator complements the SpMV accelerator to handle cases where the input vector has a high sparsity. GraphLily further builds a middleware to provide runtime support. With this middleware, we can port existing GraphBLAS programs to FPGAs with slight modifications to the original code intended for CPU/GPU execution. Evaluation shows that compared with state-of-the-art graph processing frameworks on CPUs and GPUs, GraphLily achieves up to 2.5 x and 1.1 x higher throughput, while reducing the energy consumption by 8.1 x and 2.4 x; compared with prior single-purpose graph accelerators on FPGAs, GraphLily achieves 1.2 x -1.9 x higher throughput.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124845755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Traffic-Adaptive Power Reconfiguration for Energy-Efficient and Energy-Proportional Optical Interconnects
Pub Date: 2021-11-01 | DOI: 10.1109/ICCAD51958.2021.9643475
Yuyang Wang, K. Cheng
Silicon microring-based optical interconnects offer great potential for high-bandwidth data communication in future datacenters and high-performance computing systems. However, the lack of effective runtime power management strategies for optical links, especially during idle or low-utilization periods, is devastating to the energy efficiency and energy proportionality of the network. In this study, we propose Polestar (POwer LEvel Scaling with Traffic-Adaptive Reconfiguration) for microring-based optical interconnects. Polestar offers a collection of runtime reconfiguration strategies that target the power states of the lasers and the microring tuning circuitry. The power-state reconfiguration mechanism is traffic-adaptive, exploiting the trade-off between energy saving and application execution time. Evaluation of Polestar with production datacenter traces demonstrates up to 87% reduction in pJ/b consumption and significant improvements in energy-proportionality metrics, notably outperforming existing strategies.
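As a condensed illustration of traffic-adaptive power-level scaling, the sketch below chooses among hypothetical link power states based on a sliding window of utilization; the states, thresholds, and window policy are invented for illustration and are not Polestar's actual reconfiguration rules.

```python
# Pick the lowest-power link state that recent traffic can justify.
from collections import deque

POWER_STATES = [            # (name, relative power, wake-up penalty in cycles)
    ("active", 1.00, 0),
    ("laser_dim", 0.55, 40),
    ("deep_sleep", 0.10, 800),
]

def choose_state(window):
    util = sum(window) / len(window)
    if util > 0.30:
        return POWER_STATES[0]
    if util > 0.02:
        return POWER_STATES[1]
    return POWER_STATES[2]

window = deque(maxlen=16)   # sliding window of per-epoch link utilization
for util in [0.8, 0.6, 0.2, 0.05, 0.01, 0.0, 0.0, 0.4]:
    window.append(util)
    name, power, penalty = choose_state(window)
    print(f"util={util:.2f} -> {name} (power={power:.2f}, wake={penalty} cyc)")
```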
{"title":"Traffic-Adaptive Power Reconfiguration for Energy-Efficient and Energy-Proportional Optical Interconnects","authors":"Yuyang Wang, K. Cheng","doi":"10.1109/ICCAD51958.2021.9643475","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643475","url":null,"abstract":"Silicon microring-based optical interconnects offer great potential for high-bandwidth data communication in future datacenters and high-performance computing systems. However, a lack of effective runtime power management strategies for optical links, especially during idle or low-utilization periods, is devastating to the energy efficiency and the energy proportionality of the network. In this study, we propose Polestar, i.e., POwer LEvel Scaling with Traffic-Adaptive Reconfiguration, for microring-based optical interconnects. Polestar offers a collection of runtime reconfiguration strategies that target the power states of the lasers and the microring tuning circuitry. The reconfiguration mechanism of the power states is traffic-adaptive for exploiting the trade-off between energy saving and application execution time. The evaluation of Polestar with production datacenter traces demonstrates up to 87 % reduction in pJ/b consumption and significant improvements in energy proportionality metrics, notably outperforming existing strategies.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122095257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simultaneous Transistor Folding and Placement in Standard Cell Layout Synthesis
Pub Date: 2021-11-01 | DOI: 10.1109/ICCAD51958.2021.9643537
Kyeonghyeon Baek, Taewhan Kim
The three major tasks in standard cell layout synthesis are transistor folding, transistor placement, and in-cell routing. These tasks are tightly inter-related but are generally performed one at a time to reduce the extremely high complexity of the design space. In this paper, we propose an integrated approach to the two problems of transistor folding and placement. Specifically, we propose a globally optimal algorithm for search-tree-based design space exploration, devising a set of effective speed-up techniques as well as dynamic-programming-based fast cost computation. In addition, our algorithm incorporates the minimum OD (oxide diffusion) jog constraint, which depends closely on both transistor folding and placement. To our knowledge, this is the first work to solve the two problems simultaneously. Through experiments with the transistor netlists and design rules of the ASAP 7nm library, we show that our method synthesizes fully routable cell layouts of minimal size within 1 second per netlist, outperforming the cell layout quality of the ASAP 7nm library, whose layouts may otherwise take several hours or days of manual effort to reach comparable quality.
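At its core, the folding sub-problem reduces to arithmetic the joint search must re-evaluate for every candidate: a transistor of width W under a per-finger width limit Wmax becomes ceil(W / Wmax) fingers, and the finger count fixes the diffusion columns the placer has to legalize. A toy version with invented numbers:

```python
# Fold a transistor width into legal fingers under a per-finger width limit.
import math

def fold(width_nm: int, max_finger_nm: int):
    fingers = math.ceil(width_nm / max_finger_nm)
    per_finger = width_nm / fingers          # even split across fingers
    return fingers, per_finger

for w in (350, 700, 1500):
    f, wf = fold(w, max_finger_nm=600)
    print(f"W={w}nm -> {f} finger(s) of {wf:.0f}nm each")
```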
{"title":"Simultaneous Transistor Folding and Placement in Standard Cell Layout Synthesis","authors":"Kyeonghyeon Baek, Taewhan Kim","doi":"10.1109/ICCAD51958.2021.9643537","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643537","url":null,"abstract":"The three major tasks in standard cell layout synthesis are transistor folding, transistor placement, and in-cell routing, which are tightly inter-related, but generally performed one at a time to reduce the extremely high complexity of design space. In this paper, we propose an integrated approach to the two problems of transistor folding and placement. Precisely, we propose a globally optimal algorithm of search tree based design space exploration, devising a set of effective speeding up techniques as well as dynamic programming based fast cost computation. In addition, our algorithm incorporates the minimum OD (oxide diffusion) jog constraint, which closely relies on both of transistor folding and placement. To our knowledge, this is the first work that tries to simultaneously solve the two problems. Through experiments with the transistor netlists and design rules in the ASAP 7nm library, it is shown that our proposed method is able to synthesize fully routable cell layouts of minimal size within 1 second for each netlist, outperforming the cell layout quality in the ASAP 7nm library, which otherwise, may take several hours or days to manually complete layouts of the quality level comparable to ours.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126996354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}