Pub Date: 2020-07-01 · DOI: 10.1109/DAC18072.2020.9218522
Feng Xiong, Fengbin Tu, Man Shi, Yang Wang, Leibo Liu, Shaojun Wei, S. Yin
Deep convolutional neural networks (DCNNs), with their extensive computation, require considerable external memory bandwidth and storage for intermediate feature maps. External memory accesses for feature maps have become a significant energy bottleneck for DCNN accelerators. Much work has been done on quantizing feature maps to low precision to reduce the costs of computation and storage. The large amount of correlation among channels in feature maps presents a further opportunity to reduce external memory access. Toward this end, we propose a novel compression framework called Significance-aware Transform-based Codec (STC). In its compression process, a significance-aware transform is introduced to obtain low-correlation feature maps in an orthogonal space, as the intrinsic representations of the original feature maps. The transformed feature maps are quantized and encoded to compress external data transmission. For the next layer's computation, the data are reloaded through STC's reconstruction process. The STC framework can be supported with a small set of extensions to current DCNN accelerators. We implement the STC extensions on a baseline TPU architecture for hardware evaluation. The strengthened TPU achieves an average 2.57× reduction in external memory access and a 1.95×~2.78× improvement in system-level energy efficiency, with a negligible accuracy loss of only 0.5%.
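The transform-quantize-reconstruct pipeline described in the abstract can be sketched in a few lines. This is a hedged illustration only: it uses a PCA-style orthogonal transform derived from the channel covariance as a stand-in for the paper's significance-aware transform, and the shapes, bit width, and data are invented.

```python
import numpy as np

# Illustrative sketch of STC's core idea: decorrelate channels with an
# orthogonal (PCA-style) transform before quantizing feature maps.
# The transform, int8 bit width, and shapes are assumptions for this demo,
# not the paper's actual significance-aware transform.
rng = np.random.default_rng(0)

C, H, W = 8, 4, 4
base = rng.normal(size=(H * W,))
# Correlated channels: each channel is a scaled copy of a base map plus noise.
fmap = np.stack([(i + 1) * base + 0.05 * rng.normal(size=H * W) for i in range(C)])

# Orthogonal transform from the channel covariance (its eigenvectors).
cov = fmap @ fmap.T / fmap.shape[1]
_, U = np.linalg.eigh(cov)           # U is orthogonal: U @ U.T == I
coeffs = U.T @ fmap                   # decorrelated representation

# Uniform int8 quantization of the transformed coefficients.
scale = np.abs(coeffs).max() / 127.0
q = np.round(coeffs / scale).astype(np.int8)

# Reconstruction for the next layer: dequantize, then invert the transform.
recon = U @ (q.astype(np.float64) * scale)
rel_err = np.linalg.norm(recon - fmap) / np.linalg.norm(fmap)
print(round(float(rel_err), 4))
```

Because the channels are highly correlated, the energy concentrates in a few transformed coefficients, so aggressive quantization of the rest costs little reconstruction accuracy.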
Title: "STC: Significance-aware Transform-based Codec Framework for External Memory Access Reduction" · 2020 57th ACM/IEEE Design Automation Conference (DAC)
Pub Date: 2020-07-01 · DOI: 10.1109/DAC18072.2020.9218591
I. Markov, Aneeqa Fatima, S. Isakov, S. Boixo
As quantum computers grow more capable, simulating them on conventional hardware becomes more challenging yet more attractive since this helps in design and verification. Some quantum algorithms and circuits are amenable to surprisingly efficient simulation, and this makes hard-to-simulate computations particularly valuable. For such circuits, we develop accurate massively-parallel simulation with dramatic speedups over earlier methods on 42- and 45-qubit circuits. We propose two ways to trade circuit fidelity for computational speedups, so as to match the error rate of any quantum computer. Using Google Cloud, we simulate approximate sampling from the output of a circuit with 7 × 8 qubits and depth 42 with fidelity 0.5% at an estimated cost of $35K.
Title: "Massively Parallel Approximate Simulation of Hard Quantum Circuits" · 2020 57th ACM/IEEE Design Automation Conference (DAC)
Pub Date: 2020-07-01 · DOI: 10.1109/DAC18072.2020.9218694
Zhenkun Yang, Yuriy Viktorov, Jin Yang, Jiewen Yao, Vincent Zimmer
This paper presents a fuzzing framework for Unified Extensible Firmware Interface (UEFI) BIOS built on the Simics virtual platform. Firmware has increasingly become an attack target as operating systems grow more secure. Due to its special execution environment and its extensive interaction with hardware, UEFI firmware is difficult to test compared to user-level applications running on an operating system. Fortunately, virtual platforms are widely used to enable early software and firmware development by modeling the target hardware platform in a virtual environment before silicon arrives, and they play a critical role in left-shifting UEFI firmware validation to the pre-silicon phase. We integrated fuzzing capability into the Simics virtual platform, allowing users to fuzz UEFI firmware code against the high-fidelity hardware models that Simics provides. We demonstrate the ability to automatically detect previously unknown bugs, as well as issues previously found only by human experts.
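The overall fuzz loop the abstract describes — restore a clean platform state, inject a mutated input, run, and record crashes — can be sketched without the real Simics API. Everything below is a stand-in: the toy capsule parser, the one-byte mutator, and the "snapshot restore" (here just re-copying the seed) are invented for illustration.

```python
import random

# Hedged sketch of a snapshot-driven fuzz loop (not the Simics API).
# A real setup would restore a Simics checkpoint before each run and feed
# the input to firmware code on modeled hardware; a deliberately buggy toy
# parser stands in for the firmware under test.
def parse_capsule(data: bytes) -> str:
    # Toy UEFI-capsule-like parser: 2-byte magic, 1-byte declared length.
    if len(data) < 4 or data[:2] != b"\xca\xfe":
        return "rejected"
    length = data[2]
    payload = data[4:4 + length]
    if length and len(payload) < length:   # header lies about payload size
        raise IndexError("payload shorter than declared length")
    return "ok"

def fuzz(seed: bytes, iters: int = 2000, rng=random.Random(1)):
    crashes = []
    for _ in range(iters):
        data = bytearray(seed)                 # "restore snapshot"
        pos = rng.randrange(len(data))
        data[pos] = rng.randrange(256)         # one-byte mutation
        try:
            parse_capsule(bytes(data))
        except Exception as exc:               # crash == potential bug
            crashes.append((bytes(data), repr(exc)))
    return crashes

found = fuzz(b"\xca\xfe\x02\x00\xab\xcd")
print(len(found), "crashing inputs found")
```

The deterministic seed makes the run reproducible; mutating the declared-length byte reliably triggers the planted bounds bug.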
Title: "UEFI Firmware Fuzzing with Simics Virtual Platform" · 2020 57th ACM/IEEE Design Automation Conference (DAC)
In recent years, Convolutional Neural Networks (CNNs) have been widely used in robotics, dramatically improving the perception and decision-making abilities of robots. A series of CNN accelerators have been designed to implement energy-efficient CNNs on embedded systems. However, despite their high energy efficiency, CNN accelerators are difficult for robotics developers to use. Since the various functions on a robot are usually implemented independently by different developers, simultaneous access to the CNN accelerator by these multiple independent processes results in hardware resource conflicts. To handle this problem, we propose an INterruptible CNN Accelerator (INCA) to enable multi-tasking on CNN accelerators. In INCA, we propose a Virtual-Instruction-based interrupt method (VI method) to support multi-tasking on CNN accelerators. Based on INCA, we deploy Distributed Simultaneous Localization and Mapping (DSLAM) on an embedded FPGA platform. We use CNNs to implement two key components of DSLAM, Feature-point Extraction (FE) and Place Recognition (PR), so that both can be accelerated on the same CNN accelerator. Experimental results show that, compared to the layer-by-layer interrupt method, our VI method reduces the interrupt response latency to 1%.
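The latency argument behind the VI method reduces to a simple comparison of preemption granularities. The sketch below uses made-up per-layer run times and a made-up virtual-instruction chunk size; the real VI method inserts preemption points inside a layer's instruction stream on the accelerator.

```python
# Hedged sketch of why finer-grained preemption points cut interrupt
# response latency. Layer times and chunk size are invented for illustration.
layer_times_us = [400, 1200, 800, 1600]   # hypothetical per-layer run times

# Layer-by-layer interruption: a request arriving just after a layer starts
# must wait for the whole layer, so the worst case is the longest layer.
worst_layer = max(layer_times_us)

# Virtual-instruction interruption: each layer is split into short
# virtual-instruction chunks; the worst case is one chunk.
vi_chunk_us = 16
worst_vi = vi_chunk_us

print(worst_layer, worst_vi, f"{worst_vi / worst_layer:.0%}")
```

With these illustrative numbers the worst-case wait shrinks from one full layer (1600 µs) to one chunk (16 µs), i.e. to 1% — the same ratio the paper reports for its benchmarks.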
Title: "INCA: INterruptible CNN Accelerator for Multi-tasking in Embedded Robots" · Jincheng Yu, Zhilin Xu, Shulin Zeng, Chao Yu, Jiantao Qiu, Chaoyang Shen, Yuanfan Xu, Guohao Dai, Yu Wang, Huazhong Yang · 2020 57th ACM/IEEE Design Automation Conference (DAC) · DOI: 10.1109/DAC18072.2020.9218717
Pub Date: 2020-07-01 · DOI: 10.1109/DAC18072.2020.9218592
Shuhan Zhang, Fan Yang, Dian Zhou, Xuan Zeng
In this paper, we propose EasyBO, an Efficient ASYnchronous Batch Bayesian Optimization approach for analog circuit synthesis. Instead of waiting for the slowest simulations in a batch to finish, the proposed approach accelerates the optimization procedure by asynchronously issuing the next query point whenever there is an idle worker. We introduce a new acquisition function that better explores the design space for asynchronous batch Bayesian optimization, along with a new strategy that better balances exploration and exploitation and guarantees the diversity of the query points. A penalization scheme is further proposed to avoid redundant queries during asynchronous batch optimization, improving the efficiency of optimization. Compared with the state-of-the-art batch Bayesian optimization algorithm, EasyBO achieves up to 7.35× speed-up without sacrificing the optimization results.
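The asynchronous issue-with-penalization loop can be sketched with a tiny Gaussian-process surrogate. This is a hedged illustration, not EasyBO's actual acquisition function: the RBF kernel, UCB weights, penalty shape, and 1-D toy objective are all assumptions.

```python
import numpy as np

# Hedged sketch of asynchronous batch BO with a penalized acquisition:
# pending (still-simulating) points repel new queries, so an idle worker
# immediately receives a diverse candidate instead of waiting for the batch.
def f(x):                         # stand-in for an expensive circuit simulation
    return -(x - 0.3) ** 2

def gp_posterior(X, y, Xs, ls=0.2, noise=1e-6):
    # Exact GP regression with a unit-variance RBF kernel.
    k = lambda a, b: np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ls ** 2))
    K = k(X, X) + noise * np.eye(len(X))
    Ks = k(Xs, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu, np.maximum(var, 1e-12)

def next_query(X, y, pending, grid):
    mu, var = gp_posterior(np.array(X), np.array(y), grid)
    ucb = mu + 2.0 * np.sqrt(var)             # upper confidence bound
    for p in pending:                         # penalize near in-flight points
        ucb -= np.exp(-((grid - p) ** 2) / (2 * 0.05 ** 2))
    return grid[int(np.argmax(ucb))]

grid = np.linspace(0.0, 1.0, 101)
X, y = [0.0, 1.0], [f(0.0), f(1.0)]           # two finished evaluations
pending = [0.5]                               # one worker still busy at x=0.5
x_new = next_query(X, y, pending, grid)
print(round(float(x_new), 2))
```

The penalty term pushes the new query away from the point already being simulated, which is how redundant queries are avoided without synchronizing the batch.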
Title: "An Efficient Asynchronous Batch Bayesian Optimization Approach for Analog Circuit Synthesis" · 2020 57th ACM/IEEE Design Automation Conference (DAC)
Pub Date: 2020-07-01 · DOI: 10.1109/DAC18072.2020.9218712
H. Cheng, I. Jiang, Oscar Ou
Timing optimization is performed repeatedly throughout the design flow, and the long turn-around time of querying a sign-off timer has become a bottleneck. To break through this bottleneck, a fast and accurate timing estimator is desirable to expedite timing closure. Unlike gate timing, which is calculated by interpolating lookup tables in cell libraries, wire timing calculation has remained a mystery in timing analysis. The undisclosed formula and complex net structures make it difficult to correlate with the results generated by a sign-off timer, which in turn prevents incremental timing optimization engines from estimating timing accurately without querying the sign-off timer. We attempt to solve the mystery with a novel machine-learning-based wire timing model. Different from prior machine learning models, we first extract topological features to capture the characteristics of RC networks. We then propose a loop-breaking algorithm that transforms non-tree nets into tree structures, so that non-tree nets can be handled the same way as tree-structured nets. Experiments are conducted on four industrial designs with tree-like nets (28nm) and two industrial designs with non-tree nets (16nm). Our results show that the prediction model trained by XGBoost is highly accurate: for both tree-like and non-tree nets, the mean error of wire delay is lower than 2 ps, and the predicted path arrival times have less than 1% mean error. Experimental results also demonstrate that our model can be trained once and applied to different designs in the same manufacturing process. Our fast and accurate wire timing prediction can easily be integrated into incremental timing optimization and expedites timing closure.
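The paper trains an XGBoost model on topological features of RC networks; as a self-contained stand-in, the sketch below computes the classic Elmore delay on a small RC tree — the kind of first-order estimate such a learned model would refine. The tree topology and R/C values are invented for illustration.

```python
# Elmore delay on a toy RC tree: driver -> n1 -> {n2 (sink A), n3 (sink B)}.
# Values are hypothetical; a real flow would extract these from parasitics.
parent = {"n1": "drv", "n2": "n1", "n3": "n1"}
res = {"n1": 100.0, "n2": 50.0, "n3": 80.0}      # ohms on edge (parent, node)
cap = {"n1": 2e-15, "n2": 5e-15, "n3": 3e-15}    # farads at each node

children = {}
for n, p in parent.items():
    children.setdefault(p, []).append(n)

def downstream_cap(node):
    # Total capacitance at and below this node.
    return cap[node] + sum(downstream_cap(c) for c in children.get(node, []))

def elmore(sink):
    # Sum over resistors on the driver->sink path of R * downstream cap.
    delay, n = 0.0, sink
    while n != "drv":
        delay += res[n] * downstream_cap(n)
        n = parent[n]
    return delay

for s in ("n2", "n3"):
    print(s, f"{elmore(s) * 1e12:.2f} ps")
```

Elmore delay correlates only loosely with sign-off results on complex nets, which is exactly the gap a learned model over richer topological features aims to close.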
Title: "Fast and Accurate Wire Timing Estimation on Tree and Non-Tree Net Structures" · 2020 57th ACM/IEEE Design Automation Conference (DAC)
Pub Date: 2020-07-01 · DOI: 10.1109/DAC18072.2020.9218701
Xizi Chen, Jingyang Zhu, Jingbo Jiang, C. Tsui
The unstructured sparsity left after pruning poses a challenge to the efficient implementation of deep learning models on existing regular architectures such as systolic arrays. Coarse-grained structured pruning, on the other hand, tends to incur a higher accuracy loss than unstructured pruning for pruned models of the same size. In this work, we propose a compression method based on unstructured pruning and a novel weight permutation scheme. Through permutation, the sparse weight matrix is further compressed into a small, dense format that makes full use of the hardware resources. Compared to state-of-the-art works, the matrix compression rate is improved from 5.88× to 10.28×. As a result, throughput and energy efficiency are improved by 2.12× and 1.57×, respectively.
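The permutation idea from the title — simulated annealing rearranging a pruned matrix so its nonzeros pack densely — can be shown on a toy scale. The matrix, block size, objective, and annealing schedule below are all invented for illustration and are not the paper's exact formulation.

```python
import math
import random

# Toy sketch: simulated annealing swaps rows and columns of a pruned
# (sparse) weight matrix to gather nonzeros into the top-left block,
# which could then be stored and computed densely.
rng = random.Random(3)
N, B = 8, 4                                  # N x N matrix, B x B dense block

# Scattered nonzero coordinates of a hypothetical pruned weight matrix.
nz = {(0, 5), (1, 2), (2, 7), (3, 1), (4, 6), (5, 0), (6, 3), (7, 4),
      (0, 2), (2, 1), (4, 3), (6, 5), (1, 7), (3, 6), (5, 4), (7, 0)}

def block_nnz(rp, cp):
    # Nonzeros landing inside the top-left B x B block under permutations.
    return sum((rp[i], cp[j]) in nz for i in range(B) for j in range(B))

rp, cp = list(range(N)), list(range(N))
cur = best = block_nnz(rp, cp)               # identity permutation: 4 of 16
temp = 2.0
for _ in range(4000):
    perm = rp if rng.random() < 0.5 else cp  # swap two rows or two columns
    a, b = rng.randrange(N), rng.randrange(N)
    perm[a], perm[b] = perm[b], perm[a]
    score = block_nnz(rp, cp)
    if score >= cur or rng.random() < math.exp((score - cur) / temp):
        cur = score                          # accept (possibly worse) move
        best = max(best, score)
    else:
        perm[a], perm[b] = perm[b], perm[a]  # revert rejected swap
    temp *= 0.999
print(best, "of", B * B, "block slots filled; identity started at 4")
```

Packing more nonzeros into dense blocks is what lets a regular array like a systolic engine consume a pruned model without per-element index handling.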
Title: "Tight Compression: Compressing CNN Model Tightly Through Unstructured Pruning and Simulated Annealing Based Permutation" · 2020 57th ACM/IEEE Design Automation Conference (DAC)
Pub Date: 2020-07-01 · DOI: 10.1109/DAC18072.2020.9218503
Minjin Tang, M. Wen, Junzhong Shen, Xiaolei Zhao, Chunyuan Zhang
Obtaining item frequencies in data streams with limited space is a well-recognized and challenging problem in a wide range of applications. Sketch-based solutions have been widely used to address this challenge because they record data streams accurately at a low memory cost. However, most sketches suffer from low memory utilization due to their fixed counter size. Accordingly, we propose a counter-cascading scheduling algorithm that maximizes the memory utilization of sketches without incurring any accuracy loss. In addition, we propose an FPGA-based system design that supports sketch parameter learning, counter-cascading recording, and online queries. We implement our designs on a Xilinx VCU118 and conduct evaluations on real-world traces, demonstrating that our design achieves higher accuracy with lower storage; the performance achieved is 10× to 20× better than that of state-of-the-art sketches.
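The motivating observation — small fixed counters saturate on heavy hitters while wasting bits on light ones — suggests letting a saturated cell cascade into a wider second-level counter. The sketch below is a hedged toy: the hash functions, counter widths, and two-level layout are illustrative assumptions, not the paper's FPGA design.

```python
import hashlib

# Toy counter-cascading count-min sketch: a level of 4-bit counters, plus a
# level of wide counters that a cell spills into once it saturates.
W, D, SMALL_MAX = 64, 2, 15                 # width, depth, 4-bit saturation

small = [[0] * W for _ in range(D)]         # level 1: small counters
wide = [dict() for _ in range(D)]           # level 2: overflow counters

def h(row, key):
    # Per-row hash into [0, W); sha256 stands in for FPGA-friendly hashes.
    return int(hashlib.sha256(f"{row}:{key}".encode()).hexdigest(), 16) % W

def update(key):
    for r in range(D):
        i = h(r, key)
        if small[r][i] < SMALL_MAX:
            small[r][i] += 1
        else:                                # cascade into a wide counter
            wide[r][i] = wide[r].get(i, 0) + 1

def query(key):
    # Count-min estimate: minimum over rows of (small + cascaded) counts.
    return min(small[r][h(r, key)] + wide[r].get(h(r, key), 0)
               for r in range(D))

for _ in range(100):
    update("heavy")                          # one heavy hitter
update("light")
print(query("heavy"), query("light"))
```

Light keys rarely touch the wide level, so most of the memory stays in cheap small counters while heavy hitters still count exactly — the utilization gain the abstract describes.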
Title: "Towards Memory-Efficient Streaming Processing with Counter-Cascading Sketching on FPGA" · 2020 57th ACM/IEEE Design Automation Conference (DAC)
Pub Date: 2020-07-01 · DOI: 10.1109/DAC18072.2020.9218726
Donghyun Kang, S. Ha
There is a growing need for data reorganization in recent neural networks for applications such as Generative Adversarial Networks (GANs), which use transposed convolution, and U-Net, which requires upsampling. We propose a novel technique, called the tensor virtualization technique, to perform data reorganization efficiently with minimal hardware additions for adder-tree-based CNN accelerators. In the proposed technique, a data reorganization request is specified with a few parameters, and the reorganization is performed in a virtual space without overhead in physical memory. This allows existing adder-tree-based CNN accelerators to accelerate a wide range of neural networks that require data reorganization, including U-Net, DCGAN, and SRGAN.
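Reading a "virtually reorganized" tensor amounts to translating each virtual coordinate back to a physical one on the fly, rather than materializing the reorganized data. The sketch below shows this for a 2× nearest-neighbor upsample; the sizes and the specific mapping are invented for illustration and are not the paper's parameter scheme.

```python
# Hedged sketch of tensor virtualization: the upsampled tensor is never
# stored; an index-mapping function translates virtual coordinates to
# physical ones, as an accelerator's fetch unit might do in hardware.
H, W, S = 3, 3, 2                          # physical size and upsample stride
phys = [[r * W + c for c in range(W)] for r in range(H)]

def virtual_read(vr, vc):
    # Virtual (S*H x S*W) tensor, materialized nowhere: map back to physical.
    return phys[vr // S][vc // S]

# Reference: an actually materialized upsample, used only to check the map.
mat = [[phys[r // S][c // S] for c in range(W * S)] for r in range(H * S)]
ok = all(virtual_read(r, c) == mat[r][c]
         for r in range(H * S) for c in range(W * S))
print(ok)
```

Transposed convolution can be handled the same way, with the mapping additionally emitting zeros at the inserted positions instead of duplicating neighbors.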
Title: "Tensor Virtualization Technique to Support Efficient Data Reorganization for CNN Accelerators" · 2020 57th ACM/IEEE Design Automation Conference (DAC)
Pub Date: 2020-07-01 · DOI: 10.1109/DAC18072.2020.9218511
D. Serpanos, Shengqi Yang, M. Wolf
This paper surveys results in the use of neural networks and deep learning in two areas of hardware security: power attacks and physically-unclonable functions (PUFs).
Title: "Neural Network-Based Side Channel Attacks and Countermeasures" · 2020 57th ACM/IEEE Design Automation Conference (DAC)