
Latest Publications: 2020 57th ACM/IEEE Design Automation Conference (DAC)

STC: Significance-aware Transform-based Codec Framework for External Memory Access Reduction
Pub Date: 2020-07-01 DOI: 10.1109/DAC18072.2020.9218522
Feng Xiong, Fengbin Tu, Man Shi, Yang Wang, Leibo Liu, Shaojun Wei, S. Yin
Deep convolutional neural networks (DCNNs), with extensive computation, require considerable external memory bandwidth and storage for intermediate feature maps. External memory accesses for feature maps become a significant energy bottleneck for DCNN accelerators. Much work has been done on quantizing feature maps to low precision to decrease computation and storage costs. The large amount of correlation among channels in feature maps offers a further opportunity to reduce external memory access. Towards this end, we propose a novel compression framework called Significance-aware Transform-based Codec (STC). In its compression process, a significance-aware transform is introduced to obtain low-correlated feature maps in an orthogonal space, as the intrinsic representations of the original feature maps. The transformed feature maps are quantized and encoded to compress external data transmission. For the next layer's computation, the data are reloaded through STC's reconstruction process. The STC framework can be supported with a small set of extensions to current DCNN accelerators. We implement STC extensions to the baseline TPU architecture for hardware evaluation. The strengthened TPU achieves an average 2.57x reduction in external memory access and a 1.95x~2.78x improvement in system-level energy efficiency, with a negligible accuracy loss of only 0.5%.
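To make the transform-and-reconstruct flow concrete, here is a minimal numpy sketch of the general idea: channels of a feature map are decorrelated with a PCA-style orthogonal transform, quantized with more bits for the most significant transformed channels, and reconstructed with the inverse transform. The transform choice, bit allocation, and function names are illustrative assumptions, not the paper's codec.

```python
# Illustrative sketch only (not the paper's implementation).
import numpy as np

def compress_reconstruct(fmap, keep_hi=8):
    # Decorrelate channels, quantize with significance-aware bit widths, reconstruct.
    C, H, W = fmap.shape
    X = fmap.reshape(C, -1).astype(np.float64)            # one row per channel
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean
    cov = Xc @ Xc.T / Xc.shape[1]                         # channel covariance
    eigval, eigvec = np.linalg.eigh(cov)
    T = eigvec[:, np.argsort(eigval)[::-1]].T             # orthogonal transform, most significant first
    Y = T @ Xc                                            # low-correlated representation

    def quant(v, bits):                                   # uniform symmetric quantizer
        scale = np.abs(v).max() / (2 ** (bits - 1) - 1) + 1e-12
        return np.round(v / scale) * scale

    # Assumed allocation: 8 bits for the top channels, 4 bits for the rest.
    Yq = np.vstack([quant(Y[i], 8 if i < keep_hi else 4) for i in range(C)])
    return (T.T @ Yq + mean).reshape(C, H, W)             # reconstruction for the next layer

fmap = np.random.randn(32, 14, 14).astype(np.float32)
rec = compress_reconstruct(fmap)
print("mean abs reconstruction error:", float(np.abs(rec - fmap).mean()))
```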
{"title":"STC: Significance-aware Transform-based Codec Framework for External Memory Access Reduction","authors":"Feng Xiong, Fengbin Tu, Man Shi, Yang Wang, Leibo Liu, Shaojun Wei, S. Yin","doi":"10.1109/DAC18072.2020.9218522","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218522","url":null,"abstract":"Deep convolutional neural networks (DCNNs), with extensive computation, require considerable external memory bandwidth and storage for intermediate feature maps. External memory accesses for feature maps become a significant energy bottleneck for DCNN accelerators. Many works have been done on quantizing feature maps into low precision to decrease the costs for computation and storage. There is an opportunity that the large amount of correlation among channels in feature maps can be exploited to further reduce external memory access. Towards this end, we propose a novel compression framework called Significance-aware Transform-based Codec (STC). In its compression process, significance-aware transform is introduced to obtain low-correlated feature maps in an orthogonal space, as the intrinsic representations of original feature maps. The transformed feature maps are quantized and encoded to compress external data transmission. For the next layer computation, the data will be reloaded with STC’s reconstruction process. The STC framework can be supported with a small set of extensions to current DCNN accelerators. We implement STC extensions to the baseline TPU architecture for hardware evaluation. The strengthened TPU achieves average reduction of 2.57x in external memory access, 1.95x~2.78x improvement of system-level energy efficiency, with a negligible accuracy loss of only 0.5%.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115268633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Massively Parallel Approximate Simulation of Hard Quantum Circuits
Pub Date: 2020-07-01 DOI: 10.1109/DAC18072.2020.9218591
I. Markov, Aneeqa Fatima, S. Isakov, S. Boixo
As quantum computers grow more capable, simulating them on conventional hardware becomes more challenging yet more attractive since this helps in design and verification. Some quantum algorithms and circuits are amenable to surprisingly efficient simulation, and this makes hard-to-simulate computations particularly valuable. For such circuits, we develop accurate massively-parallel simulation with dramatic speedups over earlier methods on 42- and 45-qubit circuits. We propose two ways to trade circuit fidelity for computational speedups, so as to match the error rate of any quantum computer. Using Google Cloud, we simulate approximate sampling from the output of a circuit with 7 × 8 qubits and depth 42 with fidelity 0.5% at an estimated cost of $35K.
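As background for the fidelity-versus-cost trade-off, the sketch below samples from a depolarized distribution f·p + (1−f)·uniform over n-qubit bitstrings, which is one standard way to model sampling at fidelity f; it is an illustration of the concept, not the paper's simulation algorithm.

```python
# Conceptual sketch (an assumption, not the paper's method).
import numpy as np

def sample_at_fidelity(p, fidelity, num_samples, rng):
    # Depolarized output: with weight `fidelity` the ideal distribution, otherwise uniform.
    q = fidelity * p + (1.0 - fidelity) / p.size
    return rng.choice(p.size, size=num_samples, p=q)

rng = np.random.default_rng(0)
n_qubits = 10
amps = rng.normal(size=2 ** n_qubits) + 1j * rng.normal(size=2 ** n_qubits)
p = np.abs(amps) ** 2
p /= p.sum()                                     # stand-in for a circuit's ideal output distribution
samples = sample_at_fidelity(p, fidelity=0.005, num_samples=8, rng=rng)
print([format(int(s), f"0{n_qubits}b") for s in samples])
```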
{"title":"Massively Parallel Approximate Simulation of Hard Quantum Circuits","authors":"I. Markov, Aneeqa Fatima, S. Isakov, S. Boixo","doi":"10.1109/DAC18072.2020.9218591","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218591","url":null,"abstract":"As quantum computers grow more capable, simulating them on conventional hardware becomes more challenging yet more attractive since this helps in design and verification. Some quantum algorithms and circuits are amenable to surprisingly efficient simulation, and this makes hard-to-simulate computations particularly valuable. For such circuits, we develop accurate massively-parallel simulation with dramatic speedups over earlier methods on 42- and 45-qubit circuits. We propose two ways to trade circuit fidelity for computational speedups, so as to match the error rate of any quantum computer. Using Google Cloud, we simulate approximate sampling from the output of a circuit with 7 × 8 qubits and depth 42 with fidelity 0.5% at an estimated cost of $35K.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"473 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123055392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
UEFI Firmware Fuzzing with Simics Virtual Platform
Pub Date: 2020-07-01 DOI: 10.1109/DAC18072.2020.9218694
Zhenkun Yang, Yuriy Viktorov, Jin Yang, Jiewen Yao, Vincent Zimmer
This paper presents a fuzzing framework for Unified Extensible Firmware Interface (UEFI) BIOS with the Simics virtual platform. Firmware has increasingly become an attack target as operating systems are getting more and more secure. Due to its special execution environment and its extensive interaction with hardware, UEFI firmware is difficult to test compared to user-level applications running on operating systems. Fortunately, virtual platforms are widely used to enable early software and firmware development by modeling the target hardware platform in a virtual environment before silicon arrives. Virtual platforms play a critical role in left-shifting UEFI firmware validation to the pre-silicon phase. We integrated the fuzzing capability into the Simics virtual platform to allow users to fuzz UEFI firmware code with the high-fidelity hardware models provided by Simics. We demonstrate the ability to automatically detect previously unknown bugs, as well as issues previously found only by human experts.
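For readers unfamiliar with fuzzing, the sketch below shows the shape of a coverage-guided fuzz loop of the kind such a framework drives against firmware running in a virtual platform. The target is a toy parser standing in for a UEFI entry point; the Simics integration, snapshotting, and crash triage are abstracted away, and all names are hypothetical.

```python
# Toy coverage-guided fuzz loop (all names and the target are illustrative assumptions).
import random

def toy_firmware_parse(data):
    # Stand-in for the firmware code under test, with one injected bug.
    trace = set()
    if len(data) >= 4 and data[:4] == b"UEFI":
        trace.add("magic-ok")
        if len(data) > 8 and data[8] == 0xFF:
            trace.add("flag-path")
            raise RuntimeError("crash: unchecked length field")
    return trace

def mutate(seed, rng):
    data = bytearray(seed)
    for _ in range(rng.randint(1, 4)):
        data[rng.randrange(len(data))] = rng.randrange(256)
    return bytes(data)

def fuzz(iterations=20000):
    rng = random.Random(0)
    corpus = [b"UEFI" + bytes(8)]                 # seed input
    coverage, crashes = set(), []
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        try:
            trace = toy_firmware_parse(candidate)
        except RuntimeError as err:
            crashes.append((candidate, str(err)))
            continue
        if not trace <= coverage:                 # keep inputs that reach new coverage
            coverage |= trace
            corpus.append(candidate)
    return crashes

print(len(fuzz()), "crashing inputs found")
```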
{"title":"UEFI Firmware Fuzzing with Simics Virtual Platform","authors":"Zhenkun Yang, Yuriy Viktorov, Jin Yang, Jiewen Yao, Vincent Zimmer","doi":"10.1109/DAC18072.2020.9218694","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218694","url":null,"abstract":"This paper presents a fuzzing framework for Unified Extensible Firmware Interface (UEFI) BIOS with the Simics virtual platform. Firmware has increasingly become an attack target as operating systems are getting more and more secure. Due to its special execution environment and the extensive interaction with hardware, UEFI firmware is difficult to test compared to user-level applications running on operating systems. Fortunately, virtual platforms are widely used to enable early software and firmware development by modeling the target hardware platform in its virtual environment before silicon arrives. Virtual platforms play a critical role in left shifting UEFI firmware validation to pre-silicon phase. We integrated the fuzzing capability into Simics virtual platform to allow users to fuzz UEFI firmware code with high-fidelity hardware models provided by Simics. We demonstrated the ability to automatically detect previously unknown bugs, and issues found only by human experts.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117147579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
INCA: INterruptible CNN Accelerator for Multi-tasking in Embedded Robots
Pub Date: 2020-07-01 DOI: 10.1109/DAC18072.2020.9218717
Jincheng Yu, Zhilin Xu, Shulin Zeng, Chao Yu, Jiantao Qiu, Chaoyang Shen, Yuanfan Xu, Guohao Dai, Yu Wang, Huazhong Yang
In recent years, Convolutional Neural Networks (CNNs) have been widely used in robotics, dramatically improving the perception and decision-making ability of robots. A series of CNN accelerators have been designed to implement energy-efficient CNNs on embedded systems. However, despite their high energy efficiency, CNN accelerators are difficult for robotics developers to use. Since the various functions on a robot are usually implemented independently by different developers, simultaneous access to the CNN accelerator by these multiple independent processes results in hardware resource conflicts. To handle this problem, we propose an INterruptible CNN Accelerator (INCA) to enable multi-tasking on CNN accelerators. In INCA, we propose a Virtual-Instruction-based interrupt method (VI method) to support multi-tasking on CNN accelerators. Based on INCA, we deploy Distributed Simultaneous Localization and Mapping (DSLAM) on an embedded FPGA platform. We use CNNs to implement two key components of DSLAM, Feature-point Extraction (FE) and Place Recognition (PR), so that both can be accelerated on the same CNN accelerator. Experimental results show that, compared to the layer-by-layer interrupt method, our VI method reduces the interrupt response latency to 1%.
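The latency benefit of interrupting at virtual-instruction granularity can be illustrated with a small simulation: an interrupt raised mid-layer is serviced at the next tile boundary rather than at the end of the layer. The cycle counts, tile size, and function below are assumptions for illustration, not the INCA hardware model.

```python
# Illustrative latency comparison (numbers are assumptions, not INCA measurements).
def simulate(layer_cycles, tile_cycles, irq_time):
    # Execute one layer tile by tile; service a pending IRQ at the next tile boundary.
    clock, served_at = 0, None
    while clock < layer_cycles and served_at is None:
        clock += min(tile_cycles, layer_cycles - clock)
        if clock >= irq_time:
            served_at = clock
    return served_at - irq_time

layer = 10_000
print("layer-granularity response latency:", simulate(layer, layer, irq_time=1_050), "cycles")
print("tile-granularity response latency :", simulate(layer, 100, irq_time=1_050), "cycles")
```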
{"title":"INCA: INterruptible CNN Accelerator for Multi-tasking in Embedded Robots","authors":"Jincheng Yu, Zhilin Xu, Shulin Zeng, Chao Yu, Jiantao Qiu, Chaoyang Shen, Yuanfan Xu, Guohao Dai, Yu Wang, Huazhong Yang","doi":"10.1109/DAC18072.2020.9218717","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218717","url":null,"abstract":"In recent years, Convolutional Neural Network (CNN) has been widely used in robotics, which has dramatically improved the perception and decision-making ability of robots. A series of CNN accelerators have been designed to implement energy-efficient CNN on embedded systems. However, despite the high energy efficiency on CNN accelerators, it is difficult for robotics developers to use it. Since the various functions on the robot are usually implemented independently by different developers, simultaneous access to the CNN accelerator by these multiple independent processes will result in hardware resources conflicts.To handle the above problem, we propose an INterruptible CNN Accelerator (INCA) to enable multi-tasking on CNN accelerators. In INCA, we propose a Virtual-Instruction-based interrupt method (VI method) to support multi-task on CNN accelerators. Based on INCA, we deploy the Distributed Simultaneously Localization and Mapping (DSLAM) on an embedded FPGA platform. We use CNN to implement two key components in DSLAM, Feature-point Extraction (FE) and Place Recognition (PR), so that they can both be accelerated on the same CNN accelerator. Experimental results show that, compared to the layer-by-layer interrupt method, our VI method reduces the interrupt respond latency to 1%.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117217500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
An Efficient Asynchronous Batch Bayesian Optimization Approach for Analog Circuit Synthesis
Pub Date: 2020-07-01 DOI: 10.1109/DAC18072.2020.9218592
Shuhan Zhang, Fan Yang, Dian Zhou, Xuan Zeng
In this paper, we propose EasyBO, an Efficient ASYnchronous Batch Bayesian Optimization approach for analog circuit synthesis. In this approach, instead of waiting for the slowest simulations in a batch to finish, we accelerate the optimization procedure by asynchronously issuing the next query point whenever there is an idle worker. We introduce a new acquisition function that better explores the design space for asynchronous batch Bayesian optimization. A new strategy is proposed to better balance exploration and exploitation and to guarantee the diversity of the query points, and a penalization scheme is proposed to further avoid redundant queries during asynchronous batch optimization. The efficiency of optimization can thus be further improved. Compared with the state-of-the-art batch Bayesian optimization algorithm, EasyBO achieves up to 7.35× speed-up without sacrificing the optimization results.
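The asynchronous-batch idea can be sketched in a few lines: whenever a simulated worker finishes, it is immediately re-issued a new query point chosen by an acquisition that is penalized near points still being evaluated, so pending evaluations never stall the loop and the batch stays diverse. The surrogate, acquisition, and penalty below are simple stand-ins, not EasyBO's actual model.

```python
# Toy asynchronous-batch loop (surrogate and acquisition are stand-ins, not EasyBO).
import heapq
import numpy as np

def objective(x):                        # expensive black-box stand-in (e.g. a circuit simulation)
    return (x - 0.3) ** 2 + 0.1 * np.sin(20 * x)

def acquisition(cand, X, y, pending, length=0.1):
    # Crude kernel-regression surrogate plus a penalty around points still being evaluated.
    w = np.exp(-((cand[:, None] - np.array(X)[None, :]) / length) ** 2)
    mu = (w @ np.array(y)) / (w.sum(axis=1) + 1e-9)
    penalty = sum(np.exp(-((cand - p) / length) ** 2) for p in pending)
    return -mu - penalty                 # prefer low predicted value, away from pending queries

rng = np.random.default_rng(1)
X, y, pending = [], [], {}
events = [(rng.uniform(0.5, 1.5), w) for w in range(4)]      # (finish_time, worker_id)
heapq.heapify(events)
for w in range(4):
    pending[w] = rng.uniform(0.0, 1.0)                       # initial random queries
for _ in range(20):
    t, w = heapq.heappop(events)                             # some worker finishes...
    x_done = pending.pop(w)
    X.append(x_done)
    y.append(objective(x_done))
    cand = rng.uniform(0.0, 1.0, 256)
    x_new = cand[np.argmax(acquisition(cand, X, y, list(pending.values())))]
    pending[w] = x_new                                       # ...and is immediately re-issued a query
    heapq.heappush(events, (t + rng.uniform(0.5, 1.5), w))
print("best x:", round(float(X[int(np.argmin(y))]), 4), "best value:", round(float(min(y)), 5))
```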
{"title":"An Efficient Asynchronous Batch Bayesian Optimization Approach for Analog Circuit Synthesis","authors":"Shuhan Zhang, Fan Yang, Dian Zhou, Xuan Zeng","doi":"10.1109/DAC18072.2020.9218592","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218592","url":null,"abstract":"In this paper, we propose EasyBO, an Efficient ASYn-chronous Batch Bayesian Optimization approach for analog circuit synthesis. In this proposed approach, instead of waiting for the slowest simulations in the batch to finish, we accelerate the optimization procedure by asynchronously issuing the next query points whenever there is an idle worker. We introduce a new acquisition function which can better explore the design space for asynchronous batch Bayesian optimization. A new strategy is proposed to better balance the exploration and exploitation and guarantee the diversity of the query points. And a penalization scheme is proposed to further avoid redundant queries during the asynchronous batch optimization. The efficiency of optimization can thus be further improved. Compared with the state-of-the-art batch Bayesian optimization algorithm, EasyBO achieves up to 7.35× speed-up without sacrificing the optimization results.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"317 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121111469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
Fast and Accurate Wire Timing Estimation on Tree and Non-Tree Net Structures
Pub Date: 2020-07-01 DOI: 10.1109/DAC18072.2020.9218712
H. Cheng, I. Jiang, Oscar Ou
Timing optimization is repeatedly performed throughout the entire design flow. The long turn-around time of querying a sign-off timer has become a bottleneck. To break through the bottleneck, a fast and accurate timing estimator is desirable to expedite the pace of timing closure. Unlike gate timing, which is calculated by interpolating lookup tables in cell libraries, wire timing calculation has remained a mystery in timing analysis. The mysterious formula and complex net structures increase the difficulty to correlate with the results generated by a sign-off timer, thus further preventing incremental timing optimization engines from accurate timing estimation without querying a sign-off timer. We attempt to solve the mystery by a novel machine-learning-based wire timing model. Different from prior machine learning models, we first extract topological features to capture the characteristics of RC networks. Then, we propose a loop breaking algorithm to transform non-tree nets into tree structures, and thus non-tree nets can be handled in the same way as tree-structured nets. Experiments are conducted on four industrial designs with tree-like nets (28nm) and two industrial designs with non-tree nets (16nm). Our results show that the prediction model trained by XGBoost is highly accurate: For both tree-like and non-tree nets, the mean error of wire delay is lower than 2 ps. The predicted path arrival times have less than 1% mean error. Experimental results also demonstrate that our model can be trained only once and applied to different designs using the same manufacturing process. Our fast and accurate wire timing prediction can easily be integrated into incremental timing optimization and expedites timing closure.
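The overall modeling flow, under simplifying assumptions, looks like the sketch below: per-sink topological and electrical features are extracted from an RC tree and fed to a regression model trained against sign-off delays. The paper trains XGBoost on its own feature set; here a dependency-free least-squares fit on synthetic features stands in.

```python
# Sketch of the modeling flow under simplifying assumptions (features and data are synthetic).
import numpy as np

rng = np.random.default_rng(0)

def synthetic_net_features(n_sinks):
    # Hypothetical per-sink features of an RC tree (names and ranges are assumptions).
    r_path = rng.uniform(10, 500, n_sinks)         # driver-to-sink path resistance (ohm)
    c_down = rng.uniform(1e-15, 50e-15, n_sinks)   # downstream capacitance (F)
    fanout = rng.integers(1, 8, n_sinks).astype(float)
    elmore = r_path * c_down                       # first-order delay estimate (s)
    X = np.column_stack([r_path, c_down * 1e15, fanout, elmore * 1e12])
    # Synthetic "sign-off" delay in ps: Elmore plus a topology-dependent correction.
    y = elmore * 1e12 * (1.1 + 0.05 * fanout) + rng.normal(0, 0.2, n_sinks)
    return X, y

X_train, y_train = synthetic_net_features(2000)
X_test, y_test = synthetic_net_features(200)
A_train = np.column_stack([X_train, np.ones(len(X_train))])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)     # stand-in for the XGBoost model
pred = np.column_stack([X_test, np.ones(len(X_test))]) @ coef
print("mean absolute delay error (ps):", float(np.abs(pred - y_test).mean()))
```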
{"title":"Fast and Accurate Wire Timing Estimation on Tree and Non-Tree Net Structures","authors":"H. Cheng, I. Jiang, Oscar Ou","doi":"10.1109/DAC18072.2020.9218712","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218712","url":null,"abstract":"Timing optimization is repeatedly performed throughout the entire design flow. The long turn-around time of querying a sign-off timer has become a bottleneck. To break through the bottleneck, a fast and accurate timing estimator is desirable to expedite the pace of timing closure. Unlike gate timing, which is calculated by interpolating lookup tables in cell libraries, wire timing calculation has remained a mystery in timing analysis. The mysterious formula and complex net structures increase the difficulty to correlate with the results generated by a sign-off timer, thus further preventing incremental timing optimization engines from accurate timing estimation without querying a sign-off timer. We attempt to solve the mystery by a novel machine-learning-based wire timing model. Different from prior machine learning models, we first extract topological features to capture the characteristics of RC networks. Then, we propose a loop breaking algorithm to transform non-tree nets into tree structures, and thus non-tree nets can be handled in the same way as tree-structured nets. Experiments are conducted on four industrial designs with tree-like nets (28nm) and two industrial designs with non-tree nets (16nm). Our results show that the prediction model trained by XGBoost is highly accurate: For both tree-like and non-tree nets, the mean error of wire delay is lower than 2 ps. The predicted path arrival times have less than 1% mean error. Experimental results also demonstrate that our model can be trained only once and applied to different designs using the same manufacturing process. Our fast and accurate wire timing prediction can easily be integrated into incremental timing optimization and expedites timing closure.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124868840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
Tight Compression: Compressing CNN Model Tightly Through Unstructured Pruning and Simulated Annealing Based Permutation
Pub Date: 2020-07-01 DOI: 10.1109/DAC18072.2020.9218701
Xizi Chen, Jingyang Zhu, Jingbo Jiang, C. Tsui
The unstructured sparsity after pruning poses a challenge to the efficient implementation of deep learning models in existing regular architectures like systolic arrays. The coarse-grained structured pruning, on the other hand, tends to have higher accuracy loss than unstructured pruning when the pruned models are of the same size. In this work, we propose a compression method based on the unstructured pruning and a novel weight permutation scheme. Through permutation, the sparse weight matrix is further compressed to a small and dense format to make full use of the hardware resources. Compared to the state-of-the-art works, the matrix compression rate is effectively improved from 5.88x to 10.28x. As a result, the throughput and energy efficiency are improved by 2.12 and 1.57 times, respectively.
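A toy version of the permutation step is sketched below: rows of a pruned weight matrix are reordered by simulated annealing so that rows within each group share column positions and each group can be stored as a small dense block. The cost model, group size, and annealing schedule are assumptions for illustration, not the paper's formulation.

```python
# Toy simulated-annealing permutation (cost model and schedule are assumptions).
import math
import random
import numpy as np

def packed_cost(W, order, group=4):
    # Storage if each group of rows is packed into a dense block over its occupied columns.
    cost = 0
    for g in range(0, len(order), group):
        rows = W[order[g:g + group]]
        cost += np.count_nonzero(rows.any(axis=0)) * rows.shape[0]
    return cost

rng = np.random.default_rng(0)
W = (rng.random((64, 64)) < 0.15) * rng.standard_normal((64, 64))    # ~85% pruned weights
order = list(range(64))
random.seed(0)
cost = packed_cost(W, order)
T = 50.0
for _ in range(20000):
    i, j = random.sample(range(64), 2)
    order[i], order[j] = order[j], order[i]                          # propose a row swap
    new_cost = packed_cost(W, order)
    if new_cost < cost or random.random() < math.exp((cost - new_cost) / T):
        cost = new_cost                                              # accept
    else:
        order[i], order[j] = order[j], order[i]                      # undo
    T *= 0.9995                                                      # cool down
print("dense entries stored:", int(cost), "of", W.size, "; nonzeros:", int(np.count_nonzero(W)))
```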
{"title":"Tight Compression: Compressing CNN Model Tightly Through Unstructured Pruning and Simulated Annealing Based Permutation","authors":"Xizi Chen, Jingyang Zhu, Jingbo Jiang, C. Tsui","doi":"10.1109/DAC18072.2020.9218701","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218701","url":null,"abstract":"The unstructured sparsity after pruning poses a challenge to the efficient implementation of deep learning models in existing regular architectures like systolic arrays. The coarse-grained structured pruning, on the other hand, tends to have higher accuracy loss than unstructured pruning when the pruned models are of the same size. In this work, we propose a compression method based on the unstructured pruning and a novel weight permutation scheme. Through permutation, the sparse weight matrix is further compressed to a small and dense format to make full use of the hardware resources. Compared to the state-of-the-art works, the matrix compression rate is effectively improved from 5.88x to 10.28x. As a result, the throughput and energy efficiency are improved by 2.12 and 1.57 times, respectively.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125927851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Towards Memory-Efficient Streaming Processing with Counter-Cascading Sketching on FPGA
Pub Date: 2020-07-01 DOI: 10.1109/DAC18072.2020.9218503
Minjin Tang, M. Wen, Junzhong Shen, Xiaolei Zhao, Chunyuan Zhang
Obtaining item frequencies in data streams with limited space is a well-recognized and challenging problem in a wide range of applications. Sketch-based solutions have been widely used to address this challenge due to their ability to accurately record the data streams at a low memory cost. However, most sketches suffer from low memory utilization due to the adoption of a fixed counter size. Accordingly, in this work, we propose a counter-cascading scheduling algorithm to maximize the memory utilization of sketches without incurring any accuracy loss. In addition, we propose an FPGA-based system design that supports sketch parameter learning, counter-cascading record and online query. We implement our designs on Xilinx VCU118, and conduct evaluations on real-world traces, thereby demonstrating that our design can achieve higher accuracy with lower storage; the performance achieved is 10× ∼ 20× better than that of state-of-the-art sketches.
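As background, the sketch below shows a count-min sketch whose narrow 8-bit primary counters escalate into a sparse table of wider counters when they saturate, which conveys why cascading small counters raises memory utilization without losing counts. The escalation rule and parameters are assumptions, not the paper's counter-cascading algorithm.

```python
# Background sketch: count-min with small counters that spill into wider overflow counters.
# The escalation rule is an assumption for illustration, not the paper's scheme.
import hashlib

class CascadedCountMin:
    def __init__(self, width=1024, depth=3, primary_max=255):
        self.width, self.depth, self.primary_max = width, depth, primary_max
        self.primary = [[0] * width for _ in range(depth)]    # narrow 8-bit counters
        self.overflow = [dict() for _ in range(depth)]        # sparse wide counters

    def _index(self, item, row):
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            idx = self._index(item, row)
            total = self.primary[row][idx] + count
            if total <= self.primary_max:
                self.primary[row][idx] = total
            else:                                             # cascade the excess
                self.primary[row][idx] = self.primary_max
                self.overflow[row][idx] = self.overflow[row].get(idx, 0) + (total - self.primary_max)

    def query(self, item):
        return min(self.primary[row][self._index(item, row)]
                   + self.overflow[row].get(self._index(item, row), 0)
                   for row in range(self.depth))

sketch = CascadedCountMin()
for _ in range(1000):
    sketch.add("heavy-flow")
sketch.add("light-flow")
print(sketch.query("heavy-flow"), sketch.query("light-flow"))
```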
{"title":"Towards Memory-Efficient Streaming Processing with Counter-Cascading Sketching on FPGA","authors":"Minjin Tang, M. Wen, Junzhong Shen, Xiaolei Zhao, Chunyuan Zhang","doi":"10.1109/DAC18072.2020.9218503","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218503","url":null,"abstract":"Obtaining item frequencies in data streams with limited space is a well-recognized and challenging problem in a wide range of applications. Sketch-based solutions have been widely used to address this challenge due to their ability to accurately record the data streams at a low memory cost. However, most sketches suffer from low memory utilization due to the adoption of a fixed counter size. Accordingly, in this work, we propose a counter-cascading scheduling algorithm to maximize the memory utilization of sketches without incurring any accuracy loss. In addition, we propose an FPGA-based system design that supports sketch parameter learning, counter-cascading record and online query. We implement our designs on Xilinx VCU118, and conduct evaluations on real-world traces, thereby demonstrating that our design can achieve higher accuracy with lower storage; the performance achieved is 10× ∼ 20× better than that of state-of-the-art sketches.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126647529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Tensor Virtualization Technique to Support Efficient Data Reorganization for CNN Accelerators
Pub Date: 2020-07-01 DOI: 10.1109/DAC18072.2020.9218726
Donghyun Kang, S. Ha
There is a growing need for data reorganization in recent neural networks for various applications, such as Generative Adversarial Networks (GANs) that use transposed convolution and U-Net, which requires upsampling. We propose a novel technique, called the tensor virtualization technique, to perform data reorganization efficiently with minimal hardware additions for adder-tree-based CNN accelerators. In the proposed technique, a data reorganization request is specified with a few parameters, and the reorganization is performed in a virtual space without overhead in physical memory. This allows existing adder-tree-based CNN accelerators to accelerate a wide range of neural networks that require data reorganization, including U-Net, DCGAN, and SRGAN.
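The virtual-space idea can be illustrated with an index-mapping function: the zero-inserted, padded view needed by a transposed convolution is never materialized; a few parameters map each virtual coordinate back to a physical element or to an implicit zero at access time. The parameter names and mapping rule below are illustrative assumptions, not the paper's interface.

```python
# Minimal sketch of virtual-space data reorganization (names and rule are assumptions).
import numpy as np

def virtual_read(fmap, y, x, stride=2, pad=1):
    """Read element (y, x) of the zero-inserted, padded view of fmap without storing it."""
    yy, xx = y - pad, x - pad
    if yy < 0 or xx < 0 or yy % stride or xx % stride:
        return 0.0                                   # implicit zero, no storage
    yy, xx = yy // stride, xx // stride
    if yy >= fmap.shape[0] or xx >= fmap.shape[1]:
        return 0.0
    return fmap[yy, xx]                              # single physical access

fmap = np.arange(9, dtype=np.float32).reshape(3, 3)
H, W, stride, pad = 3, 3, 2, 1
virt = np.array([[virtual_read(fmap, y, x, stride, pad)
                  for x in range(W * stride + pad)]
                 for y in range(H * stride + pad)])
print(virt)         # the reorganized view, produced on the fly at access time
```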
{"title":"Tensor Virtualization Technique to Support Efficient Data Reorganization for CNN Accelerators","authors":"Donghyun Kang, S. Ha","doi":"10.1109/DAC18072.2020.9218726","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218726","url":null,"abstract":"There is a growing need for data reorganization in recent neural networks for various applications such as Generative Adversarial Networks(GANs) that use transposed convolution and U-Net that requires upsampling. We propose a novel technique, called tensor virtualization technique, to perform data reorganization efficiently with a minimal hardware addition for adder-tree based CNN accelerators. In the proposed technique, a data reorganization request is specified with a few parameters and data reorganization is performed in the virtual space without overhead in the physical memory. It allows existing adder-tree-based CNN accelerators to accelerate a wide range of neural networks that require data reorganization, including U-Net, DCGAN, and SRGAN.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115325740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Neural Network-Based Side Channel Attacks and Countermeasures
Pub Date: 2020-07-01 DOI: 10.1109/DAC18072.2020.9218511
D. Serpanos, Shengqi Yang, M. Wolf
This paper surveys results in the use of neural networks and deep learning in two areas of hardware security: power attacks and physically-unclonable functions (PUFs).
{"title":"Neural Network-Based Side Channel Attacks and Countermeasures","authors":"D. Serpanos, Shengqi Yang, M. Wolf","doi":"10.1109/DAC18072.2020.9218511","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218511","url":null,"abstract":"This paper surveys results in the use of neural networks and deep learning in two areas of hardware security: power attacks and physically-unclonable functions (PUFs).","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122424249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1