
Proceedings of the 59th ACM/IEEE Design Automation Conference: Latest Publications

ABNN2
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530680
Liyan Shen, Ye Dong, Binxing Fang, Jinqiao Shi, Xuebin Wang, Shengli Pan, Ruisheng Shi
Data privacy and security concerns prevent many potential cloud-based machine-learning services from being deployed. Recently, secure multi-party computation (MPC) has been used to achieve secure neural network prediction while guaranteeing data privacy. However, existing two-party solutions are expensive and impractical in real-world settings. In this work, we combine the advantages of quantized neural networks (QNNs) and MPC to present ABNN2, a practical secure two-party framework that realizes arbitrary-bitwidth quantized neural network prediction. Concretely, we propose a novel and efficient matrix multiplication protocol based on 1-out-of-N OT extension and optimize it through a parallel scheme. In addition, we design an optimized protocol for the ReLU function. Experiments demonstrate that our protocols are about 2X-36X and 1.4X-7X faster than SecureML (S&P'17) and MiniONN (CCS'17), respectively, and ABNN2 obtains efficiency comparable to the state-of-the-art QNN prediction protocol QUOTIENT (CCS'19), though the latter only supports ternary neural networks.
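As an illustration of the secret-sharing structure that two-party prediction protocols of this kind build on (not ABNN2's actual 1-out-of-N OT construction), the sketch below runs a quantized matrix multiplication on additive shares using a Beaver triple; the ring size, matrix shapes, and trusted dealer are assumptions made for the example.

```python
import numpy as np

Q = 2**32  # shares live in the ring Z_{2^32} (a common choice, assumed here)

def share(x, rng):
    """Additively split a matrix into two uniformly random shares mod Q."""
    r = rng.integers(0, Q, size=x.shape, dtype=np.uint64)
    return r, (x - r) % Q

def beaver_matmul(x_sh, w_sh, a_sh, b_sh, c_sh):
    """Secure X @ W from shares, given a dealer-supplied triple C = A @ B.
    Opening E = X - A and F = W - B leaks nothing: A and B are uniform masks."""
    (x0, x1), (w0, w1) = x_sh, w_sh
    (a0, a1), (b0, b1), (c0, c1) = a_sh, b_sh, c_sh
    e = (x0 + x1 - a0 - a1) % Q               # opened by both parties
    f = (w0 + w1 - b0 - b1) % Q               # opened by both parties
    z0 = (c0 + e @ b0 + a0 @ f + e @ f) % Q   # party 0's local share
    z1 = (c1 + e @ b1 + a1 @ f) % Q           # party 1's local share
    return z0, z1

rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(2, 4), dtype=np.uint64)  # 8-bit activations
W = rng.integers(0, 256, size=(4, 3), dtype=np.uint64)  # 8-bit weights
A = rng.integers(0, Q, size=X.shape, dtype=np.uint64)
B = rng.integers(0, Q, size=W.shape, dtype=np.uint64)
C = (A @ B) % Q
z0, z1 = beaver_matmul(share(X, rng), share(W, rng),
                       share(A, rng), share(B, rng), share(C, rng))
assert np.array_equal((z0 + z1) % Q, (X @ W) % Q)  # shares reconstruct X @ W
```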
{"title":"ABNN\u0000 <sup>2</sup>","authors":"Liyan Shen, Ye Dong, Binxing Fang, Jinqiao Shi, Xuebin Wang, Shengli Pan, Ruisheng Shi","doi":"10.1145/3489517.3530680","DOIUrl":"https://doi.org/10.1145/3489517.3530680","url":null,"abstract":"Data privacy and security issues are preventing a lot of potential on-cloud machine learning as services from happening. In the recent past, secure multi-party computation (MPC) has been used to achieve the secure neural network predictions, guaranteeing the privacy of data. However, the cost of the existing two-party solutions is expensive and they are impractical in real-world setting. In this work, we utilize the advantages of quantized neural network (QNN) and MPC to present ABNN2, a practical secure two-party framework that can realize arbitrary-bitwidth quantized neural network predictions. Concretely, we propose an efficient and novel matrix multiplication protocol based on 1-out-of-N OT extension and optimize the the protocol through a parallel scheme. In addition, we design optimized protocol for the ReLU function. The experiments demonstrate that our protocols are about 2X-36X and 1.4X--7X faster than SecureML (S&P'17) and MiniONN (CCS'17) respectively. And ABNN2 obtain comparable efficiency as state of the art QNN prediction protocol QUOTIENT (CCS'19), but the later only supports ternary neural network.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116833273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 6
PathFinder: side channel protection through automatic leaky paths identification and obfuscation
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530413
Haocheng Ma, Qizhi Zhang, Ya Gao, Jiaji He, Yiqiang Zhao, Yier Jin
Side-channel analysis (SCA) attacks pose an enormous threat to cryptographic integrated circuits (ICs). To address this threat, designers try to adopt various countermeasures during the IC development process. However, many existing solutions are costly in terms of area, power, and/or performance, and may require full-custom circuit design for proper implementation. In this paper, we propose PathFinder, a tool that automatically identifies leaky paths and protects the design, and that is compatible with the commercial design flow. The tool first finds the subset of logic cells that leak the most information through dynamic correlation analysis. PathFinder then exploits static security checking to construct complete leaky paths from these cells. Once leaky paths are identified, PathFinder applies appropriate hardware countermeasures, including Boolean masking and random precharge, to eliminate information leakage from these paths. The effectiveness of PathFinder is validated through both simulation and physical measurements on FPGA implementations. Results demonstrate more than 1000X improvement in side-channel resistance, with less than a 6.53% penalty in power, area, and performance.
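The "dynamic correlation analysis" step can be pictured as CPA-style scoring: correlate each cell's simulated power against a key-dependent intermediate value and rank. A minimal sketch under assumed inputs (per-cell traces and a Hamming-weight leakage model); PathFinder's actual analysis and security checks are more involved.

```python
import numpy as np

def rank_leaky_cells(cell_traces, secret_hw):
    """cell_traces: (n_cells, n_runs) per-cell power estimates across runs;
    secret_hw: (n_runs,) key-dependent intermediate under a leakage model
    (e.g., Hamming weight of an S-box output). Returns cell indices ranked
    by absolute Pearson correlation, leakiest first."""
    h = secret_hw - secret_hw.mean()
    t = cell_traces - cell_traces.mean(axis=1, keepdims=True)
    denom = np.sqrt((t**2).sum(axis=1) * (h**2).sum())
    scores = np.abs(t @ h) / np.where(denom == 0, 1.0, denom)
    return np.argsort(scores)[::-1], scores
```

The top-ranked cells would then seed the static construction of complete leaky paths and become the candidates for Boolean masking or random precharge.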
Cited by: 1
SATO: spiking neural network acceleration via temporal-oriented dataflow and architecture
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530592
Fangxin Liu, Wenbo Zhao, Zongwu Wang, Yongbiao Chen, Tao Yang, Zhezhi He, Xiaokang Yang, Li Jiang
Event-driven spiking neural networks (SNNs) have shown great promise for being strikingly energy-efficient. SNN neurons integrate incoming spikes, accumulate the membrane potential, and fire an output spike when the potential exceeds a threshold. Existing SNN accelerators, however, have to carry out this accumulate-compare operation serially. Repetitive spike generation at each time step not only increases latency and the overall energy budget, but also incurs the memory-access overhead of fetching membrane potentials, both of which lessen the efficiency of SNN accelerators. Meanwhile, the inherently highly sparse spikes of SNNs lead to imbalanced workloads among neurons that hinder the utilization of processing elements (PEs). This paper proposes SATO, a temporal-parallel SNN accelerator that accumulates the membrane potential for all time steps in parallel. The SATO architecture contains a novel binary adder-search tree to generate the output spike train, which decouples the chronological dependence in the accumulate-compare operation. Moreover, SATO can evenly dispatch the compressed workloads to all PEs with maximized data locality of input spike trains using a bucket-sort-based method. Our evaluations show that SATO outperforms the 8-bit version of the previous ANN accelerator "Eyeriss" by 30.9× in speedup and 12.3× in energy saving. Compared with the state-of-the-art SNN accelerator "SpinalFlow", SATO achieves a 6.4× performance gain and a 4.8× energy reduction, which is quite impressive for inference.
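The temporal-parallel idea can be sketched numerically: compute the membrane potential for all time steps at once with a prefix sum instead of a serial accumulate-compare loop. The toy below assumes integrate-and-fire neurons without reset, which is a simplification; SATO's binary adder-search tree handles the full spike-generation semantics in hardware.

```python
import numpy as np

def if_spike_train_parallel(inputs, threshold=1.0):
    """inputs: (T, n) integrated synaptic current per time step and neuron.
    Computes the membrane potential at every step via a prefix sum (no
    step-by-step loop), then marks the first threshold crossing per neuron.
    Assumes no reset after firing, a simplification of real IF dynamics."""
    v = np.cumsum(inputs, axis=0)             # potentials for all T at once
    crossed = v >= threshold                  # (T, n) boolean
    prev = np.vstack([np.zeros_like(crossed[:1]), crossed[:-1]])
    return crossed & ~prev                    # spike only at first crossing

spikes = if_spike_train_parallel(np.random.rand(8, 4).astype(np.float32))
```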
Cited by: 4
A timing engine inspired graph neural network model for pre-routing slack prediction
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530597
Zizheng Guo, Mingjie Liu, Jiaqi Gu, Shuhan Zhang, D. Pan, Yibo Lin
Fast and accurate pre-routing timing prediction is essential for timing-driven placement, since repetitive routing and static timing analysis (STA) iterations are expensive and unacceptable. Prior work on timing prediction aims at estimating net delay and slew, lacking the ability to model global timing metrics. In this work, we present a timing-engine-inspired graph neural network (GNN) to predict arrival time and slack at timing endpoints. We further leverage edge delays as local auxiliary tasks to facilitate model training and increase model performance. Experimental results on real-world open-source designs demonstrate improved model accuracy and explainability compared with vanilla deep GNN models.
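The analogy the paper draws is between STA's forward propagation and message passing in a GNN. For reference, the STA side that the model learns to imitate is a max-plus recursion over the timing graph; a minimal sketch with assumed dictionary inputs and invented node names:

```python
def arrival_times(fanin, delay, launch):
    """fanin: node -> list of predecessors in the timing DAG;
    delay: (u, v) -> edge delay; launch: primary input -> launch time.
    Classic STA forward pass: AT(v) = max_u (AT(u) + delay(u, v)).
    Slack at an endpoint is then RAT(v) - AT(v)."""
    at = dict(launch)
    order, seen = [], set()

    def visit(v):                      # DFS topological ordering
        if v in seen:
            return
        seen.add(v)
        for u in fanin.get(v, []):
            visit(u)
        order.append(v)

    for v in fanin:
        visit(v)
    for v in order:
        if v not in at:
            at[v] = max(at[u] + delay[(u, v)] for u in fanin[v])
    return at

# toy graph: two inputs converge on gate g, which drives endpoint o
at = arrival_times({"g": ["a", "b"], "o": ["g"]},
                   {("a", "g"): 1.0, ("b", "g"): 2.5, ("g", "o"): 0.7},
                   {"a": 0.0, "b": 0.0})   # -> at["o"] == 3.2
```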
Cited by: 29
Optimizing quantum circuit synthesis for permutations using recursion
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530654
C.‐Y. Chen, B. Schmitt, Helena Zhang, L. Bishop, Ali Javadi-Abhari
We describe a family of recursive methods for the synthesis of qubit permutations on quantum computers with limited qubit connectivity. Two objectives are of importance: circuit size and depth. In each case we combine a scalable heuristic with a non-scalable, yet exact, synthesis.
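To make the problem concrete: on a line of qubits, any permutation can be realized with adjacent SWAPs via odd-even transposition sort, in depth at most n. A sketch of that classic scalable heuristic follows; the paper's recursive methods generalize well beyond linear connectivity and also trade off circuit size, so treat this purely as an illustration of the routing task.

```python
def route_permutation_line(state):
    """state[p] = logical qubit currently at physical position p on a line.
    Emits layers of adjacent SWAP gates that move each qubit q to position q.
    Odd-even transposition sort: at most len(state) layers, so depth O(n)."""
    n, layers = len(state), []
    for rnd in range(n):
        layer = []
        for i in range(rnd % 2, n - 1, 2):   # disjoint pairs -> one layer
            if state[i] > state[i + 1]:
                state[i], state[i + 1] = state[i + 1], state[i]
                layer.append((i, i + 1))
        if layer:
            layers.append(layer)
    return layers

print(route_permutation_line([2, 0, 3, 1]))  # [[(0, 1), (2, 3)], [(1, 2)]]
```

Each inner list is one layer of SWAPs acting on disjoint qubit pairs, so the number of layers is the circuit depth added by routing.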
Cited by: 1
Differentiable-timing-driven global placement
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530486
Zizheng Guo, Yibo Lin
Placement is critical to the timing closure of the very-large-scale integrated (VLSI) circuit design flow. This paper proposes a differentiable-timing-driven global placement framework inspired by deep neural networks. By establishing an analogy between static timing analysis and neural network propagation, we propose a differentiable timing objective for placement that explicitly optimizes timing metrics such as total negative slack (TNS) and worst negative slack (WNS). The framework achieves up to 32.7% and 59.1% improvements in WNS and TNS, respectively, compared with the state-of-the-art timing-driven placer, and a 1.80× speed-up when both run on GPU.
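A key enabler for any such framework is replacing the non-differentiable min/max in WNS and TNS with smooth surrogates. One common choice, assumed here for illustration (the paper's objective is built on its own differentiable STA engine), is log-sum-exp:

```python
import numpy as np

def smooth_wns(slacks, beta=10.0):
    """Soft minimum of endpoint slacks via log-sum-exp; approaches the exact
    WNS as beta -> inf. Shifted by the min for numerical stability."""
    s = np.asarray(slacks, dtype=float)
    m = s.min()
    return m - np.log(np.sum(np.exp(-beta * (s - m)))) / beta

def smooth_tns(slacks, beta=10.0):
    """Sum of min(slack, 0) with min relaxed by softplus, so every endpoint,
    not only violating ones, receives a gradient. (For strongly negative
    slacks a shifted softplus form would be needed to avoid overflow.)"""
    s = np.asarray(slacks, dtype=float)
    return -np.sum(np.log1p(np.exp(-beta * s))) / beta

print(smooth_wns([0.3, -1.2, 0.05]), smooth_tns([0.3, -1.2, 0.05]))
```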
Cited by: 9
XMA
Pub Date : 2022-07-10 DOI: 10.1007/978-3-642-33335-4_240007
Fan Zhang, Li Yang, Jian Meng, J.-s. Seo, Yu Cao, Deliang Fan
{"title":"XMA","authors":"Fan Zhang, Li Yang, Jian Meng, J.-s. Seo, Yu Cao, Deliang Fan","doi":"10.1007/978-3-642-33335-4_240007","DOIUrl":"https://doi.org/10.1007/978-3-642-33335-4_240007","url":null,"abstract":"","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125409745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 0
FPGA-aware automatic acceleration framework for vision transformer with mixed-scheme quantization: late breaking results
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530618
Mengshu Sun, Z. Li, Alec Lu, Haoyu Ma, Geng Yuan, Yanyue Xie, Hao Tang, Yanyu Li, M. Leeser, Zhangyang Wang, Xue Lin, Zhenman Fang
Vision transformers (ViTs) are emerging with significantly improved accuracy in computer vision tasks. However, their complex architecture and enormous computation/storage demands impose an urgent need for new hardware accelerator design methodologies. This work proposes an FPGA-aware automatic ViT acceleration framework based on the proposed mixed-scheme quantization. To the best of our knowledge, this is the first FPGA-based ViT acceleration framework exploring model quantization. Compared with state-of-the-art ViT quantization work (an algorithmic approach without hardware acceleration), our quantization achieves 0.31% to 1.25% higher Top-1 accuracy under the same bit-width. Compared with the 32-bit floating-point baseline FPGA accelerator, our accelerator achieves around a 5.6× improvement in frame rate (56.4 FPS vs. 10.0 FPS) with a 0.83% accuracy drop for DeiT-base.
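"Mixed-scheme" means different quantizers coexist in one model. As a toy rendition (the paper's actual assignment criteria and FPGA mapping are its contribution; everything below is an assumed simplification), each weight row gets whichever of fixed-point or power-of-two quantization reconstructs it better; power-of-two values reduce multiplies to bit shifts on FPGA:

```python
import numpy as np

def quant_fixed(w, bits=4):
    """Uniform fixed-point quantization with a per-row scale."""
    m = np.abs(w).max()
    scale = m / (2 ** (bits - 1) - 1) if m > 0 else 1.0
    return np.round(w / scale) * scale

def quant_pow2(w):
    """Snap magnitudes to powers of two: a multiply becomes a shift."""
    mag = 2.0 ** np.round(np.log2(np.abs(w) + 1e-12))
    return np.sign(w) * np.where(np.abs(w) > 1e-6, mag, 0.0)

def mixed_scheme(W, bits=4):
    """Pick, per output channel (row), the scheme with lower MSE."""
    out = np.empty_like(W)
    for i, row in enumerate(W):
        cands = (quant_fixed(row, bits), quant_pow2(row))
        out[i] = min(cands, key=lambda q: float(np.mean((q - row) ** 2)))
    return out

Wq = mixed_scheme(np.random.randn(8, 16).astype(np.float32) * 0.1)
```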
Cited by: 3
PriMax
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530431
Nicholas Wendt, Todd M. Austin, V. Bertacco
Domain-specific languages (DSLs) improve developer productivity by abstracting away low-level details of an algorithm's implementation within a specialized domain. These languages often provide powerful primitives to describe complex operations, potentially granting flexibility during compilation to target hardware acceleration. This work proposes PriMax, a novel methodology to effectively map DSL applications to hardware accelerators. It builds decision trees based on benchmark results, which select between distinct implementations of accelerated primitives to maximize a target performance metric. In our graph analytics case study with two accelerators, PriMax produces a geometric mean speedup of 1.57x over a multicore CPU, higher than either target accelerator alone, and approaching the maximum 1.58x speedup attainable with these target accelerators.
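A minimal rendition of the dispatch mechanism, with invented features and labels (the features PriMax actually uses and its tree construction are the paper's own): fit a decision tree on benchmark outcomes, then route each incoming workload to the predicted-fastest backend.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical benchmark log: per-workload features -> fastest backend.
# Features (invented for the example): [log10(edge count), average degree].
X = np.array([[4.0, 2.1], [4.2, 8.5], [6.1, 2.3],
              [6.3, 9.0], [5.0, 3.0], [5.5, 7.5]])
y = np.array([0, 1, 0, 2, 0, 2])      # 0 = CPU, 1 = accel A, 2 = accel B

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

def pick_backend(n_edges, avg_degree):
    """Dispatch one graph-analytics workload to the predicted winner."""
    return int(tree.predict([[np.log10(n_edges), avg_degree]])[0])

print(pick_backend(2_000_000, 8.0))
```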
{"title":"PriMax","authors":"Nicholas Wendt, Todd M. Austin, V. Bertacco","doi":"10.1145/3489517.3530431","DOIUrl":"https://doi.org/10.1145/3489517.3530431","url":null,"abstract":"Domain-specific languages (DSLs) improve developer productivity by abstracting away low-level details of an algorithm's implementation within a specialized domain. These languages often provide powerful primitives to describe complex operations, potentially granting flexibility during compilation to target hardware acceleration. This work proposes PriMax, a novel methodology to effectively map DSL applications to hardware accelerators. It builds decision trees based on benchmark results, which select between distinct implementations of accelerated primitives to maximize a target performance metric. In our graph analytics case study with two accelerators, PriMax produces a geometric mean speedup of 1.57x over a multicore CPU, higher than either target accelerator alone, and approaching the maximum 1.58x speedup attainable with these target accelerators.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126225927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 0
EBSP: evolving bit sparsity patterns for hardware-friendly inference of quantized deep neural networks
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530660
Fangxin Liu, Wenbo Zhao, Zongwu Wang, Yongbiao Chen, Zhezhi He, Naifeng Jing, Xiaoyao Liang, Li Jiang
Model compression has been extensively investigated for supporting efficient neural network inference on edge-computing platforms, due to huge model sizes and computation amounts. Recent research embraces joint-way compression across multiple techniques for extreme compression. However, most joint-way methods adopt a naive solution that applies two approaches sequentially, which can be sub-optimal, as it lacks a systematic way to incorporate them. This paper proposes EBSP, the integration of aggressive joint-way compression into hardware design. It is motivated by three observations: 1) quantization allows simplifying hardware implementations; 2) the bit distribution of quantized weights can be viewed as an independent trainable variable; and 3) exploiting bit sparsity in the quantized network has the potential to achieve better performance. To achieve that, this paper introduces bit sparsity patterns to construct a highly expressive yet inherently regular bit distribution in the quantized network. We further incorporate our sparsity constraint in training to evolve the bit distributions toward the bit sparsity pattern. Moreover, the structure of the introduced bit sparsity pattern yields a minimal hardware implementation at competitive classification accuracy. Specifically, the quantized network constrained by the bit sparsity pattern can be processed using LUTs with very few entries instead of multipliers, in minimally modified computational hardware. Our experiments show that, compared to Eyeriss, BitFusion, WAX, and OLAccel, EBSP achieves 87.3%, 79.7%, 75.2%, and 58.9% energy reduction and 93.8%, 83.7%, 72.7%, and 49.5% performance gain on average, respectively, with less than 0.8% accuracy loss.
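The core constraint can be illustrated by its projection step: snap each quantized weight to the nearest value with at most k set bits, so a multiply collapses to at most k shift-adds or a tiny LUT. A sketch under assumed parameters; EBSP's contribution is evolving the bit distribution toward such patterns during training, which is not shown here.

```python
def nearest_k_bit(value, width=8, k=2):
    """Project an unsigned 'width'-bit integer onto the closest value whose
    binary representation has at most k ones (a bit sparsity pattern).
    Exhaustive search is fine at this scale: 2^8 = 256 candidates."""
    cands = [c for c in range(2 ** width) if bin(c).count("1") <= k]
    return min(cands, key=lambda c: abs(c - value))

print(nearest_k_bit(109))  # -> 96 (0b01100000): only two set bits
```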
由于模型大小和计算量巨大,为了支持边缘计算平台上高效的神经网络推理,模型压缩已经得到了广泛的研究。最近的研究涵盖了多种极端压缩技术的联合压缩。然而,大多数联合方法采用一种简单的解决方案,顺序地应用两种方法,这可能是次优的,因为它缺乏一种系统的方法来合并它们。本文提出了将主动联合压缩集成到硬件设计中,即EBSP。其动机是:1)量化允许简化硬件实现;2)量化权重的位分布可以看作是一个独立的可训练变量;3)在量化网络中利用比特稀疏有可能获得更好的性能。为了实现这一目标,本文引入了比特稀疏模式,以在量化网络中构建具有高表达性和固有规则的比特分布。我们进一步将稀疏性约束纳入到训练中,将固有的位分布演化为位稀疏模式。此外,所引入的位稀疏模式的结构在竞争性分类精度下实现了最小的硬件实现。具体来说,受位稀疏模式约束的量化网络可以在最小修改的计算硬件中使用具有最少条目的lut而不是乘法器来处理。我们的实验表明,与Eyeriss、BitFusion、WAX和OLAccel相比,EBSP在精度损失小于0.8%的情况下,平均能量降低了87.3%、79.7%、75.2%和58.9%,性能提高了93.8%、83.7%、72.7%和49.5%。
{"title":"EBSP: evolving bit sparsity patterns for hardware-friendly inference of quantized deep neural networks","authors":"Fangxin Liu, Wenbo Zhao, Zongwu Wang, Yongbiao Chen, Zhezhi He, Naifeng Jing, Xiaoyao Liang, Li Jiang","doi":"10.1145/3489517.3530660","DOIUrl":"https://doi.org/10.1145/3489517.3530660","url":null,"abstract":"Model compression has been extensively investigated for supporting efficient neural network inference on edge-computing platforms due to the huge model size and computation amount. Recent researches embrace joint-way compression across multiple techniques for extreme compression. However, most joint-way methods adopt a naive solution that applies two approaches sequentially, which can be sub-optimal, as it lacks a systematic approach to incorporate them. This paper proposes the integration of aggressive joint-way compression into hardware design, namely EBSP. It is motivated by 1) the quantization allows simplifying hardware implementations; 2) the bit distribution of quantized weights can be viewed as an independent trainable variable; 3) the exploitation of bit sparsity in the quantized network has the potential to achieve better performance. To achieve that, this paper introduces the bit sparsity patterns to construct both highly expressive and inherently regular bit distribution in the quantized network. We further incorporate our sparsity constraint in training to evolve inherently bit distributions to the bit sparsity pattern. Moreover, the structure of the introduced bit sparsity pattern engenders minimum hardware implementation under competitive classification accuracy. Specifically, the quantized network constrained by bit sparsity pattern can be processed using LUTs with the fewest entries instead of multipliers in minimally modified computational hardware. Our experiments show that compared to Eyeriss, BitFusion, WAX, and OLAccel, EBSP with less than 0.8% accuracy loss, can achieve 87.3%, 79.7%, 75.2% and 58.9% energy reduction and 93.8%, 83.7%, 72.7% and 49.5% performance gain on average, respectively.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115107551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 1