C. Pilato, D. Sciuto, Benjamin Tan, S. Garg, R. Karri
Due to the globalization of the electronics supply chain, hardware engineers are increasingly interested in modifying their chip designs to protect their intellectual property (IP) or the privacy of end users. However, the integration of state-of-the-art solutions for hardware and hardware-assisted security is not fully automated, requiring modifications to stable tools and industrial toolchains. This significantly limits their application in industrial designs, potentially affecting the security of the resulting chips. We discuss how existing solutions can be adapted to implement security features at higher levels of abstraction (during high-level synthesis or directly at the register-transfer level) and complement current industrial design and verification flows. Our modular framework allows designers to compose these solutions and create additional protection layers.
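The composability the abstract describes can be pictured as chaining independent protection passes over a design representation. The sketch below is only an illustration of that idea; the `Design` container and pass names are hypothetical and not the authors' framework or API.

```python
# Minimal sketch of composable protection passes; class and pass names are
# hypothetical illustrations, not the paper's framework.
from dataclasses import dataclass, field

@dataclass
class Design:
    name: str
    notes: list = field(default_factory=list)   # records which protections were applied

def obfuscate_rtl(design: Design) -> Design:
    design.notes.append("RTL obfuscation inserted")
    return design

def add_key_logic(design: Design) -> Design:
    design.notes.append("logic-locking key logic added")
    return design

def compose(*passes):
    """Chain protection passes into a single transformation."""
    def run(design: Design) -> Design:
        for p in passes:
            design = p(design)
        return design
    return run

protect = compose(obfuscate_rtl, add_key_logic)
print(protect(Design("aes_core")).notes)
```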
{"title":"High-level design methods for hardware security: is it the right choice? invited","authors":"C. Pilato, D. Sciuto, Benjamin Tan, S. Garg, R. Karri","doi":"10.1145/3489517.3530635","DOIUrl":"https://doi.org/10.1145/3489517.3530635","url":null,"abstract":"Due to the globalization of the electronics supply chain, hardware engineers are increasingly interested in modifying their chip designs to protect their intellectual property (IP) or the privacy of the final users. However, the integration of state-of-the-art solutions for hardware and hardware-assisted security is not fully automated, requiring the amendment of stable tools and industrial toolchains. This significantly limits the application in industrial designs, potentially affecting the security of the resulting chips. We discuss how existing solutions can be adapted to implement security features at higher levels of abstractions (during high-level synthesis or directly at the register-transfer level) and complement current industrial design and verification flows. Our modular framework allows designers to compose these solutions and create additional protection layers.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115396770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xinyi Zhang, Cong Hao, P. Zhou, A. Jones, Jingtong Hu
The complex nature of real-world problems calls for heterogeneity in both machine learning (ML) models and hardware systems. The heterogeneity in ML models comes from multi-sensor perception and multi-task learning, i.e., multi-modality multi-task (MMMT) workloads, resulting in diverse deep neural network (DNN) layers and computation patterns. The heterogeneity in systems comes from diverse processing components, as integrating multiple dedicated accelerators into one system has become the prevailing approach. Therefore, a new problem emerges: heterogeneous model to heterogeneous system mapping (H2H). While previous mapping algorithms mostly focus on efficient computation, in this work we argue that computation and communication must be considered simultaneously for better system efficiency. We propose a novel H2H mapping algorithm with both computation and communication awareness; by slightly trading computation for communication, overall system latency and energy consumption can be largely reduced. The superior performance of our work is evaluated with MAESTRO modeling, demonstrating 15%-74% latency reduction and 23%-64% energy reduction compared with existing computation-prioritized mapping algorithms. Code is publicly available at https://github.com/xyzxinyizhang/H2H.
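The central trade-off can be sketched as a toy greedy mapper that charges each layer both its compute time on a candidate accelerator and the cost of moving the previous layer's output between accelerators. The cost numbers and the greedy strategy below are illustrative assumptions, not the released H2H algorithm (see the linked repository for the actual code).

```python
# Toy computation+communication-aware mapping: assign each DNN layer to an
# accelerator, charging compute time plus the cost of moving the previous
# layer's output between accelerators. All numbers are made up.
layers = ["conv1", "conv2", "attention", "fc"]
compute_time = {            # per-layer latency on each accelerator (ms)
    "conv1":     {"conv_acc": 1.0, "gemm_acc": 3.0},
    "conv2":     {"conv_acc": 1.2, "gemm_acc": 2.5},
    "attention": {"conv_acc": 4.0, "gemm_acc": 1.5},
    "fc":        {"conv_acc": 3.0, "gemm_acc": 0.8},
}
transfer_time = 0.9         # cost of moving an activation between accelerators (ms)

placement, prev, total = {}, None, 0.0
for layer in layers:
    best = min(
        compute_time[layer],
        key=lambda acc: compute_time[layer][acc]
                        + (transfer_time if prev and acc != prev else 0.0),
    )
    total += compute_time[layer][best] + (transfer_time if prev and best != prev else 0.0)
    placement[layer], prev = best, best

print(placement, f"total latency = {total:.1f} ms")
```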
{"title":"H2H","authors":"Xinyi Zhang, Cong Hao, P. Zhou, A. Jones, Jingtong Hu","doi":"10.1145/3489517.3530509","DOIUrl":"https://doi.org/10.1145/3489517.3530509","url":null,"abstract":"The complex nature of real-world problems calls for heterogeneity in both machine learning (ML) models and hardware systems. The heterogeneity in ML models comes from multi-sensor perceiving and multi-task learning, i.e., multi-modality multi-task (MMMT), resulting in diverse deep neural network (DNN) layers and computation patterns. The heterogeneity in systems comes from diverse processing components, as it becomes the prevailing method to integrate multiple dedicated accelerators into one system. Therefore, a new problem emerges: heterogeneous model to heterogeneous system mapping (H2H). While previous mapping algorithms mostly focus on efficient computations, in this work, we argue that it is indispensable to consider computation and communication simultaneously for better system efficiency. We propose a novel H2H mapping algorithm with both computation and communication awareness; by slightly trading computation for communication, the system overall latency and energy consumption can be largely reduced. The superior performance of our work is evaluated based on MAESTRO modeling, demonstrating 15%-74% latency reduction and 23%-64% energy reduction compared with existing computation-prioritized mapping algorithms. Code is publicly available at https://github.com/xyzxinyizhang/H2H.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127464376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Quantifying the uncertainty of neural networks (NNs) is required in many safety-critical applications such as autonomous driving and medical diagnosis. Recently, Bayesian transformers have demonstrated their capability to provide high-quality uncertainty estimates paired with excellent accuracy. However, their real-time deployment is limited by the compute-intensive attention mechanism at the core of the transformer architecture and by the repeated Monte Carlo sampling needed to quantify predictive uncertainty. To address these limitations, this paper accelerates Bayesian transformers via both algorithmic and hardware optimizations. On the algorithmic level, an evolutionary algorithm (EA)-based framework is proposed to exploit the sparsity in Bayesian transformers and ease their computational workload. On the hardware level, we demonstrate that this sparsity brings performance improvements to our optimized CPU and GPU implementations. An adaptable hardware architecture is also proposed to accelerate Bayesian transformers on an FPGA. Extensive experiments demonstrate that the EA-based framework, together with the hardware optimizations, reduces the latency of Bayesian transformers by up to 13x, 12x, and 20x on CPU, GPU, and FPGA platforms respectively, while achieving higher algorithmic performance.
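The repeated Monte Carlo sampling that makes Bayesian transformers expensive can be illustrated with MC dropout on a small PyTorch transformer encoder: keeping dropout active at inference and averaging several stochastic forward passes yields a predictive mean and an uncertainty estimate. This is a generic sketch, not the authors' EA-pruned model or FPGA datapath.

```python
# MC-dropout sketch: repeated stochastic forward passes give a predictive
# mean and a per-class standard deviation (the uncertainty being quantified).
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, dropout=0.1, batch_first=True),
    num_layers=2,
)
head = nn.Linear(64, 10)

x = torch.randn(8, 16, 64)          # (batch, sequence, features)
encoder.train()                      # keep dropout active to draw stochastic samples
with torch.no_grad():
    samples = torch.stack([head(encoder(x)).mean(dim=1).softmax(-1) for _ in range(20)])

pred_mean = samples.mean(dim=0)      # predictive distribution
uncertainty = samples.std(dim=0)     # spread across Monte Carlo samples
print(pred_mean.shape, uncertainty.shape)
```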
{"title":"Enabling fast uncertainty estimation: accelerating bayesian transformers via algorithmic and hardware optimizations","authors":"Hongxiang Fan, Martin Ferianc, W. Luk","doi":"10.1145/3489517.3530451","DOIUrl":"https://doi.org/10.1145/3489517.3530451","url":null,"abstract":"Quantifying the uncertainty of neural networks (NNs) has been required by many safety-critical applications such as autonomous driving or medical diagnosis. Recently, Bayesian transformers have demonstrated their capabilities in providing high-quality uncertainty estimates paired with excellent accuracy. However, their real-time deployment is limited by the compute-intensive attention mechanism that is core to the transformer architecture, and the repeated Monte Carlo sampling to quantify the predictive uncertainty. To address these limitations, this paper accelerates Bayesian transformers via both algorithmic and hardware optimizations. On the algorithmic level, an evolutionary algorithm (EA)-based framework is proposed to exploit the sparsity in Bayesian transformers and ease their computational workload. On the hardware level, we demonstrate that the sparsity brings hardware performance improvement on our optimized CPU and GPU implementations. An adaptable hardware architecture is also proposed to accelerate Bayesian transformers on an FPGA. Extensive experiments demonstrate that the EA-based framework, together with hardware optimizations, reduce the latency of Bayesian transformers by up to 13, 12 and 20 times on CPU, GPU and FPGA platforms respectively, while achieving higher algorithmic performance.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127530105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Computing-in-memory (CiM) is a promising technique for achieving high energy efficiency in data-intensive matrix-vector multiplication (MVM) by relieving the memory bottleneck. Unfortunately, due to limited SRAM capacity, existing SRAM-based CiM needs to reload weights from DRAM in large-scale networks, which significantly weakens energy efficiency. This work, for the first time, proposes the concept, design, and optimization of computing-in-ROM to achieve much higher on-chip memory capacity, and thus less DRAM access and lower energy consumption. Furthermore, to support different computing scenarios with varying weights, a weight fine-tuning technique, named Residual Branch (ReBranch), is also proposed. ReBranch combines ROM-CiM with assisting SRAM-CiM to achieve high versatility. YOLoC, a ReBranch-assisted ROM-CiM framework for object detection, is presented and evaluated. With the same area in 28nm CMOS, YOLoC shows significant energy efficiency improvements of 14.8x for YOLO (DarkNet-19) and 4.8x for ResNet-18 across several datasets, with <8% latency overhead and almost no mean average precision (mAP) loss (−0.5% ~ +0.2%), compared with fully SRAM-based CiM.
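The ReBranch idea, frozen weights in ROM plus a small trainable residual held in SRAM, can be sketched for a single linear layer as below. The low-rank split and sizes are illustrative assumptions, not the paper's circuit or training recipe.

```python
# ReBranch-style layer sketch: a large frozen weight matrix (stand-in for
# ROM-CiM) plus a small trainable residual branch (stand-in for SRAM-CiM).
import torch
import torch.nn as nn

class ReBranchLinear(nn.Module):
    def __init__(self, in_features, out_features, rank=4):
        super().__init__()
        self.rom_weight = nn.Parameter(torch.randn(out_features, in_features),
                                       requires_grad=False)   # fixed at "tape-out"
        # low-rank residual kept small so it could fit in assisting SRAM
        self.res_a = nn.Parameter(torch.zeros(out_features, rank))
        self.res_b = nn.Parameter(torch.randn(rank, in_features) * 0.01)

    def forward(self, x):
        w = self.rom_weight + self.res_a @ self.res_b   # effective weights
        return x @ w.T

layer = ReBranchLinear(128, 64)
y = layer(torch.randn(2, 128))
print(y.shape)                      # torch.Size([2, 64])
print(sum(p.numel() for p in layer.parameters() if p.requires_grad), "tunable params")
```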
{"title":"YOLoC","authors":"Yiming Chen, Guodong Yin, Zhanhong Tan, Ming-En Lee, Zekun Yang, Yongpan Liu, Huazhong Yang, Kaisheng Ma, Xueqing Li","doi":"10.1145/3489517.3530576","DOIUrl":"https://doi.org/10.1145/3489517.3530576","url":null,"abstract":"Computing-in-memory (CiM) is a promising technique to achieve high energy efficiency in data-intensive matrix-vector multiplication (MVM) by relieving the memory bottleneck. Unfortunately, due to the limited SRAM capacity, existing SRAM-based CiM needs to reload the weights from DRAM in large-scale networks. This undesired fact weakens the energy efficiency significantly. This work, for the first time, proposes the concept, design, and optimization of computing-in-ROM to achieve much higher on-chip memory capacity, and thus less DRAM access and lower energy consumption. Furthermore, to support different computing scenarios with varying weights, a weight fine-tune technique, namely Residual Branch (ReBranch), is also proposed. ReBranch combines ROM-CiM and assisting SRAM-CiM to achieve high versatility. YOLoC, a ReBranch-assisted ROM-CiM framework for object detection is presented and evaluated. With the same area in 28nm CMOS, YOLoC for several datasets has shown significant energy efficiency improvement by 14.8x for YOLO (DarkNet-19) and 4.8x for ResNet-18, with <8% latency overhead and almost no mean average precision (mAP) loss (−0.5% ~ +0.2%), compared with the fully SRAM-based CiM.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123252716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. S. Rahman, Rui Guo, Hadi Mardani Kamali, Fahim Rahman, Farimah Farahmandi, M. Abdel-Moneum, M. Tehranipoor
Existing logic locking techniques can prevent IP piracy and tampering. However, they often come at the expense of high overhead and are gradually becoming vulnerable to emerging deobfuscation attacks. To protect SoC IPs, we propose O'Clock, a fully automated clock-gating-based approach that 'locks the clock' to protect IPs in complex SoCs. O'Clock obstructs data/control flows and makes the underlying logic dysfunctional for incorrect keys by manipulating the activity factor of the clock tree. O'Clock requires minimal changes to the original design and no change to the IC design flow. Our experimental results show its high resiliency against state-of-the-art de-obfuscation attacks (e.g., oracle-guided SAT, unrolling-/BMC-based SAT, removal, and oracle-less machine learning-based attacks) at negligible power, performance, and area (PPA) overhead.
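A behavioral sketch of the locking idea: a register only captures new data when a key-derived clock enable is asserted, so a wrong key freezes the data/control flow. This is an illustrative Python model under assumed key logic, not the paper's gate-level implementation.

```python
# Behavioral model of key-controlled clock gating: with a wrong key the
# enable stays low and the pipeline register never updates.
CORRECT_KEY = 0b1011            # illustrative key value

def clock_enable(key_bits: int) -> bool:
    return key_bits == CORRECT_KEY

def run_pipeline(key_bits, samples):
    reg, outputs = 0, []
    for d in samples:
        if clock_enable(key_bits):   # gated clock edge
            reg = d + 1              # stand-in for the locked datapath logic
        outputs.append(reg)
    return outputs

print(run_pipeline(CORRECT_KEY, [1, 2, 3]))   # [2, 3, 4] - functional
print(run_pipeline(0b0000,      [1, 2, 3]))   # [0, 0, 0] - dysfunctional
```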
{"title":"O'clock: lock the clock via clock-gating for SoC IP protection","authors":"M. S. Rahman, Rui Guo, Hadi Mardani Kamali, Fahim Rahman, Farimah Farahmandi, M. Abdel-Moneum, M. Tehranipoor","doi":"10.1145/3489517.3530542","DOIUrl":"https://doi.org/10.1145/3489517.3530542","url":null,"abstract":"Existing logic locking techniques can prevent IP piracy or tampering. However, they often come at the expense of high overhead and are gradually becoming vulnerable to emerging deobfuscation attacks. To protect SoC IPs, we propose O'Clock, a fully-automated clock-gating-based approach that 'locks the clock' to protect IPs in complex SoCs. O'Clock obstructs data/control flows and makes the underlying logic dysfunctional for incorrect keys by manipulating the activity factor of the clock tree. O'Clock has minimal changes to the original design and no change to the IC design flow. Our experimental results show its high resiliency against state-of-the-art de-obfuscation attacks (e.g., oracle-guided SAT, unrolling-/BMC-based SAT, removal, and oracle-less machine learning-based attacks) at negligible power, performance, and area (PPA) overhead.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126936767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we present GATSPI, a novel GPU-accelerated logic gate simulator that enables ultra-fast power estimation for industry-sized ASIC designs with millions of gates. GATSPI is written in PyTorch with custom CUDA kernels for ease of coding and maintainability. It achieves simulation kernel speedups of up to 1668X on a single-GPU system and up to 7412X on a multi-GPU system compared to a commercial gate-level simulator running on a single CPU core. GATSPI supports a range of simple to complex cell types from an industry-standard cell library and SDF conditional delay statements without requiring prior calibration runs, and it produces industry-standard SAIF files from delay-aware gate-level simulation. Finally, we deploy GATSPI in a glitch-optimization flow, achieving a 1.4% power saving with a 449X speedup in turnaround time compared to a similar flow using a commercial simulator.
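GATSPI's code is not reproduced here, but the basic idea of evaluating many gates over many stimulus vectors as batched tensor operations in PyTorch can be sketched as follows. The netlist encoding and the toggle-count "switching activity" proxy are assumptions for illustration only, not GATSPI's delay-aware engine or SAIF output.

```python
# Sketch of gate-level simulation as batched tensor ops: evaluate a tiny
# netlist of 2-input gates over many random input vectors on GPU (if present)
# and count output toggles as a crude switching-activity proxy.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
NUM_PATTERNS = 1024
# netlist: each gate = (type, fanin0, fanin1) over primary-input indices 0..1
AND, OR, XOR = 0, 1, 2
gates = torch.tensor([[AND, 0, 1], [XOR, 0, 1], [OR, 0, 1]], device=device)

pi = torch.randint(0, 2, (NUM_PATTERNS, 2), device=device, dtype=torch.bool)
a, b = pi[:, gates[:, 1]], pi[:, gates[:, 2]]          # gather fanins per gate
out = torch.where(gates[:, 0] == AND, a & b,
      torch.where(gates[:, 0] == OR, a | b, a ^ b))    # (patterns, gates)

toggles = (out[1:] ^ out[:-1]).sum(dim=0)              # switching activity per gate
print(toggles.cpu().tolist())
```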
{"title":"GATSPI","authors":"Yanqing Zhang, Haoxing Ren, Akshay Sridharan, Brucek Khailany","doi":"10.1145/3489517.3530601","DOIUrl":"https://doi.org/10.1145/3489517.3530601","url":null,"abstract":"In this paper, we present GATSPI, a novel GPU accelerated logic gate simulator that enables ultra-fast power estimation for industry-sized ASIC designs with millions of gates. GATSPI is written in PyTorch with custom CUDA kernels for ease of coding and maintainability. It achieves simulation kernel speedup of up to 1668X on a single-GPU system and up to 7412X on a multiple-GPU system when compared to a commercial gate-level simulator running on a single CPU core. GATSPI supports a range of simple to complex cell types from an industry standard cell library and SDF conditional delay statements without requiring prior calibration runs and produces industry-standard SAIF files from delay-aware gate-level simulation. Finally, we deploy GATSPI in a glitch-optimization flow, achieving a 1.4% power saving with a 449X speedup in turnaround time compared to a similar flow using a commercial simulator.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114435420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qijing Wang, Bentian Jiang, Martin D. F. Wong, Evangeline F. Y. Young
Inverse lithography technology (ILT) is one of the most promising resolution enhancement techniques (RETs) for modern design-for-manufacturing closure; however, it suffers from huge computational overhead and unaffordable mask writing time. In this paper, we propose A2-ILT, a GPU-accelerated ILT framework with a spatial attention mechanism. Building on a previous GPU-accelerated ILT flow, we significantly improve ILT quality by introducing a spatial attention map and on-the-fly mask rectilinearization, and we strengthen robustness through reinforcement learning. Experimental results show that, compared to state-of-the-art solutions, A2-ILT achieves 5.06% and 11.60% reductions in printing error and process-variation band, respectively, with lower mask complexity and superior runtime performance.
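At its core, ILT is gradient-based optimization of a mask so that its simulated print matches the target, and a spatial attention map can weight the loss toward error-prone regions. The sketch below uses a Gaussian-like blur as a stand-in for the lithography model and a hand-made attention map around pattern edges; neither is the paper's actual flow or attention mechanism.

```python
# Toy ILT loop: optimize a continuous mask so its "printed" image (blur +
# sigmoid resist model) matches a target pattern, with a spatial attention
# map up-weighting the pattern boundary. The litho model is a stand-in.
import torch
import torch.nn.functional as F

target = torch.zeros(1, 1, 64, 64)
target[..., 24:40, 24:40] = 1.0                       # square target pattern

kernel = torch.ones(1, 1, 7, 7) / 49.0                # crude optical blur
def litho(mask):
    aerial = F.conv2d(torch.sigmoid(mask), kernel, padding=3)
    return torch.sigmoid(50 * (aerial - 0.5))         # resist threshold

edges = F.max_pool2d(target, 3, 1, 1) - target        # ring around the pattern
attention = 1.0 + 4.0 * edges                         # emphasize boundary pixels

mask = torch.zeros_like(target, requires_grad=True)
opt = torch.optim.Adam([mask], lr=0.5)
for _ in range(200):
    loss = (attention * (litho(mask) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final weighted error: {loss.item():.4f}")
```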
{"title":"A2-ILT: GPU accelerated ILT with spatial attention mechanism","authors":"Qijing Wang, Bentian Jiang, Martin D. F. Wong, Evangeline F. Y. Young","doi":"10.1145/3489517.3530579","DOIUrl":"https://doi.org/10.1145/3489517.3530579","url":null,"abstract":"Inverse lithography technology (ILT) is one of the promising resolution enhancement techniques (RETs) in modern design-for-manufacturing closure, however, it suffers from huge computational overhead and unaffordable mask writing time. In this paper, we propose A2-ILT, a GPU-accelerated ILT framework with spatial attention mechanism. Based on the previous GPU-accelerated ILT flow, we significantly improve the ILT quality by introducing spatial attention map and on-the-fly mask rectilinearization, and strengthen the robustness by Reinforcement-Learning deployment. Experimental results show that, comparing to the state-of-the-art solutions, A2-ILT achieves 5.06% and 11.60% reduction in printing error and process variation band with a lower mask complexity and superior runtime performance.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"196 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114437250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider automatic parallelization of a computational kernel executed according to the PRedictable Execution Model (PREM), where each thread is divided into execution and memory phases. We target a scratchpad-based architecture in which memory phases are executed by a dedicated DMA component. We employ data analysis and loop tiling to split the kernel execution into segments, and we schedule them based on a DAG representation of data and execution dependencies. Our main observation is that properly selecting tile sizes is key to optimizing the makespan of the kernel. We thus propose a heuristic that efficiently searches for optimized tile sizes and core assignments over deeply nested loops, and we demonstrate its applicability and performance compared to the state of the art in PREM compilation using the PolyBench-NN benchmark suite.
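A minimal sketch of the kind of tile-size trade-off such a heuristic explores: larger tiles amortize per-segment DMA setup overhead, while smaller tiles let each segment's memory phase hide better under the previous segment's execution phase. The cost model and constants below are made-up illustrations, not the paper's analysis.

```python
# Toy tile-size search: pick the tile size that minimizes an estimated
# makespan when each segment's DMA (memory phase) can overlap the previous
# segment's compute phase, PREM-style. Constants are illustrative only.
def makespan(total_iters, tile, dma_per_iter=1.0, dma_setup=20.0, compute_per_iter=2.0):
    segments = -(-total_iters // tile)              # ceiling division
    dma = dma_setup + dma_per_iter * tile           # memory phase per segment
    compute = compute_per_iter * tile               # execution phase per segment
    # the first DMA is fully exposed; afterwards DMA hides under compute,
    # except for any part that exceeds the compute phase it overlaps with
    return dma + segments * compute + max(0.0, dma - compute) * (segments - 1)

total_iters = 4096
best = min(range(16, 1025, 16), key=lambda t: makespan(total_iters, t))
print(f"best tile = {best}, makespan = {makespan(total_iters, best):.0f}")
```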
{"title":"Optimizing parallel PREM compilation over nested loop structures","authors":"Zhao Gu, R. Pellizzoni","doi":"10.1145/3489517.3530610","DOIUrl":"https://doi.org/10.1145/3489517.3530610","url":null,"abstract":"We consider automatic parallelization of a computational kernel executed according to the PRedictable Execution Model (PREM), where each thread is divided into execution and memory phases. We target a scratchpad-based architecture, where memory phases are executed by a dedicated DMA component. We employ data analysis and loop tiling to split the kernel execution into segments, and schedule them based on a DAG representation of data and execution dependencies. Our main observation is that properly selecting tile sizes is key to optimize the makespan of the kernel. We thus propose a heuristic that efficiently searches for optimized tile size and core assignments over deeply nested loops, and demonstrate its applicability and performance compared to the state-of-the-art in PREM compilation using the PolyBench-NN benchmark suite.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128730530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fully Homomorphic Encryption over the Torus (TFHE) allows arbitrary computations to be carried out directly on ciphertexts using homomorphic logic gates. However, each TFHE gate is extremely slow (> 0.2 ms) on state-of-the-art hardware platforms such as GPUs and FPGAs. Moreover, even the latest FPGA-based TFHE accelerator cannot achieve high energy efficiency, since it frequently invokes expensive double-precision floating-point FFT and IFFT kernels. In this paper, we propose a fast and energy-efficient accelerator, MATCHA, to process TFHE gates. MATCHA supports aggressive bootstrapping key unrolling to accelerate TFHE gates without decryption errors, using approximate multiplication-less integer FFTs and IFFTs and a pipelined datapath. Compared to prior accelerators, MATCHA improves TFHE gate processing throughput by 2.3x and throughput per Watt by 6.3x.
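The FFT/IFFT kernels in question are used for negacyclic polynomial multiplication inside TFHE's bootstrapping. A minimal NumPy sketch of that operation (zero-padded FFT followed by reduction modulo X^N + 1) is shown below; it uses ordinary double-precision FFTs, not the paper's approximate multiplication-less integer transforms, and is only meant to show what the accelerated kernel computes.

```python
# Negacyclic polynomial multiplication (mod X^n + 1), the core kernel behind
# TFHE's external products: linear convolution via zero-padded FFTs, then a
# fold with a sign flip. Standard floating-point FFT, not MATCHA's transform.
import numpy as np

def negacyclic_mul(a, b):
    n = len(a)
    fa = np.fft.rfft(a, 2 * n)
    fb = np.fft.rfft(b, 2 * n)
    full = np.rint(np.fft.irfft(fa * fb, 2 * n)).astype(np.int64)  # linear convolution
    return full[:n] - full[n:]           # wrap-around terms pick up a minus sign

n = 8
rng = np.random.default_rng(0)
a = rng.integers(-10, 10, n)
b = rng.integers(-10, 10, n)

# check against the schoolbook definition of reduction modulo X^n + 1
ref = np.zeros(n, dtype=np.int64)
for i in range(n):
    for j in range(n):
        k = (i + j) % n
        ref[k] += a[i] * b[j] * (-1 if i + j >= n else 1)
print(np.array_equal(negacyclic_mul(a, b), ref))   # True
```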
{"title":"MATCHA","authors":"Lei Jiang, Qian Lou, Nrushad Joshi","doi":"10.1145/3489517.3530435","DOIUrl":"https://doi.org/10.1145/3489517.3530435","url":null,"abstract":"Fully Homomorphic Encryption over the Torus (TFHE) allows arbitrary computations to happen directly on ciphertexts using homomorphic logic gates. However, each TFHE gate on state-of-the-art hardware platforms such as GPUs and FPGAs is extremely slow (> 0.2ms). Moreover, even the latest FPGA-based TFHE accelerator cannot achieve high energy efficiency, since it frequently invokes expensive double-precision floating point FFT and IFFT kernels. In this paper, we propose a fast and energy-efficient accelerator, MATCHA, to process TFHE gates. MATCHA supports aggressive bootstrapping key unrolling to accelerate TFHE gates without decryption errors by approximate multiplication-less integer FFTs and IFFTs, and a pipelined datapath. Compared to prior accelerators, MATCHA improves the TFHE gate processing throughput by 2.3x, and the throughput per Watt by 6.3x.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114622566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}