
Latest publications from Proceedings of the 59th ACM/IEEE Design Automation Conference

High-level design methods for hardware security: is it the right choice? (invited)
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530635
C. Pilato, D. Sciuto, Benjamin Tan, S. Garg, R. Karri
Due to the globalization of the electronics supply chain, hardware engineers are increasingly interested in modifying their chip designs to protect their intellectual property (IP) or the privacy of the final users. However, the integration of state-of-the-art solutions for hardware and hardware-assisted security is not fully automated, requiring amendments to stable tools and industrial toolchains. This significantly limits their application in industrial designs, potentially affecting the security of the resulting chips. We discuss how existing solutions can be adapted to implement security features at higher levels of abstraction (during high-level synthesis or directly at the register-transfer level) and complement current industrial design and verification flows. Our modular framework allows designers to compose these solutions and create additional protection layers.
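The framework's central idea, composing independent protection solutions into layered defenses, can be sketched as a pass pipeline over a design. The `Design` type and the pass names below are illustrative assumptions, not the authors' actual API; a minimal sketch:

```python
# Sketch of composable protection passes in an HLS/RTL flow.
# Design, lock_fsm, and obfuscate_constants are hypothetical names.
from dataclasses import dataclass, field

@dataclass
class Design:
    rtl: str
    notes: list = field(default_factory=list)

def lock_fsm(d: Design) -> Design:
    d.notes.append("FSM locked with key-dependent transitions")
    return d

def obfuscate_constants(d: Design) -> Design:
    d.notes.append("IP-revealing constants moved behind a key")
    return d

def compose(design: Design, passes) -> Design:
    """Apply protection passes in order, each layering on the last."""
    for p in passes:
        design = p(design)
    return design

protected = compose(Design(rtl="module top(...);"), [lock_fsm, obfuscate_constants])
```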
Citations: 0
H2H: heterogeneous model to heterogeneous system mapping with computation and communication awareness
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530509
Xinyi Zhang, Cong Hao, P. Zhou, A. Jones, Jingtong Hu
The complex nature of real-world problems calls for heterogeneity in both machine learning (ML) models and hardware systems. The heterogeneity in ML models comes from multi-sensor perception and multi-task learning, i.e., multi-modality multi-task (MMMT) workloads, resulting in diverse deep neural network (DNN) layers and computation patterns. The heterogeneity in systems comes from diverse processing components, as integrating multiple dedicated accelerators into one system has become the prevailing approach. Therefore, a new problem emerges: heterogeneous model to heterogeneous system mapping (H2H). While previous mapping algorithms mostly focus on efficient computation, in this work, we argue that computation and communication must be considered simultaneously for better system efficiency. We propose a novel H2H mapping algorithm with both computation and communication awareness; by slightly trading computation for communication, the system's overall latency and energy consumption can be largely reduced. The superior performance of our work is evaluated based on MAESTRO modeling, demonstrating 15%-74% latency reduction and 23%-64% energy reduction compared with existing computation-prioritized mapping algorithms. Code is publicly available at https://github.com/xyzxinyizhang/H2H.
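To illustrate the computation/communication trade-off, here is a greedy mapping sketch under a toy cost model. The cost dictionaries and the chain-structured layer order are assumptions for illustration; the actual H2H algorithm operates on a DAG with MAESTRO-derived costs:

```python
# Toy computation-plus-communication-aware mapping of DNN layers to accelerators.
def map_layers(layers, accels, compute_cost, comm_cost):
    """Assign each layer to the accelerator minimizing compute time plus the
    cost of moving its input from where the previous layer was placed."""
    placement, prev = {}, None
    for layer in layers:
        best = min(
            accels,
            key=lambda a: compute_cost[layer][a]
            + (comm_cost[prev][a] if prev is not None else 0.0),
        )
        placement[layer] = best
        prev = best
    return placement

# Two layers, two accelerators: fc1 tolerates a transfer to reach faster compute.
compute = {"conv1": {"A0": 5.0, "A1": 9.0}, "fc1": {"A0": 8.0, "A1": 4.0}}
comm = {"A0": {"A0": 0.0, "A1": 3.0}, "A1": {"A0": 3.0, "A1": 0.0}}
print(map_layers(["conv1", "fc1"], ["A0", "A1"], compute, comm))
```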
Citations: 10
Enabling fast uncertainty estimation: accelerating Bayesian transformers via algorithmic and hardware optimizations
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530451
Hongxiang Fan, Martin Ferianc, W. Luk
Quantifying the uncertainty of neural networks (NNs) is required by many safety-critical applications such as autonomous driving and medical diagnosis. Recently, Bayesian transformers have demonstrated their capability to provide high-quality uncertainty estimates paired with excellent accuracy. However, their real-time deployment is limited by the compute-intensive attention mechanism at the core of the transformer architecture and by the repeated Monte Carlo sampling needed to quantify the predictive uncertainty. To address these limitations, this paper accelerates Bayesian transformers via both algorithmic and hardware optimizations. On the algorithmic level, an evolutionary algorithm (EA)-based framework is proposed to exploit the sparsity in Bayesian transformers and ease their computational workload. On the hardware level, we demonstrate that this sparsity brings performance improvements on our optimized CPU and GPU implementations. An adaptable hardware architecture is also proposed to accelerate Bayesian transformers on an FPGA. Extensive experiments demonstrate that the EA-based framework, together with the hardware optimizations, reduces the latency of Bayesian transformers by up to 13, 12 and 20 times on CPU, GPU and FPGA platforms respectively, while achieving higher algorithmic performance.
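The repeated Monte Carlo sampling being accelerated can be sketched with MC dropout, a common Bayesian approximation used here as a stand-in; the paper's specific Bayesian transformer is not reproduced:

```python
# Minimal MC-dropout sketch: multiple stochastic forward passes yield a
# prediction (mean) and an uncertainty estimate (spread across passes).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(),
                      nn.Dropout(0.1), nn.Linear(32, 4))

def mc_predict(model: nn.Module, x: torch.Tensor, num_samples: int = 8):
    """Keep dropout stochastic at inference and average over passes."""
    model.train()  # leaves dropout active
    with torch.no_grad():
        samples = torch.stack([model(x).softmax(-1) for _ in range(num_samples)])
    return samples.mean(0), samples.std(0)

mean, std = mc_predict(model, torch.randn(2, 16))
```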
Citations: 0
YOLoC
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530576
Yiming Chen, Guodong Yin, Zhanhong Tan, Ming-En Lee, Zekun Yang, Yongpan Liu, Huazhong Yang, Kaisheng Ma, Xueqing Li
Computing-in-memory (CiM) is a promising technique for achieving high energy efficiency in data-intensive matrix-vector multiplication (MVM) by relieving the memory bottleneck. Unfortunately, due to limited SRAM capacity, existing SRAM-based CiM needs to reload weights from DRAM for large-scale networks, which significantly weakens the energy efficiency. This work, for the first time, proposes the concept, design, and optimization of computing-in-ROM to achieve much higher on-chip memory capacity, and thus less DRAM access and lower energy consumption. Furthermore, to support different computing scenarios with varying weights, a weight fine-tuning technique, Residual Branch (ReBranch), is also proposed. ReBranch combines ROM-CiM with assisting SRAM-CiM to achieve high versatility. YOLoC, a ReBranch-assisted ROM-CiM framework for object detection, is presented and evaluated. With the same area in 28nm CMOS, YOLoC shows significant energy efficiency improvements across several datasets: 14.8x for YOLO (DarkNet-19) and 4.8x for ResNet-18, with <8% latency overhead and almost no mean average precision (mAP) loss (−0.5% ~ +0.2%) compared with fully SRAM-based CiM.
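The ReBranch idea, a large frozen weight matrix baked into ROM-CiM plus a small tunable residual held in SRAM-CiM, can be sketched as follows. Names, shapes, and the plain-NumPy MVM are illustrative assumptions, not the paper's circuit-level implementation:

```python
# Sketch: effective weight = frozen ROM weights + fine-tunable SRAM residual.
import numpy as np

rng = np.random.default_rng(0)
w_rom = rng.standard_normal((64, 64))  # frozen weights in ROM (never rewritten)
w_res = np.zeros((64, 64))             # fine-tunable residual branch in SRAM

def rebranch_mvm(x: np.ndarray) -> np.ndarray:
    """MVM through both branches; adapting to a new scenario updates only w_res."""
    return x @ (w_rom + w_res)

y = rebranch_mvm(rng.standard_normal((8, 64)))
```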
Citations: 6
O'Clock: lock the clock via clock-gating for SoC IP protection
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530542
M. S. Rahman, Rui Guo, Hadi Mardani Kamali, Fahim Rahman, Farimah Farahmandi, M. Abdel-Moneum, M. Tehranipoor
Existing logic locking techniques can prevent IP piracy or tampering. However, they often come at the expense of high overhead and are gradually becoming vulnerable to emerging deobfuscation attacks. To protect SoC IPs, we propose O'Clock, a fully automated clock-gating-based approach that 'locks the clock' to protect IPs in complex SoCs. O'Clock obstructs data/control flows and makes the underlying logic dysfunctional under incorrect keys by manipulating the activity factor of the clock tree. O'Clock requires minimal changes to the original design and no change to the IC design flow. Our experimental results show its high resiliency against state-of-the-art de-obfuscation attacks (e.g., oracle-guided SAT, unrolling-/BMC-based SAT, removal, and oracle-less machine-learning-based attacks) at negligible power, performance, and area (PPA) overhead.
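A behavioral sketch of the 'lock the clock' idea: the clock-gate enable is made key-dependent, so a wrong key silences the clock and freezes downstream sequential logic. This illustrates the concept only, not O'Clock's automated netlist transformation:

```python
# Key-dependent clock gate: a hypothetical illustration of clock locking.
CORRECT_KEY = 0b1011  # secret key (illustrative value)

def gated_clock(clk: bool, enable: bool, key_bits: int) -> bool:
    """Propagate the clock only when the functional enable is high AND the
    supplied key matches; a wrong key keeps this clock-tree branch idle."""
    return clk and enable and (key_bits == CORRECT_KEY)

state = 0
for _ in range(4):
    if gated_clock(clk=True, enable=True, key_bits=0b1011):
        state += 1  # the register advances only on ungated clock edges
print(state)  # 4 with the right key; stays 0 with any wrong key
```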
Citations: 11
GATSPI
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530601
Yanqing Zhang, Haoxing Ren, Akshay Sridharan, Brucek Khailany
In this paper, we present GATSPI, a novel GPU-accelerated logic gate simulator that enables ultra-fast power estimation for industry-sized ASIC designs with millions of gates. GATSPI is written in PyTorch with custom CUDA kernels for ease of coding and maintainability. It achieves simulation kernel speedup of up to 1668X on a single-GPU system and up to 7412X on a multiple-GPU system when compared to a commercial gate-level simulator running on a single CPU core. GATSPI supports a range of simple to complex cell types from an industry standard cell library and SDF conditional delay statements without requiring prior calibration runs, and produces industry-standard SAIF files from delay-aware gate-level simulation. Finally, we deploy GATSPI in a glitch-optimization flow, achieving a 1.4% power saving with a 449X speedup in turnaround time compared to a similar flow using a commercial simulator.
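Evaluating gates as Boolean tensor operations over many stimulus patterns at once is what makes a PyTorch formulation natural; a minimal sketch in that spirit (the real simulator adds custom CUDA kernels, SDF delay handling, and SAIF output, none of which is shown):

```python
# Gate evaluation as parallel Boolean tensor ops in PyTorch.
import torch

# One Boolean value per stimulus pattern, simulated for all patterns at once.
a = torch.randint(0, 2, (1024,)).bool()
b = torch.randint(0, 2, (1024,)).bool()

and_out = a & b        # AND gate across all 1024 patterns in parallel
nand_out = ~(a & b)    # NAND gate

# Toggle counts between consecutive patterns are the raw material for
# switching-activity-based power estimation.
toggles = (and_out[1:] ^ and_out[:-1]).sum()
print(int(toggles))
```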
Citations: 6
A2-ILT: GPU-accelerated ILT with spatial attention mechanism
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530579
Qijing Wang, Bentian Jiang, Martin D. F. Wong, Evangeline F. Y. Young
Inverse lithography technology (ILT) is one of the most promising resolution enhancement techniques (RETs) for modern design-for-manufacturing closure; however, it suffers from huge computational overhead and unaffordable mask-writing time. In this paper, we propose A2-ILT, a GPU-accelerated ILT framework with a spatial attention mechanism. Building on a previous GPU-accelerated ILT flow, we significantly improve ILT quality by introducing a spatial attention map and on-the-fly mask rectilinearization, and strengthen robustness through reinforcement-learning-based deployment. Experimental results show that, compared to the state-of-the-art solutions, A2-ILT achieves 5.06% and 11.60% reductions in printing error and process-variation band respectively, with lower mask complexity and superior runtime performance.
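The spatial-attention idea can be sketched as re-weighting the pixel-wise gradient of a differentiable lithography loss. The proxy litho model and the random attention map below are placeholders, not A2-ILT's actual components:

```python
# Attention-weighted gradient update on a continuous mask variable.
import torch

mask = torch.rand(256, 256, requires_grad=True)   # continuous mask variable
target = (torch.rand(256, 256) > 0.5).float()     # desired wafer pattern

def litho_proxy(m: torch.Tensor) -> torch.Tensor:
    # Stand-in for a differentiable lithography model (illustrative only).
    return torch.sigmoid(4 * (m - 0.5))

loss = ((litho_proxy(mask) - target) ** 2).sum()
loss.backward()

attention = torch.rand(256, 256)                  # placeholder attention map
with torch.no_grad():
    mask -= 0.1 * attention * mask.grad           # attention re-weights the step
```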
Citations: 5
CarM
Pub Date : 2022-07-10 DOI: 10.1163/2330-4804_eiro_com_7474
Soobee Lee, Minindu Weerakoon, Jong-Sung Choi, Minjia Zhang, Di Wang, Myeongjae Jeon
Citations: 0
Optimizing parallel PREM compilation over nested loop structures
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530610
Zhao Gu, R. Pellizzoni
We consider automatic parallelization of a computational kernel executed according to the PRedictable Execution Model (PREM), where each thread is divided into execution and memory phases. We target a scratchpad-based architecture in which memory phases are executed by a dedicated DMA component. We employ data analysis and loop tiling to split the kernel execution into segments, and schedule them based on a DAG representation of data and execution dependencies. Our main observation is that properly selecting tile sizes is key to optimizing the makespan of the kernel. We thus propose a heuristic that efficiently searches for optimized tile sizes and core assignments over deeply nested loops, and demonstrate its applicability and performance against the state-of-the-art in PREM compilation using the PolyBench-NN benchmark suite.
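The tile-size trade-off can be illustrated with a toy two-phase pipeline model where each tile's DMA memory phase overlaps the previous tile's execution phase. The per-iteration costs and DMA setup constant are made-up numbers, and the paper's DAG-based scheduler is far richer than this model:

```python
# Toy tile-size search: small tiles pay DMA setup repeatedly, large tiles
# lose overlap; the search picks the tile minimizing the modeled makespan.
def makespan(n_iters: int, tile: int, mem_per_iter: float,
             exec_per_iter: float, dma_setup: float = 5.0) -> float:
    n_tiles = -(-n_iters // tile)            # ceil(n_iters / tile)
    mem = dma_setup + tile * mem_per_iter    # memory phase per tile (DMA)
    exe = tile * exec_per_iter               # execution phase per tile
    # Two-stage pipeline: first load is exposed, then the slower phase dominates.
    return mem + exe + (n_tiles - 1) * max(mem, exe)

best = min(range(1, 129), key=lambda t: makespan(1024, t, 0.2, 1.0))
print(best, makespan(1024, best, 0.2, 1.0))
```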
Citations: 0
MATCHA
Pub Date : 2022-07-10 DOI: 10.1145/3489517.3530435
Lei Jiang, Qian Lou, Nrushad Joshi
Fully Homomorphic Encryption over the Torus (TFHE) allows arbitrary computations to happen directly on ciphertexts using homomorphic logic gates. However, each TFHE gate on state-of-the-art hardware platforms such as GPUs and FPGAs is extremely slow (> 0.2ms). Moreover, even the latest FPGA-based TFHE accelerator cannot achieve high energy efficiency, since it frequently invokes expensive double-precision floating-point FFT and IFFT kernels. In this paper, we propose a fast and energy-efficient accelerator, MATCHA, to process TFHE gates. MATCHA supports aggressive bootstrapping key unrolling to accelerate TFHE gates without decryption errors, approximate multiplication-less integer FFTs and IFFTs, and a pipelined datapath. Compared to prior accelerators, MATCHA improves TFHE gate processing throughput by 2.3x, and throughput per Watt by 6.3x.
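The kernel MATCHA targets is the (I)FFT-based polynomial multiplication inside TFHE bootstrapping. Below is a double-precision reference sketch of negacyclic multiplication mod X^N + 1; MATCHA's contribution is replacing exactly this kind of kernel with approximate multiplication-less integer FFTs, an approximation not reproduced here:

```python
# Reference negacyclic polynomial multiplication via zero-padded FFT.
import numpy as np

def poly_mul_negacyclic(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Multiply polynomials mod X^N + 1: coefficients of degree >= N
    wrap around with a sign flip."""
    n = len(a)
    fa = np.fft.fft(a, 2 * n)
    fb = np.fft.fft(b, 2 * n)
    full = np.rint(np.fft.ifft(fa * fb).real).astype(np.int64)
    return full[:n] - full[n:2 * n]  # fold high-degree terms back, negated

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])
print(poly_mul_negacyclic(a, b))  # [-56 -36   2  60]
```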
Citations: 19