Shailja Pandey, Lokesh Siddhu, Preeti Ranjan Panda
Deep neural network (DNN) implementations are typically characterized by huge data sets and concurrent computation, resulting in a demand for high memory bandwidth due to intensive data movement between processors and off-chip memory. Performing DNN inference on general-purpose cores at the edge is gaining traction as a way to enhance user experience and reduce latency. The mismatch between CPU and conventional DRAM speeds leads to underutilization of the compute capabilities, increasing inference time. 3D DRAM is a promising solution to effectively fulfill the bandwidth requirement of high-throughput DNNs. However, due to the high power density of stacked architectures, 3D DRAMs need dynamic thermal management (DTM), which incurs performance overhead through memory-induced CPU throttling. We study the thermal impact of DNN applications running on a 3D DRAM system, and make a case for a memory temperature-aware customized prefetch mechanism to reduce DTM overheads and significantly improve performance. In our proposed NeuroCool DTM policy, we intelligently place either DRAM ranks or tiers in a low-power state, using the DNN layer characteristics and access rate. We establish the generalization of our approach through training and test data sets comprising diverse data points from widely used DNN applications. Experimental results on popular DNNs show that NeuroCool achieves an average performance gain of 44% (as high as 52%) and a memory energy improvement of 43% (as high as 69%) over general-purpose DTM policies.
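A minimal, illustrative sketch (not the paper's implementation) of the idea behind such a policy: per DNN layer, decide which 3D-DRAM tiers to place in a low-power state and how aggressively to prefetch, based on temperature and access rate. All thresholds, layer statistics, and the Tier class are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    tier_id: int
    temperature_c: float      # current sensor reading
    accesses_per_us: float    # observed access rate for the running layer

THROTTLE_TEMP_C = 85.0        # hypothetical DTM trigger temperature
IDLE_ACCESS_RATE = 50.0       # below this rate a tier is a low-power candidate

def plan_layer(tiers, layer_reuse_distance):
    """Return (tiers_to_power_down, prefetch_degree) for the next layer."""
    # Tiers that are hot but lightly used are put into a low-power state,
    # cooling the stack before DTM has to throttle the CPU.
    power_down = [t.tier_id for t in tiers
                  if t.temperature_c > THROTTLE_TEMP_C - 5.0
                  and t.accesses_per_us < IDLE_ACCESS_RATE]

    # Prefetch more aggressively when the layer streams data with little reuse
    # (e.g., large convolution weights), and back off when the stack is hot.
    hottest = max(t.temperature_c for t in tiers)
    if hottest > THROTTLE_TEMP_C:
        prefetch_degree = 1                     # stay conservative under DTM
    elif layer_reuse_distance > 4096:
        prefetch_degree = 8                     # streaming layer: deep prefetch
    else:
        prefetch_degree = 2
    return power_down, prefetch_degree

if __name__ == "__main__":
    tiers = [Tier(0, 83.0, 20.0), Tier(1, 78.0, 900.0), Tier(2, 88.0, 10.0)]
    print(plan_layer(tiers, layer_reuse_distance=8192))
```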
{"title":"NeuroCool: Dynamic Thermal Management of 3D DRAM for Deep Neural Networks through Customized Prefetching","authors":"Shailja Pandey, Lokesh Siddhu, Preeti Ranjan Panda","doi":"10.1145/3630012","DOIUrl":"https://doi.org/10.1145/3630012","url":null,"abstract":"Deep neural network (DNN) implementations are typically characterized by huge data sets and concurrent computation, resulting in a demand for high memory bandwidth due to intensive data movement between processors and off-chip memory. Performing DNN inference on general-purpose cores/ edge is gaining attraction to enhance user experience and reduce latency. The mismatch in the CPU and conventional DRAM speed leads to under utilization of the compute capabilities, causing increased inference time. 3D DRAM is a promising solution to effectively fulfill the bandwidth requirement of high-throughput DNNs. However, due to high power density in stacked architectures, 3D DRAMs need dynamic thermal management (DTM), resulting in performance overhead due to memory-induced CPU throttling. We study the thermal impact of DNN applications running on a 3D DRAM system, and make a case for a memory temperature-aware customized prefetch mechanism to reduce DTM overheads and significantly improve performance. In our proposed NeuroCool DTM policy, we intelligently place either DRAM ranks or tiers in low power state, using the DNN layer characteristics and access rate. We establish the generalization of our approach through training and test data sets comprising diverse data points from widely used DNN applications. Experimental results on popular DNNs show that NeuroCool results in a average performance gain of 44% (as high as 52%) and memory energy improvement of 43% (as high as 69%) over general-purpose DTM policies.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135366784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Monzurul Islam Dewan, Sheng-En David Lin, Dae Hyun Kim
Monolithic three-dimensional (3D) integration allows ultra-thin silicon tiers to be stacked in a single package. This high-density stacking is gaining interest and popularity because it offers a smaller footprint area, shorter wirelength, higher performance, and lower power consumption than conventional planar fabrication technologies. The physical design of monolithic 3D (M3D) integrated circuits (ICs) requires several design steps, such as 3D placement, 3D clock-tree synthesis, 3D routing, and 3D optimization. Among these, 3D routing is significantly time-consuming due to countless routing blockages. Therefore, 3D routers proposed in the literature insert monolithic inter-layer vias (MIVs) and perform tier-by-tier routing in two sub-steps. In this paper, we propose an algorithm to build a routing topology database (DB) used to construct all multilayer monolithic rectilinear Steiner minimum trees (MMRSMTs) on the 3D Hanan grid. To demonstrate the effectiveness of the DB in various applications, we use it to construct timing-driven 3D routing topologies and perform congestion-aware global routing on 3D designs. We anticipate that the algorithm and the DB will help 3D routers reduce the runtime of the MIV insertion step and improve the quality of 3D routing.
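A small sketch of the 3D Hanan grid construction that underlies MMRSMT enumeration: candidate Steiner points are all combinations of the pins' distinct x, y, and tier (z) coordinates. The database construction and tree enumeration in the paper go well beyond this; the pin list below is made up.

```python
from itertools import product

def hanan_grid_3d(pins):
    """pins: iterable of (x, y, z) terminals -> set of candidate grid points."""
    xs = sorted({x for x, _, _ in pins})
    ys = sorted({y for _, y, _ in pins})
    zs = sorted({z for _, _, z in pins})
    return set(product(xs, ys, zs))

if __name__ == "__main__":
    pins = [(0, 0, 0), (4, 1, 1), (2, 3, 0)]   # hypothetical 3-pin net on 2 tiers
    grid = hanan_grid_3d(pins)
    print(len(grid), "candidate points")        # 3 * 3 * 2 = 18
    steiner_candidates = grid - set(pins)       # non-terminal grid points
```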
{"title":"Construction of All Multilayer Monolithic RSMTs and Its Application to Monolithic 3D IC Routing","authors":"Monzurul Islam Dewan, Sheng-En David Lin, Dae Hyun Kim","doi":"10.1145/3626958","DOIUrl":"https://doi.org/10.1145/3626958","url":null,"abstract":"Monolithic three-dimensional (3D) integration allows ultra-thin silicon tier stacking in a single package. The high-density stacking is acquiring interest and is becoming more popular for smaller footprint areas, shorter wirelength, higher performance, and lower power consumption than the conventional planar fabrication technologies. The physical design of monolithic 3D (M3D) integrated circuits (ICs) requires several design steps such as 3D placement, 3D clock-tree synthesis, 3D routing, and 3D optimization. Among these, 3D routing is significantly time-consuming due to countless routing blockages. Therefore, 3D routers proposed in the literature insert monolithic inter-layer vias (MIVs) and perform tier-by-tier routing in two sub-steps. In this paper, we propose an algorithm to build a routing topology database (DB) used to construct all multilayer monolithic rectilinear Steiner minimum trees (MMRSMTs) on the 3D Hanan grid. To demonstrate the effectiveness of the DB in various applications, we use the DB to construct timing-driven 3D routing topologies and perform congestion-aware global routing on 3D designs. We anticipate that the algorithm and the DB will help 3D routers reduce the runtime of the MIV insertion step and improve the quality of the 3D routing.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"254 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136211233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vidya A. Chhabria, Wenjing Jiang, Andrew B. Kahng, Sachin S. Sapatnekar
Due to the unavailability of routing information in design stages prior to detailed routing (DR), the tasks of timing prediction and optimization pose major challenges. Inaccurate timing prediction wastes design effort, hurts circuit performance, and may lead to design failure. This work focuses on timing prediction after clock tree synthesis and placement legalization, which is the earliest opportunity to time and optimize a “complete” netlist. The paper first documents that having “oracle knowledge” of the final post-DR parasitics enables post-global-routing (GR) optimization to produce improved final timing outcomes. To bridge the gap between GR-based parasitic and timing estimation and post-DR results during post-GR optimization, machine learning (ML)-based models are proposed, including the use of features for macro blockages to obtain accurate predictions for designs with macros. A set of experimental evaluations demonstrates that these models show higher accuracy than GR-based timing estimation. When used during post-GR optimization, the ML-based models show demonstrable improvements in post-DR circuit performance. The methodology is applied to two different tool flows – OpenROAD and a commercial tool flow – and results on an open-source 45nm bulk and a commercial 12nm FinFET enablement show improvements in post-DR timing slack metrics without increasing congestion. The models are demonstrated to generalize to designs generated under different clock period constraints and to be robust to training data with small levels of noise.
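A hedged sketch of the kind of model the paper describes: learn to correct GR-based timing estimates toward post-DR values using features that include macro-blockage information. The feature names and the training data here are synthetic placeholders, not the paper's feature set or model choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
# Hypothetical per-net features: GR wirelength, GR-estimated delay, layer mix,
# and the fraction of the net's bounding box covered by macro blockages.
X = np.column_stack([
    rng.uniform(1, 500, n),        # GR wirelength (um)
    rng.uniform(1, 200, n),        # GR-estimated delay (ps)
    rng.uniform(0, 1, n),          # fraction routed on upper metal layers
    rng.uniform(0, 1, n),          # macro-blockage overlap fraction
])
# Synthetic "post-DR" delay: detours around macros inflate the GR estimate.
y = X[:, 1] * (1.0 + 0.6 * X[:, 3]) + rng.normal(0, 2, n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print("predicted post-DR delay (ps):", model.predict(X[:1])[0])
```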
{"title":"A Machine Learning Approach to Improving Timing Consistency between Global Route and Detailed Route","authors":"Vidya A. Chhabria, Wenjing Jiang, Andrew B. Kahng, Sachin S. Sapatnekar","doi":"10.1145/3626959","DOIUrl":"https://doi.org/10.1145/3626959","url":null,"abstract":"Due to the unavailability of routing information in design stages prior to detailed routing (DR), the tasks of timing prediction and optimization pose major challenges. Inaccurate timing prediction wastes design effort, hurts circuit performance, and may lead to design failure. This work focuses on timing prediction after clock tree synthesis and placement legalization, which is the earliest opportunity to time and optimize a “complete” netlist. The paper first documents that having “oracle knowledge” of the final post-DR parasitics enables post-global routing (GR) optimization to produce improved final timing outcomes. To bridge the gap between GR-based parasitic and timing estimation and post-DR results during post-GR optimization , machine learning (ML)-based models are proposed, including the use of features for macro blockages for accurate predictions for designs with macros. Based on a set of experimental evaluations, it is demonstrated that these models show higher accuracy than GR-based timing estimation. When used during post-GR optimization, the ML-based models show demonstrable improvements in post-DR circuit performance. The methodology is applied to two different tool flows – OpenROAD and a commercial tool flow – and results on an open-source 45nm bulk and a commercial 12nm FinFET enablement show improvements in post-DR timing slack metrics without increasing congestion. The models are demonstrated to be generalizable to designs generated under different clock period constraints and are robust to training data with small levels of noise.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136353525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nanlin Guo, Fulin Peng, Jiahe Shi, Fan Yang, Jun Tao, Xuan Zeng
The reliability of circuits is significantly affected by process variations during manufacturing and environmental variations during operation. Current yield optimization algorithms take process variations into consideration to improve circuit reliability. However, the influence of environmental variations (e.g., voltage and temperature variations) is often ignored in current methods because of the high computational cost. In this paper, a novel and efficient approach named BNN-BYO is proposed to optimize the yield of analog circuits over multiple environmental corners. First, we use a Bayesian Neural Network (BNN) to efficiently model the yields and POIs in multiple corners simultaneously. Next, multi-corner yield optimization is performed by embedding the BNN into a Bayesian optimization framework. Since the correlation among yields and POIs in different corners is implicitly encoded in the BNN model, it provides strong modeling capability for yields and their uncertainties, improving the efficiency of yield optimization. Our experimental results demonstrate that the proposed method can save up to 45.3% of simulation cost compared to other baseline methods to achieve the same target yield. In addition, for the same simulation cost, our proposed method can find better design points with 3.2% yield improvement.
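A minimal sketch of the selection step in a multi-corner, model-based yield optimization loop: a surrogate (a BNN in the paper; any model giving a mean and an uncertainty per corner works for the sketch) scores candidate design points, and the next simulation is spent where the optimistic estimate of the worst-corner yield is highest. The surrogate outputs and corner counts below are fabricated.

```python
import numpy as np

def acquisition(mean_yield, std_yield, beta=1.0):
    """mean_yield, std_yield: arrays of shape (n_candidates, n_corners)."""
    # Optimistic (upper-confidence-bound) yield estimate per corner...
    ucb = mean_yield + beta * std_yield
    # ...and a design is only as good as its worst environmental corner.
    return ucb.min(axis=1)

rng = np.random.default_rng(1)
n_candidates, n_corners = 5, 3          # e.g. corners = (voltage, temperature) pairs
mean = rng.uniform(0.7, 0.99, (n_candidates, n_corners))
std = rng.uniform(0.0, 0.05, (n_candidates, n_corners))

scores = acquisition(mean, std)
best = int(np.argmax(scores))
print(f"simulate candidate {best}, score {scores[best]:.3f}")
```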
{"title":"Yield Optimization for Analog Circuits over Multiple Corners via Bayesian Neural Network: Enhancing Circuit Reliability under Environmental Variation","authors":"Nanlin Guo, Fulin Peng, Jiahe Shi, Fan Yang, Jun Tao, Xuan Zeng","doi":"10.1145/3626321","DOIUrl":"https://doi.org/10.1145/3626321","url":null,"abstract":"The reliability of circuits is significantly affected by process variations in manufacturing and environmental variation during operation. Current yield optimization algorithms take process variations into consideration to improve circuit reliability. However, the influence of environmental variations (e.g., voltage and temperature variations) is often ignored in current methods because of the high computational cost. In this paper, a novel and efficient approach named BNN-BYO is proposed to optimize the yield of analog circuits in multiple environmental corners. First, we use a Bayesian Neural Network (BNN) to simultaneously model the yields and POIs in multiple corners efficiently. Next, the multi-corner yield optimization can be performed by embedding BNN into Bayesian optimization framework. Since the correlation among yields and POIs in different corners is implicitly encoded in the BNN model, it provides great modeling capabilities for yields and their uncertainties to improve the efficiency of yield optimization. Our experimental results demonstrate that the proposed method can save up to 45.3% of simulation cost compared to other baseline methods to achieve the same target yield. In addition, for the same simulation cost, our proposed method can find better design points with 3.2% yield improvement.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135347477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Paul E. Calzada, Md Sami Ul Islam Sami, Kimia Zamiri Azar, Fahim Rahman, Farimah Farahmandi, Mark Tehranipoor
Over the past few decades, electronics have become commonplace in government, commercial, and social domains. These devices have developed rapidly, as seen in the prevalent use of Systems on Chip (SoCs) rather than separate integrated circuits on a single circuit board. As the semiconductor community begins conversations over the end of Moore’s Law, an approach that further increases both functionality per area and yield by placing dies with segregated functionality on a common interposer die, labeled a System in Package (SiP), is gaining attention. Thus, the chiplet and SiP space has grown to meet this demand, creating a new packaging paradigm, Advanced Packaging, and a new supply chain. This new distributed supply chain, with multiple chiplet developers and foundries, has increased counterfeit vulnerabilities. Chiplets are currently available on an open market, and their origin and authenticity are consequently difficult to ascertain. With this lack of control over the stages of the supply chain, counterfeit threats manifest at the chiplet, interposer, and SiP levels. In this paper, we identify counterfeit threats in the SiP domain and propose a mitigating framework utilizing blockchain for the effective traceability of SiPs to establish provenance. Our framework utilizes the Chiplet Hardware Security Module (CHSM) to authenticate a SiP throughout its life. To accomplish this, we leverage SiP information including Electronic Chip IDs (ECIDs) of chiplets, Combating Die and IC Recycling (CDIR) sensor information, documentation, test patterns and/or electrical measurements, grade, and part number of the SiP. We detail the structure of the blockchain and establish protocols for both enrolling trusted information into the blockchain network and authenticating the SiP. Our framework mitigates SiP counterfeit threats including recycled, remarked, and cloned SiPs, overproduced interposers, forged documentation, and substituted chiplets, while also detecting out-of-spec and defective SiPs.
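An illustrative sketch of the kind of enrollment record such a traceability blockchain could hold for a SiP: chiplet ECIDs, a CDIR sensor reading, grade, part number, and a test-data digest, hash-chained to the previous block. The field names, identifiers, and chain structure here are assumptions for illustration, not the paper's exact record format or consensus protocol.

```python
import hashlib
import json
import time

def make_block(prev_hash, sip_record):
    """Build a block whose hash covers its payload and the previous block's hash."""
    body = {"timestamp": time.time(), "prev_hash": prev_hash, "record": sip_record}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {"hash": digest, **body}

genesis = make_block("0" * 64, {"event": "genesis"})
enroll = make_block(genesis["hash"], {
    "event": "enrollment",
    "sip_part_number": "SIP-1234-A",             # hypothetical identifiers
    "grade": "industrial",
    "chiplet_ecids": ["ECID-01", "ECID-02"],
    "cdir_reading": 0.02,
    "test_pattern_digest": hashlib.sha256(b"golden responses").hexdigest(),
})

# Verification at a later life-cycle stage: recompute the hash and check linkage.
recomputed = hashlib.sha256(json.dumps(
    {k: enroll[k] for k in ("timestamp", "prev_hash", "record")},
    sort_keys=True).encode()).hexdigest()
assert recomputed == enroll["hash"] and enroll["prev_hash"] == genesis["hash"]
print("enrollment block verified")
```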
{"title":"Heterogeneous Integration Supply Chain Integrity through Blockchain and CHSM","authors":"Paul E. Calzada, Md Sami Ul Islam Sami, Kimia Zamiri Azar, Fahim Rahman, Farimah Farahmandi, Mark Tehranipoor","doi":"10.1145/3625823","DOIUrl":"https://doi.org/10.1145/3625823","url":null,"abstract":"Over the past few decades, electronics have become commonplace in government, commercial, and social domains. These devices have developed rapidly, as seen in the prevalent use of System on Chips (SoCs) rather than separate integrated circuits on a single circuit board. As the semiconductor community begins conversations over the end of Moore’s Law, an approach to further increase both functionality per area and yield using segregated functionality dies on a common interposer die, labeled a System in Package (SiP), is gaining attention. Thus, the chiplet and SiP space has grown to meet this demand, creating a new packaging paradigm, Advanced Packaging, and a new supply chain. This new distributed supply chain with multiple chiplet developers and foundries has augmented counterfeit vulnerabilities. Chiplets are currently available on an open market, and their origin and authenticity consequently are difficult to ascertain. With this lack of control over the stages of the supply chain, counterfeit threats manifest at the chiplet, interposer, and SiP levels. In this paper, we identify counterfeit threats in the SiP domain, and we propose a mitigating framework utilizing blockchain for the effective traceability of SiPs to establish provenance. Our framework utilizes the Chiplet Hardware Security Module (CHSM) to authenticate a SiP throughout its life. To accomplish this, we leverage SiP information including Electronic Chip IDs (ECIDs) of chiplets, Combating Die and IC Recycling (CDIR) sensor information, documentation, test patterns and/or electrical measurements, grade, and part number of the SiP. We detail the structure of the blockchain and establish protocols for both enrolling trusted information into the blockchain network and authenticating the SiP. Our framework mitigates SiP counterfeit threats including recycled, remarked, cloned, overproduced interposer, forged documentation, and substituted chiplet while detecting of out-of-spec and defective SiPs.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135347475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Martin Rapp, Heba Khdr, Nikita Krohmer, Jörg Henkel
Thermal optimization of a heterogeneous clustered multi-core processor under user-defined quality of service (QoS) targets requires application migration and dynamic voltage and frequency scaling (DVFS). However, selecting the core to execute each application and the voltage/frequency (V/f) levels of each cluster is a complex problem because 1) the diverse characteristics and QoS targets of applications require different optimizations, and 2) per-cluster DVFS requires a global optimization considering all running applications. State-of-the-art resource management for power or temperature minimization either relies on measurements that are commonly not available (such as power) or fails to consider all the dimensions of the optimization (e.g., by using simplified analytical models). To solve this, machine learning (ML) methods can be employed. In particular, imitation learning (IL) leverages the optimality of an oracle policy, yet at low run-time overhead, by training a model from oracle demonstrations. We are the first to employ IL for temperature minimization under QoS targets. We tackle the complexity by training a neural network (NN) at design time and accelerating the run-time NN inference using a neural processing unit (NPU). While such NN accelerators are becoming increasingly widespread, they have so far only been used to accelerate user applications. In contrast, we use, for the first time, an existing accelerator on a real platform to accelerate NN-based resource management. To show the superiority of IL over reinforcement learning (RL) for our targeted problem, we also develop a multi-agent RL-based management technique. Our evaluation on a HiKey 970 board with an Arm big.LITTLE CPU and NPU shows that our technique, TOP-IL, achieves significant temperature reductions at negligible run-time overhead. We compare TOP-IL against several techniques. Compared to the ondemand Linux governor, TOP-IL reduces the average temperature by up to 17 °C, with minimal QoS violations for both techniques. Compared to the RL policy, TOP-IL achieves 63% to 89% fewer QoS violations while resulting in similar average temperatures. Moreover, TOP-IL outperforms the RL policy in terms of stability. We additionally show that our IL-based technique also generalizes to different software (unseen applications) and even hardware (different cooling) than used for training.
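A compact sketch of the imitation-learning idea: an oracle policy (available only at design time, e.g. via exhaustive search in simulation) labels system states with the best (cluster, V/f-level) action, and a small neural network is trained to imitate it so that inference is cheap enough to run online (on an NPU in the paper's setting). The state encoding, the toy oracle, and the three-action set below are made up.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def oracle_action(state):
    """Toy stand-in for the design-time oracle: state = (temp_big, temp_little,
    app_load). Prefer the LITTLE cluster / lower V/f when the chip is hot."""
    temp_big, temp_little, app_load = state
    if temp_big > 70 and app_load < 0.5:
        return 0                      # action 0: LITTLE cluster, low V/f
    if app_load > 0.8:
        return 2                      # action 2: big cluster, high V/f
    return 1                          # action 1: big cluster, mid V/f

# Design time: collect oracle demonstrations and train the imitation policy.
states = rng.uniform([40, 40, 0], [90, 90, 1], size=(5000, 3))
actions = np.array([oracle_action(s) for s in states])
policy = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500,
                       random_state=0).fit(states, actions)

# Run time: one cheap forward pass per decision epoch.
print(policy.predict([[85.0, 60.0, 0.3]]))   # hot big cluster, light app
```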
{"title":"NPU-Accelerated Imitation Learningfor Thermal Optimizationof QoS-Constrained Heterogeneous Multi-Cores","authors":"Martin Rapp, Heba Khdr, Nikita Krohmer, Jörg Henkel","doi":"10.1145/3626320","DOIUrl":"https://doi.org/10.1145/3626320","url":null,"abstract":"Thermal optimization of a heterogeneous clustered multi-core processor under user-defined quality of service (QoS) targets requires application migration and dynamic voltage and frequency scaling (DVFS). However, selecting the core to execute each application and the voltage/frequency (V/f) levels of each cluster is a complex problem because 1) the diverse characteristics and QoS targets of applications require different optimizations, and 2) per-cluster DVFS requires a global optimization considering all running applications. State-of-the-art resource management for power or temperature minimization either relies on measurements that are commonly not available (such as power) or fails to consider all the dimensions of the optimization (e.g., by using simplified analytical models). To solve this, machine learning (ML) methods can be employed. In particular, imitation learning (IL) leverages the optimality of an oracle policy, yet at low run-time overhead, by training a model from oracle demonstrations. We are the first to employ IL for temperature minimization under QoS targets. We tackle the complexity by training neural network (NN) at design time and accelerate the run-time NN inference using a neural processing unit (NPU). While such NN accelerators are becoming increasingly widespread, they are so far only used to accelerate user applications. In contrast, we use for the first time an existing accelerator on a real platform to accelerate NN-based resource management. To show the superiority of IL compared to reinforcement learning (RL) in our targeted problem, we also develop multi-agent RL-based management. Our evaluation on a HiKey 970 board with an Arm big.LITTLE CPU and NPU shows that IL achieves significant temperature reductions at a negligible run-time overhead. We compare TOP-IL against several techniques. Compared to ondemand Linux governor, TOP-IL reduces the average temperature by up to 17 °C at minimal QoS violations for both techniques. Compared to the RL policy, our TOP-IL achieves 63 % to 89 % fewer QoS violations while resulting similar average temperatures. Moreover, TOP-IL outperforms the RL policy in terms of stability. We additionally show that our IL-based technique also generalizes to different software (unseen applications) and even hardware (different cooling) than used for training.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135483021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wenxiong Lin, Haojie Wu, Peng Gao, Wenjun Luo, Shuting Cai, Xiaoming Xiong
Multi-FPGA systems are widely used in various circuit-design-related areas, such as hardware emulation, virtual prototyping, and chiplet design methodologies. However, a physical resource clash between inter-FPGA signals and I/O pins can create a bottleneck in a multi-FPGA system: inter-FPGA signals often outnumber I/O pins. To solve this problem, time-division multiplexing (TDM) is introduced. However, the undue time delay caused by TDM may impair the performance of a multi-FPGA system, so a more efficient TDM solution is needed. In this work, we propose a new routing sequence strategy to improve the efficiency of TDM. Our strategy consists of two parts: a weighted routing algorithm and TDM assignment optimization. The algorithm takes the weight of each net into account to generate a high-quality routing topology. Then, a net-based TDM assignment is performed to obtain a lower TDM ratio for the multi-FPGA system. Experiments on the public dataset of the CAD Contest 2019 at ICCAD show that our routing sequence strategy achieves good results. Especially in the testcases with unbalanced designs, the performance of multi-FPGA systems was improved by up to 2.63. Moreover, we outperformed the top two contest finalists in terms of TDM results in most of the testcases.
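A small sketch of net-based TDM assignment between one FPGA pair: when the signals crossing the pair outnumber the physical wires, signals sharing a wire get a time-division-multiplexing ratio. The even-ratio rule mirrors the contest-style formulation the paper targets (assumed here), while the grouping heuristic, weights, and wire counts are purely illustrative.

```python
import math

def assign_tdm(signals, num_wires):
    """signals: list of (net_id, weight). Returns {net_id: tdm_ratio}."""
    order = sorted(signals, key=lambda s: -s[1])          # critical nets first
    k, r = divmod(len(order), num_wires)
    # r wires carry k+1 signals, the remaining wires carry k. Fill the
    # lightly-loaded wires with the most critical nets so they get the
    # smallest (or no) multiplexing ratio.
    group_sizes = sorted([k + 1] * r + [k] * (num_wires - r))
    ratios, idx = {}, 0
    for size in group_sizes:
        # One signal per wire needs no multiplexing; otherwise round the
        # sharing degree up to an even TDM ratio.
        ratio = 1 if size <= 1 else 2 * math.ceil(size / 2)
        for net_id, _ in order[idx:idx + size]:
            ratios[net_id] = ratio
        idx += size
    return ratios

if __name__ == "__main__":
    sigs = [(f"n{i}", w) for i, w in enumerate([5, 5, 3, 2, 2, 1, 1])]
    print(assign_tdm(sigs, num_wires=3))
```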
{"title":"Sequential Routing-Based Time-Division Multiplexing Optimization for Multi-FPGA Systems","authors":"Wenxiong Lin, Haojie Wu, Peng Gao, Wenjun Luo, Shuting Cai, Xiaoming Xiong","doi":"10.1145/3626322","DOIUrl":"https://doi.org/10.1145/3626322","url":null,"abstract":"Multi-FPGA systems are widely used in various circuit design-related areas, such as hardware emulation, virtual prototypes, and chiplet design methodologies. However, a physical resource clash between inter-FPGA signals and I/O pins can create a bottleneck in a multi-FPGA system. Specifically, inter-FPGA signals often outnumber I/O pins in a multi-FPGA system. To solve this problem, time-division multiplexing (TDM) is introduced. However, undue time delay caused by TDM may impair the performance of a multi-FPGA system. Therefore, a more efficient TDM solution is needed. In this work, we propose a new routing sequence strategy to improve the efficiency of TDM. Our strategy consists of two parts: a weighted routing algorithm and TDM assignment optimization. The algorithm takes into account the weight of the net to generate a high-quality routing topology. Then, a net-based TDM assignment is performed to obtain a lower TDM ratio for the multi-FPGA system. Experiments on the public dataset of CAD Contest 2019 at ICCAD showed that our routing sequence strategy achieved good results. Especially in those testcases of unbalanced designs, the performance of multi-FPGA systems was improved up to 2.63. Moreover, we outperformed the top two contest finalists as to TDM results in most of the testcases.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"435 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135482532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enes Sağlıcan, Engin Afacan
Thanks to the enhanced computational capacity of modern computers, even sophisticated analog/RF circuit sizing problems can be solved via electronic design automation (EDA) tools. Recently, several analog/RF circuit optimization algorithms have been successfully applied to automate the analog/RF circuit design process. Conventionally, metaheuristic algorithms are widely used in the optimization process. Among various nature-inspired algorithms, evolutionary algorithms (EAs) have been preferred due to their advantages (robustness, efficiency, accuracy, etc.) over other algorithms. Furthermore, EAs have diversified, and several distinguished analog/RF circuit optimization approaches for single-, multi-, and many-objective problems have been reported in the literature. However, there are conflicting claims about the performance of these algorithms, and no objective performance comparison has been presented yet. In previous work, only a few case-study circuits have been used to demonstrate the superiority of the utilized algorithm, so comparisons have been limited to those specific circuits. The underlying reason is that the literature lacks a generic benchmark for the analog/RF circuit sizing problem. To address these issues, we propose a comprehensive comparison of the two most popular evolutionary computation algorithms, namely the Non-dominated Sorting Genetic Algorithm II (NSGA-II) and the Multi-Objective Evolutionary Algorithm based on Decomposition (MOEA/D), in this paper. For that purpose, we introduce two ad-hoc testbenches for analog (ANLG) and radio frequency (RF) circuits including common building blocks. The comparison is made in both the multi- and many-objective domains, and the performance of the algorithms is quantitatively revealed through well-known Pareto-optimal front quality metrics.
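A minimal sketch of one common Pareto-front quality metric used in such comparisons: the 2-D hypervolume indicator for a minimization problem, where a larger dominated area (bounded by a reference point) means a better front. The two example "fronts" and the reference point are fabricated; real studies typically also report IGD, spread, and similar metrics.

```python
def hypervolume_2d(front, ref):
    """front: list of (f1, f2) to minimize; ref: reference point worse than all."""
    # Keep only non-dominated points.
    nd = [p for p in front
          if not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in front)]
    nd.sort()                                   # ascending in f1 -> descending in f2
    hv, pts = 0.0, nd + [(ref[0], None)]
    for (f1, f2), (next_f1, _) in zip(nd, pts[1:]):
        hv += (next_f1 - f1) * (ref[1] - f2)    # strip between consecutive points
    return hv

# Hypothetical trade-off fronts (e.g., -gain vs. power) returned by two optimizers.
front_a = [(1.0, 8.0), (2.0, 5.0), (4.0, 3.0)]
front_b = [(1.5, 7.0), (3.0, 4.5), (5.0, 4.0)]
ref = (6.0, 10.0)
print("A:", hypervolume_2d(front_a, ref), "B:", hypervolume_2d(front_b, ref))
```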
{"title":"MOEA/D vs. NSGA-II: A Comprehensive Comparison for Multi/Many Objective Analog/RF Circuit Optimization Through A Generic Benchmark","authors":"Enes Sağlıcan, Engin Afacan","doi":"10.1145/3626096","DOIUrl":"https://doi.org/10.1145/3626096","url":null,"abstract":"Thanks to the enhanced computational capacity of modern computers, even sophisticated analog/RF circuit sizing problems can be solved via electronic design automation (EDA) tools. Recently, several analog/RF circuit optimization algorithms have been successfully applied to automatize the analog/RF circuit design process. Conventionally, metaheuristic algorithms are widely used in optimization process. Among various nature-inspired algorithms, evolutionary algorithms (EAs) have been more preferred due to their superiorities (robustness, efficiency, accuracy etc.) over the other algorithms. Furthermore, EAs have been diversified and several distinguished analog/RF circuit optimization approaches for single-, multi-, and many- objective problems have been reported in the literature. However, there are conflicting claims on the performance of these algorithms and no objective performance comparison has been revealed yet. In the previous work, only a few case study circuits have been under test to demonstrate the superiority of the utilized algorithm, so a limited comparison has been made for only these specific circuits. The underlying reason is that the literature lacks a generic benchmark for analog/RF circuit sizing problem. To address these issues, we propose a comprehensive comparison of the most popular two evolutionary computation algorithms, namely Non-Sorting Genetic Algorithm-II (NSGA-II) and Multi-Objective Evolutionary Algorithm based Decomposition (MOEA/D), in this paper. For that purpose, we introduce two ad-hoc testbenches for analog (ANLG) and radio frequency (RF) circuits including the common building blocks. The comparison has been made at both multi- and many- objective domains and the performances of algorithms have been quantitatively revealed through the well-known Pareto-optimal front quality metrics.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135385291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bo Ding, Jinglei Huang, Junpeng Wang, Qi Xu, Song Chen, Yi Kang
Some field programmable gate arrays (FPGAs) can be partially dynamically reconfigured, with heterogeneous resources distributed on the chip. An FPGA-based partially dynamically reconfigurable system (FPGA-PDRS) can be used to accelerate computing and improve computing flexibility. However, FPGA-PDRS design has traditionally been manual. Automating FPGA-PDRS design requires solving the problems of task module partitioning, scheduling, and floorplanning on heterogeneous resources. Existing works only partly solve the problems of the FPGA-PDRS automation process or model only homogeneous resources. To better solve these problems and narrow the gap between algorithm and application, in this paper we propose a complete workflow including three parts: pre-processing, which generates lists of candidate task module shapes according to the resource requirements; an exploration process, which searches for a solution to task module partitioning, scheduling, and floorplanning; and post-optimization, which improves the floorplan success rate. Experimental results show that, compared with state-of-the-art work, the pre-processing step reduces the occupied area of task modules by 6% on average, and the proposed complete workflow improves performance by 9.6% and reduces communication cost by 14.2% while improving the reuse rate of the heterogeneous resources on the chip. Based on the solution generated by the exploration process, the post-optimization process improves the floorplan success rate by 11%.
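A simplified sketch of the pre-processing idea: enumerate candidate rectangular shapes (column span x height) for one task module on a column-based heterogeneous fabric, keeping the shapes that cover the module's CLB/BRAM/DSP requirement. The fabric layout, per-tile capacities, and the requirement are invented for illustration and do not reflect the paper's exact model.

```python
from collections import Counter

# Hypothetical fabric: column type per x position; each tile row of a column
# provides one unit of its resource type.
FABRIC_COLUMNS = ["CLB", "CLB", "BRAM", "CLB", "DSP", "CLB", "CLB", "BRAM"]
FABRIC_HEIGHT = 20                      # tile rows

def candidate_shapes(requirement, max_area=None):
    """requirement: e.g. {'CLB': 30, 'BRAM': 8, 'DSP': 4} -> list of
    (start_col, width, height) shapes that satisfy it."""
    shapes = []
    n = len(FABRIC_COLUMNS)
    for start in range(n):
        for width in range(1, n - start + 1):
            per_row = Counter(FABRIC_COLUMNS[start:start + width])
            for height in range(1, FABRIC_HEIGHT + 1):
                have = {r: per_row.get(r, 0) * height for r in requirement}
                if all(have[r] >= requirement[r] for r in requirement):
                    if max_area is None or width * height <= max_area:
                        shapes.append((start, width, height))
                    break               # taller shapes are dominated; stop here
    return shapes

if __name__ == "__main__":
    req = {"CLB": 30, "BRAM": 8, "DSP": 4}
    for s in candidate_shapes(req, max_area=120):
        print(s)
```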
{"title":"Task modules Partitioning, Scheduling and Floorplanning for Partially Dynamically Reconfigurable Systems with Heterogeneous Resources","authors":"Bo Ding, Jinglei Huang, Junpeng Wang, Qi Xu, Song Chen, Yi Kang","doi":"10.1145/3625295","DOIUrl":"https://doi.org/10.1145/3625295","url":null,"abstract":"Some field programmable gate arrays (FPGAs) can be partially dynamically reconfigurable with heterogeneous resources distributed on the chip. FPGA-based partially dynamically reconfigurable system (FPGA-PDRS) can be used to accelerate computing and improve computing flexibility. However, the traditional design of FPGA-PDRS is based on manual design. Implementing the automation of FPGA-PDRS needs to solve the problems of task modules partitioning, scheduling, and floorplanning on heterogeneous resources. Existing works only partly solve problems for the automation process of FPGA-PDRS or model homogeneous resource for FPGA-PDRS. To better solve the problems in the automation process of FPGA-PDRS and narrow the gap between algorithm and application, in this paper, we propose a complete workflow including three parts: pre-processing to generate the lists of task module candidate shapes according to the resource requirements, exploration process to search the solution of task modules partitioning, scheduling, and floorplanning, and post-optimization to improve the floorplan success rate. Experimental results show that, compared with state-of-the-art work, the pre-processing process can reduce the occupied area of task modules by 6% on average; the proposed complete workflow can improve performance by 9.6%, and reduce communication cost by 14.2% with improving the resources reuse rate of the heterogeneous resources on the chip. Based on the solution generated by the exploration process, the post-optimization process can improve the floorplan success rate by 11%.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134958022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rihui Sun, Pengfei Qiu, Yongqiang Lyu, Jian Dong, Haixia Wang, Dongsheng Wang, Gang Qu
Graphics Processing Units (GPUs) are widely used as deep learning accelerators because of their high performance and low power consumption. They have also remained secure against hardware-induced transient fault injection attacks, a classic type of attack developed on other computing platforms. In this work, we demonstrate that well-trained machine learning models are robust against hardware fault injection attacks when the faults are generated randomly. However, we discover that these models have components, which we refer to as sensitive targets, that are vulnerable to faults. By exploiting this vulnerability, we propose the Lightning attack, which precisely strikes a model's sensitive targets with hardware-induced transient faults based on Dynamic Voltage and Frequency Scaling (DVFS). We design a sensitive-target search algorithm to find the most critical processing units of Deep Neural Network (DNN) models that determine the inference results, and develop a genetic algorithm to automatically optimize the attack parameters for DVFS to induce faults. Experiments on three commodity Nvidia GPUs with four widely used DNN models show that the proposed Lightning attack can reduce inference accuracy by 69.1% on average for non-targeted attacks and, more interestingly, achieve a success rate of 67.9% for targeted attacks.
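A small sketch illustrating the robustness observation above: randomly flipping a few low mantissa bits in the weights of a toy (untrained, synthetic) network barely changes its outputs, which is why random fault injection is a weak attack and targeted analysis is needed instead. The model, data, and fault counts are synthetic; this is a robustness evaluation, not the paper's attack.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(64, 32)), rng.normal(size=(32, 10))
x = rng.normal(size=(100, 64))

def forward(w1, w2):
    h = np.maximum(x @ w1, 0.0)             # ReLU hidden layer
    return (h @ w2).argmax(axis=1)          # predicted class per sample

def flip_random_bits(w, n_faults, bit=10):
    """Flip a low mantissa bit of n_faults randomly chosen float32 weights."""
    flat = w.astype(np.float32).ravel().copy()
    idx = rng.choice(flat.size, n_faults, replace=False)
    as_int = flat.view(np.uint32)
    as_int[idx] ^= np.uint32(1 << bit)
    return as_int.view(np.float32).reshape(w.shape).astype(w.dtype)

baseline = forward(W1, W2)
faulty = forward(flip_random_bits(W1, n_faults=50), W2)
print("fraction of predictions changed:", np.mean(baseline != faulty))
```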
{"title":"Lightning: Leveraging DVFS-induced Transient Fault Injection to Attack Deep Learning Accelerator of GPUs","authors":"Rihui sun, Pengfei Qiu, Yongqiang Lyu, Jian Dong, Haixia Wang, Dongsheng Wang, Gang Qu","doi":"10.1145/3617893","DOIUrl":"https://doi.org/10.1145/3617893","url":null,"abstract":"Graphics Processing Units(GPU) are widely used as deep learning accelerators because of its high performance and low power consumption. Additionally, it remains secure against hardware-induced transient fault injection attacks, a classic type of attacks that have been developed on other computing platforms. In this work, we demonstrate that well-trained machine learning models are robust against hardware fault injection attacks when the faults are generated randomly. However, we discover that these models have components, which we refer to as sensitive targets, that are vulnerable to faults. By exploiting this vulnerability, we propose the Lightning attack, which precisely strikes the model’s sensitive targets with hardware-induced transient faults based on the Dynamic Voltage and Frequency Scaling (DVFS). We design a sensitive targets search algorithm to find the most critical processing units of Deep Neural Network(DNN) models determining the inference results, and develop a genetic algorithm to automatically optimize the attack parameters for DVFS to induce faults. Experiments on three commodity Nvidia GPUs for four widely-used DNN models show that the proposed Lightning attack can reduce the inference accuracy by 69.1% on average for non-targeted attacks, and, more interestingly, achieve a success rate of 67.9% for targeted attacks.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136308878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}