Mathematical Framework for Optimizing Crossbar Allocation for ReRAM-based CNN Accelerators
Wanqian Li, Yinhe Han, Xiaoming Chen
ACM Transactions on Design Automation of Electronic Systems. Published: 2023-11-02. DOI: https://doi.org/10.1145/3631523

Resistive random-access memory (ReRAM) has been widely used to accelerate convolutional neural networks (CNNs) thanks to its analog in-memory computing capability. ReRAM crossbars not only store layers’ weights but also perform in-situ matrix-vector multiplications, the core operations of CNNs. To boost the performance of ReRAM-based CNN accelerators, crossbars can be duplicated to exploit more intra-layer parallelism. The crossbar allocation scheme can significantly influence both the computing throughput and the bandwidth requirements of ReRAM-based CNN accelerators. Under resource constraints (i.e., limited crossbars and memory bandwidth), how to find the optimal number of crossbars for each layer to maximize the inference performance of an entire CNN is an unsolved problem. In this work, we find the optimal crossbar allocation scheme by mathematically modeling the problem as a constrained optimization problem and solving it with a dynamic-programming-based solver. Experiments demonstrate that our model for CNN inference time is highly accurate, and the proposed framework obtains solutions with near-optimal inference time. We also emphasize that communication (i.e., data access) is an important factor that must be considered when determining the optimal crossbar allocation scheme.

Optimal Model Partitioning with Low-Overhead Profiling on the PIM-based Platform for Deep Learning Inference
Seok Young Kim, Jaewook Lee, Yoonah Paik, Chang Hyun Kim, Won Jun Lee, Seon Wook Kim
ACM Transactions on Design Automation of Electronic Systems. Published: 2023-11-01. DOI: https://doi.org/10.1145/3628599

Recently, Processing-in-Memory (PIM) has become a promising solution for energy-efficient computation in data-intensive applications by placing computation near or inside memory. In most Deep Learning (DL) frameworks, a user manually partitions a model’s computational graph (CG) onto the computing devices by considering the devices’ capabilities and the data transfers. Deep Neural Network (DNN) models have become increasingly complex to improve accuracy; thus, it is exceptionally challenging to partition the execution for the best performance, especially on a PIM-based platform that requires frequent offloading of large amounts of data. This paper proposes two novel algorithms for DL inference to resolve this challenge: low-overhead profiling and optimal model partitioning. First, we reconstruct the CG by considering the devices’ capabilities so that it represents all possible scheduling paths. Second, we develop a profiling algorithm that finds the minimum set of profiling paths required to measure all node and edge costs of the reconstructed CG. Finally, we devise a model partitioning algorithm that obtains the minimum execution time via dynamic programming with the profiled data. We evaluated our work by executing the BERT, RoBERTa, and GPT-2 models with various sequence lengths on ARM multicores with a PIM-modeled FPGA platform. For the platform’s three computing devices, i.e., CPU serial, CPU parallel, and PIM execution, we could obtain all costs in only four profiling runs: three for node costs and one for edge costs. Also, our model partitioning algorithm achieved the highest performance in all experiments, outperforming execution with manually assigned device priorities and a state-of-the-art greedy approach.

Security of Electrical, Optical and Wireless On-Chip Interconnects: A Survey
Hansika Weerasena, Prabhat Mishra
ACM Transactions on Design Automation of Electronic Systems. Published: 2023-10-30. DOI: https://doi.org/10.1145/3631117

The advancement of manufacturing technologies has enabled the integration of more intellectual property (IP) cores on the same system-on-chip (SoC). Scalable and high-throughput on-chip communication architectures have become a vital component of today’s SoCs. Diverse technologies such as electrical, wireless, optical, and hybrid are available for on-chip communication, with different architectures supporting them. The on-chip communication subsystem is shared across all IPs and used continuously throughout the lifetime of the SoC. Therefore, the security of on-chip communication is crucial, because exploiting any vulnerability in it would be a goldmine for an attacker. In this survey, we provide a comprehensive review of threat models, attacks, and countermeasures across diverse on-chip communication technologies as well as sophisticated architectures.

BOOM-Explorer: RISC-V BOOM Microarchitecture Design Space Exploration
Chen Bai, Qi Sun, Jianwang Zhai, Yuzhe Ma, Bei Yu, Martin D.F. Wong
ACM Transactions on Design Automation of Electronic Systems. Published: 2023-10-26. DOI: https://doi.org/10.1145/3630013

Microarchitecture parameter tuning is critical in the microprocessor design cycle. It is a non-trivial design space exploration (DSE) problem due to the large solution space, cycle-accurate simulators’ modeling inaccuracy, and the high simulation runtime of performance evaluations. Previous methods require either massive expert effort to construct interpretable equations or high computing resources to train black-box prediction models. This paper follows the black-box methods because they generally achieve better solution quality than analytical methods. We summarize two learned lessons and propose BOOM-Explorer accordingly. First, embedding microarchitecture domain knowledge in the DSE improves the solution quality. Second, BOOM-Explorer makes microarchitecture DSE for register-transfer-level designs feasible within a limited time budget. We further enhance BOOM-Explorer with diversity guidance, improving the algorithm’s performance. Experimental results with the RISC-V Berkeley Out-of-Order Machine under a 7-nm technology show that our proposed methodology achieves, on average, 18.75% higher Pareto hypervolume, 35.47% less average distance to the reference set, and 65.38% less overall running time compared to previous approaches.

NeuroCool: Dynamic Thermal Management of 3D DRAM for Deep Neural Networks through Customized Prefetching
Shailja Pandey, Lokesh Siddhu, Preeti Ranjan Panda
ACM Transactions on Design Automation of Electronic Systems. Published: 2023-10-23. DOI: https://doi.org/10.1145/3630012

Deep neural network (DNN) implementations are typically characterized by huge data sets and concurrent computation, resulting in a demand for high memory bandwidth due to intensive data movement between processors and off-chip memory. Performing DNN inference on general-purpose cores/edge devices is gaining traction to enhance user experience and reduce latency. The mismatch between CPU and conventional DRAM speeds leads to underutilization of compute capabilities, increasing inference time. 3D DRAM is a promising solution to effectively fulfill the bandwidth requirements of high-throughput DNNs. However, due to the high power density of stacked architectures, 3D DRAMs need dynamic thermal management (DTM), which incurs performance overhead due to memory-induced CPU throttling. We study the thermal impact of DNN applications running on a 3D DRAM system and make a case for a memory temperature-aware customized prefetch mechanism that reduces DTM overheads and significantly improves performance. In our proposed NeuroCool DTM policy, we intelligently place either DRAM ranks or tiers in a low-power state based on the DNN layer characteristics and access rates. We establish the generalization of our approach through training and test data sets comprising diverse data points from widely used DNN applications. Experimental results on popular DNNs show that NeuroCool yields an average performance gain of 44% (as high as 52%) and a memory energy improvement of 43% (as high as 69%) over general-purpose DTM policies.

Construction of All Multilayer Monolithic RSMTs and Its Application to Monolithic 3D IC Routing
Monzurul Islam Dewan, Sheng-En David Lin, Dae Hyun Kim
ACM Transactions on Design Automation of Electronic Systems. Published: 2023-10-11. DOI: https://doi.org/10.1145/3626958

Monolithic three-dimensional (3D) integration allows ultra-thin silicon tiers to be stacked in a single package. This high-density stacking is gaining interest and becoming more popular because it offers smaller footprint areas, shorter wirelength, higher performance, and lower power consumption than conventional planar fabrication technologies. The physical design of monolithic 3D (M3D) integrated circuits (ICs) involves several steps such as 3D placement, 3D clock-tree synthesis, 3D routing, and 3D optimization. Among these, 3D routing is particularly time-consuming due to countless routing blockages. Therefore, 3D routers proposed in the literature insert monolithic inter-layer vias (MIVs) and perform tier-by-tier routing in two sub-steps. In this paper, we propose an algorithm to build a routing topology database (DB) used to construct all multilayer monolithic rectilinear Steiner minimum trees (MMRSMTs) on the 3D Hanan grid. To demonstrate the effectiveness of the DB in various applications, we use it to construct timing-driven 3D routing topologies and perform congestion-aware global routing on 3D designs. We anticipate that the algorithm and the DB will help 3D routers reduce the runtime of the MIV insertion step and improve the quality of 3D routing.

A Machine Learning Approach to Improving Timing Consistency between Global Route and Detailed Route
Vidya A. Chhabria, Wenjing Jiang, Andrew B. Kahng, Sachin S. Sapatnekar
ACM Transactions on Design Automation of Electronic Systems. Published: 2023-10-10. DOI: https://doi.org/10.1145/3626959

Because routing information is unavailable in design stages prior to detailed routing (DR), timing prediction and optimization pose major challenges. Inaccurate timing prediction wastes design effort, hurts circuit performance, and may lead to design failure. This work focuses on timing prediction after clock-tree synthesis and placement legalization, which is the earliest opportunity to time and optimize a “complete” netlist. The paper first documents that having “oracle knowledge” of the final post-DR parasitics enables post-global-routing (GR) optimization to produce improved final timing outcomes. To bridge the gap between GR-based parasitic and timing estimation and post-DR results during post-GR optimization, machine learning (ML)-based models are proposed, including features for macro blockages that yield accurate predictions for designs with macros. A set of experimental evaluations demonstrates that these models are more accurate than GR-based timing estimation. When used during post-GR optimization, the ML-based models show demonstrable improvements in post-DR circuit performance. The methodology is applied to two different tool flows – OpenROAD and a commercial tool flow – and results on an open-source 45nm bulk and a commercial 12nm FinFET enablement show improvements in post-DR timing slack metrics without increasing congestion. The models generalize to designs generated under different clock period constraints and are robust to training data with small levels of noise.

Yield Optimization for Analog Circuits over Multiple Corners via Bayesian Neural Network: Enhancing Circuit Reliability under Environmental Variation
Nanlin Guo, Fulin Peng, Jiahe Shi, Fan Yang, Jun Tao, Xuan Zeng
ACM Transactions on Design Automation of Electronic Systems. Published: 2023-10-06. DOI: https://doi.org/10.1145/3626321

The reliability of circuits is significantly affected by process variations during manufacturing and environmental variations during operation. Current yield optimization algorithms take process variations into consideration to improve circuit reliability. However, the influence of environmental variations (e.g., voltage and temperature variations) is often ignored because of its high computational cost. In this paper, a novel and efficient approach named BNN-BYO is proposed to optimize the yield of analog circuits over multiple environmental corners. First, we use a Bayesian Neural Network (BNN) to efficiently model the yields and POIs in multiple corners simultaneously. Next, multi-corner yield optimization is performed by embedding the BNN into a Bayesian optimization framework. Since the correlation among yields and POIs in different corners is implicitly encoded in the BNN model, it provides great modeling capability for yields and their uncertainties, improving the efficiency of yield optimization. Our experimental results demonstrate that the proposed method saves up to 45.3% of simulation cost compared to baseline methods in achieving the same target yield. In addition, for the same simulation cost, our method finds better design points with 3.2% yield improvement.

Heterogeneous Integration Supply Chain Integrity through Blockchain and CHSM
Paul E. Calzada, Md Sami Ul Islam Sami, Kimia Zamiri Azar, Fahim Rahman, Farimah Farahmandi, Mark Tehranipoor
ACM Transactions on Design Automation of Electronic Systems. Published: 2023-10-06. DOI: https://doi.org/10.1145/3625823

Over the past few decades, electronics have become commonplace in government, commercial, and social domains. These devices have developed rapidly, as seen in the prevalent use of Systems on Chip (SoCs) rather than separate integrated circuits on a single circuit board. As the semiconductor community begins conversations about the end of Moore’s Law, an approach that further increases both functionality per area and yield by placing dies with segregated functionality on a common interposer die, termed a System in Package (SiP), is gaining attention. The chiplet and SiP space has grown to meet this demand, creating a new packaging paradigm, Advanced Packaging, and a new supply chain. This distributed supply chain, with multiple chiplet developers and foundries, has increased counterfeit vulnerabilities. Chiplets are currently available on an open market, and their origin and authenticity are consequently difficult to ascertain. With this lack of control over the stages of the supply chain, counterfeit threats manifest at the chiplet, interposer, and SiP levels. In this paper, we identify counterfeit threats in the SiP domain and propose a mitigating framework that utilizes blockchain for effective traceability of SiPs to establish provenance. Our framework utilizes a Chiplet Hardware Security Module (CHSM) to authenticate a SiP throughout its life. To accomplish this, we leverage SiP information including Electronic Chip IDs (ECIDs) of chiplets, Combating Die and IC Recycling (CDIR) sensor readings, documentation, test patterns and/or electrical measurements, and the grade and part number of the SiP. We detail the structure of the blockchain and establish protocols both for enrolling trusted information into the blockchain network and for authenticating the SiP. Our framework mitigates SiP counterfeit threats including recycled, remarked, and cloned SiPs, overproduced interposers, forged documentation, and substituted chiplets, while also detecting out-of-spec and defective SiPs.

NPU-Accelerated Imitation Learning for Thermal Optimization of QoS-Constrained Heterogeneous Multi-Cores
Martin Rapp, Heba Khdr, Nikita Krohmer, Jörg Henkel
ACM Transactions on Design Automation of Electronic Systems. Published: 2023-10-05. DOI: https://doi.org/10.1145/3626320

Thermal optimization of a heterogeneous clustered multi-core processor under user-defined quality-of-service (QoS) targets requires application migration and dynamic voltage and frequency scaling (DVFS). However, selecting the core to execute each application and the voltage/frequency (V/f) level of each cluster is a complex problem, because 1) the diverse characteristics and QoS targets of applications require different optimizations, and 2) per-cluster DVFS requires a global optimization considering all running applications. State-of-the-art resource management for power or temperature minimization either relies on measurements that are commonly unavailable (such as power) or fails to consider all dimensions of the optimization (e.g., by using simplified analytical models). Machine learning (ML) methods can solve this. In particular, imitation learning (IL) leverages the optimality of an oracle policy, yet at low run-time overhead, by training a model from oracle demonstrations. We are the first to employ IL for temperature minimization under QoS targets. We tackle the complexity by training a neural network (NN) at design time and accelerating run-time NN inference using a neural processing unit (NPU). While such NN accelerators are becoming increasingly widespread, they have so far only been used to accelerate user applications. In contrast, we use an existing accelerator on a real platform, for the first time, to accelerate NN-based resource management. To show the superiority of IL over reinforcement learning (RL) for our target problem, we also develop a multi-agent RL-based management policy. Our evaluation on a HiKey 970 board with an Arm big.LITTLE CPU and an NPU shows that our technique, TOP-IL, achieves significant temperature reductions at negligible run-time overhead. Compared to the ondemand Linux governor, TOP-IL reduces the average temperature by up to 17 °C with minimal QoS violations. Compared to the RL policy, TOP-IL achieves 63% to 89% fewer QoS violations while yielding similar average temperatures. Moreover, TOP-IL outperforms the RL policy in terms of stability. We additionally show that our IL-based technique generalizes to software (unseen applications) and even hardware (different cooling) that differ from those used for training.