Shailja Pandey, Lokesh Siddhu, Preeti Ranjan Panda
Deep neural network (DNN) implementations are typically characterized by huge data sets and concurrent computation, resulting in a demand for high memory bandwidth due to intensive data movement between processors and off-chip memory. Performing DNN inference on general-purpose cores at the edge is gaining traction as a way to enhance user experience and reduce latency. The mismatch between CPU and conventional DRAM speeds leads to underutilization of the compute capabilities, increasing inference time. 3D DRAM is a promising solution to effectively fulfill the bandwidth requirement of high-throughput DNNs. However, due to the high power density of stacked architectures, 3D DRAMs need dynamic thermal management (DTM), which incurs performance overhead through memory-induced CPU throttling. We study the thermal impact of DNN applications running on a 3D DRAM system, and make a case for a memory temperature-aware customized prefetch mechanism to reduce DTM overheads and significantly improve performance. In our proposed NeuroCool DTM policy, we intelligently place either DRAM ranks or tiers in a low-power state, using the DNN layer characteristics and access rate. We establish the generalization of our approach through training and test data sets comprising diverse data points from widely used DNN applications. Experimental results on popular DNNs show that NeuroCool achieves an average performance gain of 44% (as high as 52%) and a memory energy improvement of 43% (as high as 69%) over general-purpose DTM policies.
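A minimal, illustrative sketch (not the paper's implementation) of the idea behind such a policy: per DNN layer, decide which 3D-DRAM tiers to place in a low-power state and how aggressively to prefetch, based on temperature and access rate. All thresholds, layer statistics, and the Tier class are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    tier_id: int
    temperature_c: float      # current sensor reading
    accesses_per_us: float    # observed access rate for the running layer

THROTTLE_TEMP_C = 85.0        # hypothetical DTM trigger temperature
IDLE_ACCESS_RATE = 50.0       # below this rate a tier is a low-power candidate

def plan_layer(tiers, layer_reuse_distance):
    """Return (tiers_to_power_down, prefetch_degree) for the next layer."""
    # Tiers that are hot but lightly used are put into a low-power state,
    # cooling the stack before DTM has to throttle the CPU.
    power_down = [t.tier_id for t in tiers
                  if t.temperature_c > THROTTLE_TEMP_C - 5.0
                  and t.accesses_per_us < IDLE_ACCESS_RATE]

    # Prefetch more aggressively when the layer streams data with little reuse
    # (e.g., large convolution weights), and back off when the stack is hot.
    hottest = max(t.temperature_c for t in tiers)
    if hottest > THROTTLE_TEMP_C:
        prefetch_degree = 1                     # stay conservative under DTM
    elif layer_reuse_distance > 4096:
        prefetch_degree = 8                     # streaming layer: deep prefetch
    else:
        prefetch_degree = 2
    return power_down, prefetch_degree

if __name__ == "__main__":
    tiers = [Tier(0, 83.0, 20.0), Tier(1, 78.0, 900.0), Tier(2, 88.0, 10.0)]
    print(plan_layer(tiers, layer_reuse_distance=8192))
```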
{"title":"NeuroCool: Dynamic Thermal Management of 3D DRAM for Deep Neural Networks through Customized Prefetching","authors":"Shailja Pandey, Lokesh Siddhu, Preeti Ranjan Panda","doi":"10.1145/3630012","DOIUrl":"https://doi.org/10.1145/3630012","url":null,"abstract":"Deep neural network (DNN) implementations are typically characterized by huge data sets and concurrent computation, resulting in a demand for high memory bandwidth due to intensive data movement between processors and off-chip memory. Performing DNN inference on general-purpose cores/ edge is gaining attraction to enhance user experience and reduce latency. The mismatch in the CPU and conventional DRAM speed leads to under utilization of the compute capabilities, causing increased inference time. 3D DRAM is a promising solution to effectively fulfill the bandwidth requirement of high-throughput DNNs. However, due to high power density in stacked architectures, 3D DRAMs need dynamic thermal management (DTM), resulting in performance overhead due to memory-induced CPU throttling. We study the thermal impact of DNN applications running on a 3D DRAM system, and make a case for a memory temperature-aware customized prefetch mechanism to reduce DTM overheads and significantly improve performance. In our proposed NeuroCool DTM policy, we intelligently place either DRAM ranks or tiers in low power state, using the DNN layer characteristics and access rate. We establish the generalization of our approach through training and test data sets comprising diverse data points from widely used DNN applications. Experimental results on popular DNNs show that NeuroCool results in a average performance gain of 44% (as high as 52%) and memory energy improvement of 43% (as high as 69%) over general-purpose DTM policies.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135366784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Monzurul Islam Dewan, Sheng-En David Lin, Dae Hyun Kim
Monolithic three-dimensional (3D) integration allows ultra-thin silicon tiers to be stacked in a single package. This high-density stacking is gaining interest and popularity because it offers a smaller footprint area, shorter wirelength, higher performance, and lower power consumption than conventional planar fabrication technologies. The physical design of monolithic 3D (M3D) integrated circuits (ICs) requires several design steps, such as 3D placement, 3D clock-tree synthesis, 3D routing, and 3D optimization. Among these, 3D routing is significantly time-consuming due to countless routing blockages. Therefore, 3D routers proposed in the literature insert monolithic inter-layer vias (MIVs) and perform tier-by-tier routing in two sub-steps. In this paper, we propose an algorithm to build a routing topology database (DB) used to construct all multilayer monolithic rectilinear Steiner minimum trees (MMRSMTs) on the 3D Hanan grid. To demonstrate the effectiveness of the DB in various applications, we use it to construct timing-driven 3D routing topologies and perform congestion-aware global routing on 3D designs. We anticipate that the algorithm and the DB will help 3D routers reduce the runtime of the MIV insertion step and improve the quality of 3D routing.
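A small sketch of the 3D Hanan grid construction that underlies MMRSMT enumeration: candidate Steiner points are all combinations of the pins' distinct x, y, and tier (z) coordinates. The database construction and tree enumeration in the paper go well beyond this; the pin list below is made up.

```python
from itertools import product

def hanan_grid_3d(pins):
    """pins: iterable of (x, y, z) terminals -> set of candidate grid points."""
    xs = sorted({x for x, _, _ in pins})
    ys = sorted({y for _, y, _ in pins})
    zs = sorted({z for _, _, z in pins})
    return set(product(xs, ys, zs))

if __name__ == "__main__":
    pins = [(0, 0, 0), (4, 1, 1), (2, 3, 0)]   # hypothetical 3-pin net on 2 tiers
    grid = hanan_grid_3d(pins)
    print(len(grid), "candidate points")        # 3 * 3 * 2 = 18
    steiner_candidates = grid - set(pins)       # non-terminal grid points
```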
{"title":"Construction of All Multilayer Monolithic RSMTs and Its Application to Monolithic 3D IC Routing","authors":"Monzurul Islam Dewan, Sheng-En David Lin, Dae Hyun Kim","doi":"10.1145/3626958","DOIUrl":"https://doi.org/10.1145/3626958","url":null,"abstract":"Monolithic three-dimensional (3D) integration allows ultra-thin silicon tier stacking in a single package. The high-density stacking is acquiring interest and is becoming more popular for smaller footprint areas, shorter wirelength, higher performance, and lower power consumption than the conventional planar fabrication technologies. The physical design of monolithic 3D (M3D) integrated circuits (ICs) requires several design steps such as 3D placement, 3D clock-tree synthesis, 3D routing, and 3D optimization. Among these, 3D routing is significantly time-consuming due to countless routing blockages. Therefore, 3D routers proposed in the literature insert monolithic inter-layer vias (MIVs) and perform tier-by-tier routing in two sub-steps. In this paper, we propose an algorithm to build a routing topology database (DB) used to construct all multilayer monolithic rectilinear Steiner minimum trees (MMRSMTs) on the 3D Hanan grid. To demonstrate the effectiveness of the DB in various applications, we use the DB to construct timing-driven 3D routing topologies and perform congestion-aware global routing on 3D designs. We anticipate that the algorithm and the DB will help 3D routers reduce the runtime of the MIV insertion step and improve the quality of the 3D routing.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"254 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136211233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vidya A. Chhabria, Wenjing Jiang, Andrew B. Kahng, Sachin S. Sapatnekar
Due to the unavailability of routing information in design stages prior to detailed routing (DR), the tasks of timing prediction and optimization pose major challenges. Inaccurate timing prediction wastes design effort, hurts circuit performance, and may lead to design failure. This work focuses on timing prediction after clock tree synthesis and placement legalization, which is the earliest opportunity to time and optimize a “complete” netlist. The paper first documents that having “oracle knowledge” of the final post-DR parasitics enables post-global-routing (GR) optimization to produce improved final timing outcomes. To bridge the gap between GR-based parasitic and timing estimation and post-DR results during post-GR optimization, machine learning (ML)-based models are proposed, including the use of features for macro blockages to obtain accurate predictions for designs with macros. A set of experimental evaluations demonstrates that these models show higher accuracy than GR-based timing estimation. When used during post-GR optimization, the ML-based models show demonstrable improvements in post-DR circuit performance. The methodology is applied to two different tool flows – OpenROAD and a commercial tool flow – and results on an open-source 45nm bulk and a commercial 12nm FinFET enablement show improvements in post-DR timing slack metrics without increasing congestion. The models are demonstrated to generalize to designs generated under different clock period constraints and to be robust to training data with small levels of noise.
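A hedged sketch of the kind of model the paper describes: learn to correct GR-based timing estimates toward post-DR values using features that include macro-blockage information. The feature names and the training data here are synthetic placeholders, not the paper's feature set or model choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
# Hypothetical per-net features: GR wirelength, GR-estimated delay, layer mix,
# and the fraction of the net's bounding box covered by macro blockages.
X = np.column_stack([
    rng.uniform(1, 500, n),        # GR wirelength (um)
    rng.uniform(1, 200, n),        # GR-estimated delay (ps)
    rng.uniform(0, 1, n),          # fraction routed on upper metal layers
    rng.uniform(0, 1, n),          # macro-blockage overlap fraction
])
# Synthetic "post-DR" delay: detours around macros inflate the GR estimate.
y = X[:, 1] * (1.0 + 0.6 * X[:, 3]) + rng.normal(0, 2, n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print("predicted post-DR delay (ps):", model.predict(X[:1])[0])
```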
{"title":"A Machine Learning Approach to Improving Timing Consistency between Global Route and Detailed Route","authors":"Vidya A. Chhabria, Wenjing Jiang, Andrew B. Kahng, Sachin S. Sapatnekar","doi":"10.1145/3626959","DOIUrl":"https://doi.org/10.1145/3626959","url":null,"abstract":"Due to the unavailability of routing information in design stages prior to detailed routing (DR), the tasks of timing prediction and optimization pose major challenges. Inaccurate timing prediction wastes design effort, hurts circuit performance, and may lead to design failure. This work focuses on timing prediction after clock tree synthesis and placement legalization, which is the earliest opportunity to time and optimize a “complete” netlist. The paper first documents that having “oracle knowledge” of the final post-DR parasitics enables post-global routing (GR) optimization to produce improved final timing outcomes. To bridge the gap between GR-based parasitic and timing estimation and post-DR results during post-GR optimization , machine learning (ML)-based models are proposed, including the use of features for macro blockages for accurate predictions for designs with macros. Based on a set of experimental evaluations, it is demonstrated that these models show higher accuracy than GR-based timing estimation. When used during post-GR optimization, the ML-based models show demonstrable improvements in post-DR circuit performance. The methodology is applied to two different tool flows – OpenROAD and a commercial tool flow – and results on an open-source 45nm bulk and a commercial 12nm FinFET enablement show improvements in post-DR timing slack metrics without increasing congestion. The models are demonstrated to be generalizable to designs generated under different clock period constraints and are robust to training data with small levels of noise.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136353525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nanlin Guo, Fulin Peng, Jiahe Shi, Fan Yang, Jun Tao, Xuan Zeng
The reliability of circuits is significantly affected by process variations during manufacturing and environmental variations during operation. Current yield optimization algorithms take process variations into consideration to improve circuit reliability. However, the influence of environmental variations (e.g., voltage and temperature variations) is often ignored in current methods because of the high computational cost. In this paper, a novel and efficient approach named BNN-BYO is proposed to optimize the yield of analog circuits over multiple environmental corners. First, we use a Bayesian Neural Network (BNN) to efficiently model the yields and POIs in multiple corners simultaneously. Next, multi-corner yield optimization is performed by embedding the BNN into a Bayesian optimization framework. Since the correlation among yields and POIs in different corners is implicitly encoded in the BNN model, it provides strong modeling capability for yields and their uncertainties, improving the efficiency of yield optimization. Our experimental results demonstrate that the proposed method can save up to 45.3% of simulation cost compared to other baseline methods to achieve the same target yield. In addition, for the same simulation cost, our proposed method can find better design points with 3.2% yield improvement.
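A minimal sketch of the selection step in a multi-corner, model-based yield optimization loop: a surrogate (a BNN in the paper; any model giving a mean and an uncertainty per corner works for the sketch) scores candidate design points, and the next simulation is spent where the optimistic estimate of the worst-corner yield is highest. The surrogate outputs and corner counts below are fabricated.

```python
import numpy as np

def acquisition(mean_yield, std_yield, beta=1.0):
    """mean_yield, std_yield: arrays of shape (n_candidates, n_corners)."""
    # Optimistic (upper-confidence-bound) yield estimate per corner...
    ucb = mean_yield + beta * std_yield
    # ...and a design is only as good as its worst environmental corner.
    return ucb.min(axis=1)

rng = np.random.default_rng(1)
n_candidates, n_corners = 5, 3          # e.g. corners = (voltage, temperature) pairs
mean = rng.uniform(0.7, 0.99, (n_candidates, n_corners))
std = rng.uniform(0.0, 0.05, (n_candidates, n_corners))

scores = acquisition(mean, std)
best = int(np.argmax(scores))
print(f"simulate candidate {best}, score {scores[best]:.3f}")
```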
{"title":"Yield Optimization for Analog Circuits over Multiple Corners via Bayesian Neural Network: Enhancing Circuit Reliability under Environmental Variation","authors":"Nanlin Guo, Fulin Peng, Jiahe Shi, Fan Yang, Jun Tao, Xuan Zeng","doi":"10.1145/3626321","DOIUrl":"https://doi.org/10.1145/3626321","url":null,"abstract":"The reliability of circuits is significantly affected by process variations in manufacturing and environmental variation during operation. Current yield optimization algorithms take process variations into consideration to improve circuit reliability. However, the influence of environmental variations (e.g., voltage and temperature variations) is often ignored in current methods because of the high computational cost. In this paper, a novel and efficient approach named BNN-BYO is proposed to optimize the yield of analog circuits in multiple environmental corners. First, we use a Bayesian Neural Network (BNN) to simultaneously model the yields and POIs in multiple corners efficiently. Next, the multi-corner yield optimization can be performed by embedding BNN into Bayesian optimization framework. Since the correlation among yields and POIs in different corners is implicitly encoded in the BNN model, it provides great modeling capabilities for yields and their uncertainties to improve the efficiency of yield optimization. Our experimental results demonstrate that the proposed method can save up to 45.3% of simulation cost compared to other baseline methods to achieve the same target yield. In addition, for the same simulation cost, our proposed method can find better design points with 3.2% yield improvement.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135347477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Paul E. Calzada, Md Sami Ul Islam Sami, Kimia Zamiri Azar, Fahim Rahman, Farimah Farahmandi, Mark Tehranipoor
Over the past few decades, electronics have become commonplace in government, commercial, and social domains. These devices have developed rapidly, as seen in the prevalent use of Systems on Chip (SoCs) rather than separate integrated circuits on a single circuit board. As the semiconductor community begins conversations over the end of Moore’s Law, an approach that further increases both functionality per area and yield by placing dies with segregated functionality on a common interposer die, labeled a System in Package (SiP), is gaining attention. Thus, the chiplet and SiP space has grown to meet this demand, creating a new packaging paradigm, Advanced Packaging, and a new supply chain. This new distributed supply chain, with multiple chiplet developers and foundries, has increased counterfeit vulnerabilities. Chiplets are currently available on an open market, and their origin and authenticity are consequently difficult to ascertain. With this lack of control over the stages of the supply chain, counterfeit threats manifest at the chiplet, interposer, and SiP levels. In this paper, we identify counterfeit threats in the SiP domain and propose a mitigating framework utilizing blockchain for the effective traceability of SiPs to establish provenance. Our framework utilizes the Chiplet Hardware Security Module (CHSM) to authenticate a SiP throughout its life. To accomplish this, we leverage SiP information including Electronic Chip IDs (ECIDs) of chiplets, Combating Die and IC Recycling (CDIR) sensor information, documentation, test patterns and/or electrical measurements, grade, and part number of the SiP. We detail the structure of the blockchain and establish protocols for both enrolling trusted information into the blockchain network and authenticating the SiP. Our framework mitigates SiP counterfeit threats including recycled, remarked, and cloned SiPs, overproduced interposers, forged documentation, and substituted chiplets, while also detecting out-of-spec and defective SiPs.
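An illustrative sketch of the kind of enrollment record such a traceability blockchain could hold for a SiP: chiplet ECIDs, a CDIR sensor reading, grade, part number, and a test-data digest, hash-chained to the previous block. The field names, identifiers, and chain structure here are assumptions for illustration, not the paper's exact record format or consensus protocol.

```python
import hashlib
import json
import time

def make_block(prev_hash, sip_record):
    """Build a block whose hash covers its payload and the previous block's hash."""
    body = {"timestamp": time.time(), "prev_hash": prev_hash, "record": sip_record}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {"hash": digest, **body}

genesis = make_block("0" * 64, {"event": "genesis"})
enroll = make_block(genesis["hash"], {
    "event": "enrollment",
    "sip_part_number": "SIP-1234-A",             # hypothetical identifiers
    "grade": "industrial",
    "chiplet_ecids": ["ECID-01", "ECID-02"],
    "cdir_reading": 0.02,
    "test_pattern_digest": hashlib.sha256(b"golden responses").hexdigest(),
})

# Verification at a later life-cycle stage: recompute the hash and check linkage.
recomputed = hashlib.sha256(json.dumps(
    {k: enroll[k] for k in ("timestamp", "prev_hash", "record")},
    sort_keys=True).encode()).hexdigest()
assert recomputed == enroll["hash"] and enroll["prev_hash"] == genesis["hash"]
print("enrollment block verified")
```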
{"title":"Heterogeneous Integration Supply Chain Integrity through Blockchain and CHSM","authors":"Paul E. Calzada, Md Sami Ul Islam Sami, Kimia Zamiri Azar, Fahim Rahman, Farimah Farahmandi, Mark Tehranipoor","doi":"10.1145/3625823","DOIUrl":"https://doi.org/10.1145/3625823","url":null,"abstract":"Over the past few decades, electronics have become commonplace in government, commercial, and social domains. These devices have developed rapidly, as seen in the prevalent use of System on Chips (SoCs) rather than separate integrated circuits on a single circuit board. As the semiconductor community begins conversations over the end of Moore’s Law, an approach to further increase both functionality per area and yield using segregated functionality dies on a common interposer die, labeled a System in Package (SiP), is gaining attention. Thus, the chiplet and SiP space has grown to meet this demand, creating a new packaging paradigm, Advanced Packaging, and a new supply chain. This new distributed supply chain with multiple chiplet developers and foundries has augmented counterfeit vulnerabilities. Chiplets are currently available on an open market, and their origin and authenticity consequently are difficult to ascertain. With this lack of control over the stages of the supply chain, counterfeit threats manifest at the chiplet, interposer, and SiP levels. In this paper, we identify counterfeit threats in the SiP domain, and we propose a mitigating framework utilizing blockchain for the effective traceability of SiPs to establish provenance. Our framework utilizes the Chiplet Hardware Security Module (CHSM) to authenticate a SiP throughout its life. To accomplish this, we leverage SiP information including Electronic Chip IDs (ECIDs) of chiplets, Combating Die and IC Recycling (CDIR) sensor information, documentation, test patterns and/or electrical measurements, grade, and part number of the SiP. We detail the structure of the blockchain and establish protocols for both enrolling trusted information into the blockchain network and authenticating the SiP. Our framework mitigates SiP counterfeit threats including recycled, remarked, cloned, overproduced interposer, forged documentation, and substituted chiplet while detecting of out-of-spec and defective SiPs.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135347475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Martin Rapp, Heba Khdr, Nikita Krohmer, Jörg Henkel
Thermal optimization of a heterogeneous clustered multi-core processor under user-defined quality of service (QoS) targets requires application migration and dynamic voltage and frequency scaling (DVFS). However, selecting the core to execute each application and the voltage/frequency (V/f) levels of each cluster is a complex problem because 1) the diverse characteristics and QoS targets of applications require different optimizations, and 2) per-cluster DVFS requires a global optimization considering all running applications. State-of-the-art resource management for power or temperature minimization either relies on measurements that are commonly not available (such as power) or fails to consider all the dimensions of the optimization (e.g., by using simplified analytical models). To solve this, machine learning (ML) methods can be employed. In particular, imitation learning (IL) leverages the optimality of an oracle policy, yet at low run-time overhead, by training a model from oracle demonstrations. We are the first to employ IL for temperature minimization under QoS targets. We tackle the complexity by training a neural network (NN) at design time and accelerating the run-time NN inference using a neural processing unit (NPU). While such NN accelerators are becoming increasingly widespread, they have so far only been used to accelerate user applications. In contrast, we use, for the first time, an existing accelerator on a real platform to accelerate NN-based resource management. To show the superiority of IL over reinforcement learning (RL) for our targeted problem, we also develop a multi-agent RL-based management technique. Our evaluation on a HiKey 970 board with an Arm big.LITTLE CPU and NPU shows that our technique, TOP-IL, achieves significant temperature reductions at negligible run-time overhead. We compare TOP-IL against several techniques. Compared to the ondemand Linux governor, TOP-IL reduces the average temperature by up to 17 °C, with minimal QoS violations for both techniques. Compared to the RL policy, TOP-IL achieves 63% to 89% fewer QoS violations while resulting in similar average temperatures. Moreover, TOP-IL outperforms the RL policy in terms of stability. We additionally show that our IL-based technique also generalizes to different software (unseen applications) and even hardware (different cooling) than used for training.
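A compact sketch of the imitation-learning idea: an oracle policy (available only at design time, e.g. via exhaustive search in simulation) labels system states with the best (cluster, V/f-level) action, and a small neural network is trained to imitate it so that inference is cheap enough to run online (on an NPU in the paper's setting). The state encoding, the toy oracle, and the three-action set below are made up.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def oracle_action(state):
    """Toy stand-in for the design-time oracle: state = (temp_big, temp_little,
    app_load). Prefer the LITTLE cluster / lower V/f when the chip is hot."""
    temp_big, temp_little, app_load = state
    if temp_big > 70 and app_load < 0.5:
        return 0                      # action 0: LITTLE cluster, low V/f
    if app_load > 0.8:
        return 2                      # action 2: big cluster, high V/f
    return 1                          # action 1: big cluster, mid V/f

# Design time: collect oracle demonstrations and train the imitation policy.
states = rng.uniform([40, 40, 0], [90, 90, 1], size=(5000, 3))
actions = np.array([oracle_action(s) for s in states])
policy = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500,
                       random_state=0).fit(states, actions)

# Run time: one cheap forward pass per decision epoch.
print(policy.predict([[85.0, 60.0, 0.3]]))   # hot big cluster, light app
```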
{"title":"NPU-Accelerated Imitation Learningfor Thermal Optimizationof QoS-Constrained Heterogeneous Multi-Cores","authors":"Martin Rapp, Heba Khdr, Nikita Krohmer, Jörg Henkel","doi":"10.1145/3626320","DOIUrl":"https://doi.org/10.1145/3626320","url":null,"abstract":"Thermal optimization of a heterogeneous clustered multi-core processor under user-defined quality of service (QoS) targets requires application migration and dynamic voltage and frequency scaling (DVFS). However, selecting the core to execute each application and the voltage/frequency (V/f) levels of each cluster is a complex problem because 1) the diverse characteristics and QoS targets of applications require different optimizations, and 2) per-cluster DVFS requires a global optimization considering all running applications. State-of-the-art resource management for power or temperature minimization either relies on measurements that are commonly not available (such as power) or fails to consider all the dimensions of the optimization (e.g., by using simplified analytical models). To solve this, machine learning (ML) methods can be employed. In particular, imitation learning (IL) leverages the optimality of an oracle policy, yet at low run-time overhead, by training a model from oracle demonstrations. We are the first to employ IL for temperature minimization under QoS targets. We tackle the complexity by training neural network (NN) at design time and accelerate the run-time NN inference using a neural processing unit (NPU). While such NN accelerators are becoming increasingly widespread, they are so far only used to accelerate user applications. In contrast, we use for the first time an existing accelerator on a real platform to accelerate NN-based resource management. To show the superiority of IL compared to reinforcement learning (RL) in our targeted problem, we also develop multi-agent RL-based management. Our evaluation on a HiKey 970 board with an Arm big.LITTLE CPU and NPU shows that IL achieves significant temperature reductions at a negligible run-time overhead. We compare TOP-IL against several techniques. Compared to ondemand Linux governor, TOP-IL reduces the average temperature by up to 17 °C at minimal QoS violations for both techniques. Compared to the RL policy, our TOP-IL achieves 63 % to 89 % fewer QoS violations while resulting similar average temperatures. Moreover, TOP-IL outperforms the RL policy in terms of stability. We additionally show that our IL-based technique also generalizes to different software (unseen applications) and even hardware (different cooling) than used for training.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135483021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wenxiong Lin, Haojie Wu, Peng Gao, Wenjun Luo, Shuting Cai, Xiaoming Xiong
Multi-FPGA systems are widely used in various circuit-design-related areas, such as hardware emulation, virtual prototyping, and chiplet design methodologies. However, a physical resource clash between inter-FPGA signals and I/O pins can create a bottleneck in a multi-FPGA system: inter-FPGA signals often outnumber I/O pins. To solve this problem, time-division multiplexing (TDM) is introduced. However, the undue time delay caused by TDM may impair the performance of a multi-FPGA system, so a more efficient TDM solution is needed. In this work, we propose a new routing sequence strategy to improve the efficiency of TDM. Our strategy consists of two parts: a weighted routing algorithm and TDM assignment optimization. The algorithm takes the weight of each net into account to generate a high-quality routing topology. Then, a net-based TDM assignment is performed to obtain a lower TDM ratio for the multi-FPGA system. Experiments on the public dataset of the CAD Contest 2019 at ICCAD show that our routing sequence strategy achieves good results. Especially in the testcases with unbalanced designs, the performance of multi-FPGA systems was improved by up to 2.63. Moreover, we outperformed the top two contest finalists in terms of TDM results in most of the testcases.
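A small sketch of net-based TDM assignment between one FPGA pair: when the signals crossing the pair outnumber the physical wires, signals sharing a wire get a time-division-multiplexing ratio. The even-ratio rule mirrors the contest-style formulation the paper targets (assumed here), while the grouping heuristic, weights, and wire counts are purely illustrative.

```python
import math

def assign_tdm(signals, num_wires):
    """signals: list of (net_id, weight). Returns {net_id: tdm_ratio}."""
    order = sorted(signals, key=lambda s: -s[1])          # critical nets first
    k, r = divmod(len(order), num_wires)
    # r wires carry k+1 signals, the remaining wires carry k. Fill the
    # lightly-loaded wires with the most critical nets so they get the
    # smallest (or no) multiplexing ratio.
    group_sizes = sorted([k + 1] * r + [k] * (num_wires - r))
    ratios, idx = {}, 0
    for size in group_sizes:
        # One signal per wire needs no multiplexing; otherwise round the
        # sharing degree up to an even TDM ratio.
        ratio = 1 if size <= 1 else 2 * math.ceil(size / 2)
        for net_id, _ in order[idx:idx + size]:
            ratios[net_id] = ratio
        idx += size
    return ratios

if __name__ == "__main__":
    sigs = [(f"n{i}", w) for i, w in enumerate([5, 5, 3, 2, 2, 1, 1])]
    print(assign_tdm(sigs, num_wires=3))
```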
{"title":"Sequential Routing-Based Time-Division Multiplexing Optimization for Multi-FPGA Systems","authors":"Wenxiong Lin, Haojie Wu, Peng Gao, Wenjun Luo, Shuting Cai, Xiaoming Xiong","doi":"10.1145/3626322","DOIUrl":"https://doi.org/10.1145/3626322","url":null,"abstract":"Multi-FPGA systems are widely used in various circuit design-related areas, such as hardware emulation, virtual prototypes, and chiplet design methodologies. However, a physical resource clash between inter-FPGA signals and I/O pins can create a bottleneck in a multi-FPGA system. Specifically, inter-FPGA signals often outnumber I/O pins in a multi-FPGA system. To solve this problem, time-division multiplexing (TDM) is introduced. However, undue time delay caused by TDM may impair the performance of a multi-FPGA system. Therefore, a more efficient TDM solution is needed. In this work, we propose a new routing sequence strategy to improve the efficiency of TDM. Our strategy consists of two parts: a weighted routing algorithm and TDM assignment optimization. The algorithm takes into account the weight of the net to generate a high-quality routing topology. Then, a net-based TDM assignment is performed to obtain a lower TDM ratio for the multi-FPGA system. Experiments on the public dataset of CAD Contest 2019 at ICCAD showed that our routing sequence strategy achieved good results. Especially in those testcases of unbalanced designs, the performance of multi-FPGA systems was improved up to 2.63. Moreover, we outperformed the top two contest finalists as to TDM results in most of the testcases.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"435 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135482532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enes Sağlıcan, Engin Afacan
Thanks to the enhanced computational capacity of modern computers, even sophisticated analog/RF circuit sizing problems can be solved via electronic design automation (EDA) tools. Recently, several analog/RF circuit optimization algorithms have been successfully applied to automate the analog/RF circuit design process. Conventionally, metaheuristic algorithms are widely used in the optimization process. Among various nature-inspired algorithms, evolutionary algorithms (EAs) have been preferred due to their advantages (robustness, efficiency, accuracy, etc.) over other algorithms. Furthermore, EAs have diversified, and several distinguished analog/RF circuit optimization approaches for single-, multi-, and many-objective problems have been reported in the literature. However, there are conflicting claims about the performance of these algorithms, and no objective performance comparison has been presented yet. In previous work, only a few case-study circuits have been used to demonstrate the superiority of the utilized algorithm, so comparisons have been limited to those specific circuits. The underlying reason is that the literature lacks a generic benchmark for the analog/RF circuit sizing problem. To address these issues, we propose a comprehensive comparison of the two most popular evolutionary computation algorithms, namely the Non-dominated Sorting Genetic Algorithm II (NSGA-II) and the Multi-Objective Evolutionary Algorithm based on Decomposition (MOEA/D), in this paper. For that purpose, we introduce two ad-hoc testbenches for analog (ANLG) and radio frequency (RF) circuits including common building blocks. The comparison is made in both the multi- and many-objective domains, and the performance of the algorithms is quantitatively revealed through well-known Pareto-optimal front quality metrics.
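A minimal sketch of one common Pareto-front quality metric used in such comparisons: the 2-D hypervolume indicator for a minimization problem, where a larger dominated area (bounded by a reference point) means a better front. The two example "fronts" and the reference point are fabricated; real studies typically also report IGD, spread, and similar metrics.

```python
def hypervolume_2d(front, ref):
    """front: list of (f1, f2) to minimize; ref: reference point worse than all."""
    # Keep only non-dominated points.
    nd = [p for p in front
          if not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in front)]
    nd.sort()                                   # ascending in f1 -> descending in f2
    hv, pts = 0.0, nd + [(ref[0], None)]
    for (f1, f2), (next_f1, _) in zip(nd, pts[1:]):
        hv += (next_f1 - f1) * (ref[1] - f2)    # strip between consecutive points
    return hv

# Hypothetical trade-off fronts (e.g., -gain vs. power) returned by two optimizers.
front_a = [(1.0, 8.0), (2.0, 5.0), (4.0, 3.0)]
front_b = [(1.5, 7.0), (3.0, 4.5), (5.0, 4.0)]
ref = (6.0, 10.0)
print("A:", hypervolume_2d(front_a, ref), "B:", hypervolume_2d(front_b, ref))
```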
{"title":"MOEA/D vs. NSGA-II: A Comprehensive Comparison for Multi/Many Objective Analog/RF Circuit Optimization Through A Generic Benchmark","authors":"Enes Sağlıcan, Engin Afacan","doi":"10.1145/3626096","DOIUrl":"https://doi.org/10.1145/3626096","url":null,"abstract":"Thanks to the enhanced computational capacity of modern computers, even sophisticated analog/RF circuit sizing problems can be solved via electronic design automation (EDA) tools. Recently, several analog/RF circuit optimization algorithms have been successfully applied to automatize the analog/RF circuit design process. Conventionally, metaheuristic algorithms are widely used in optimization process. Among various nature-inspired algorithms, evolutionary algorithms (EAs) have been more preferred due to their superiorities (robustness, efficiency, accuracy etc.) over the other algorithms. Furthermore, EAs have been diversified and several distinguished analog/RF circuit optimization approaches for single-, multi-, and many- objective problems have been reported in the literature. However, there are conflicting claims on the performance of these algorithms and no objective performance comparison has been revealed yet. In the previous work, only a few case study circuits have been under test to demonstrate the superiority of the utilized algorithm, so a limited comparison has been made for only these specific circuits. The underlying reason is that the literature lacks a generic benchmark for analog/RF circuit sizing problem. To address these issues, we propose a comprehensive comparison of the most popular two evolutionary computation algorithms, namely Non-Sorting Genetic Algorithm-II (NSGA-II) and Multi-Objective Evolutionary Algorithm based Decomposition (MOEA/D), in this paper. For that purpose, we introduce two ad-hoc testbenches for analog (ANLG) and radio frequency (RF) circuits including the common building blocks. The comparison has been made at both multi- and many- objective domains and the performances of algorithms have been quantitatively revealed through the well-known Pareto-optimal front quality metrics.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135385291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bo Ding, Jinglei Huang, Junpeng Wang, Qi Xu, Song Chen, Yi Kang
Some field programmable gate arrays (FPGAs) can be partially dynamically reconfigured, with heterogeneous resources distributed on the chip. An FPGA-based partially dynamically reconfigurable system (FPGA-PDRS) can be used to accelerate computing and improve computing flexibility. However, FPGA-PDRS design has traditionally been manual. Automating FPGA-PDRS design requires solving the problems of task module partitioning, scheduling, and floorplanning on heterogeneous resources. Existing works only partly solve the problems of the FPGA-PDRS automation process or model only homogeneous resources. To better solve these problems and narrow the gap between algorithm and application, in this paper we propose a complete workflow including three parts: pre-processing, which generates lists of candidate task module shapes according to the resource requirements; an exploration process, which searches for a solution to task module partitioning, scheduling, and floorplanning; and post-optimization, which improves the floorplan success rate. Experimental results show that, compared with state-of-the-art work, the pre-processing step reduces the occupied area of task modules by 6% on average, and the proposed complete workflow improves performance by 9.6% and reduces communication cost by 14.2% while improving the reuse rate of the heterogeneous resources on the chip. Based on the solution generated by the exploration process, the post-optimization process improves the floorplan success rate by 11%.
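A simplified sketch of the pre-processing idea: enumerate candidate rectangular shapes (column span x height) for one task module on a column-based heterogeneous fabric, keeping the shapes that cover the module's CLB/BRAM/DSP requirement. The fabric layout, per-tile capacities, and the requirement are invented for illustration and do not reflect the paper's exact model.

```python
from collections import Counter

# Hypothetical fabric: column type per x position; each tile row of a column
# provides one unit of its resource type.
FABRIC_COLUMNS = ["CLB", "CLB", "BRAM", "CLB", "DSP", "CLB", "CLB", "BRAM"]
FABRIC_HEIGHT = 20                      # tile rows

def candidate_shapes(requirement, max_area=None):
    """requirement: e.g. {'CLB': 30, 'BRAM': 8, 'DSP': 4} -> list of
    (start_col, width, height) shapes that satisfy it."""
    shapes = []
    n = len(FABRIC_COLUMNS)
    for start in range(n):
        for width in range(1, n - start + 1):
            per_row = Counter(FABRIC_COLUMNS[start:start + width])
            for height in range(1, FABRIC_HEIGHT + 1):
                have = {r: per_row.get(r, 0) * height for r in requirement}
                if all(have[r] >= requirement[r] for r in requirement):
                    if max_area is None or width * height <= max_area:
                        shapes.append((start, width, height))
                    break               # taller shapes are dominated; stop here
    return shapes

if __name__ == "__main__":
    req = {"CLB": 30, "BRAM": 8, "DSP": 4}
    for s in candidate_shapes(req, max_area=120):
        print(s)
```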
{"title":"Task modules Partitioning, Scheduling and Floorplanning for Partially Dynamically Reconfigurable Systems with Heterogeneous Resources","authors":"Bo Ding, Jinglei Huang, Junpeng Wang, Qi Xu, Song Chen, Yi Kang","doi":"10.1145/3625295","DOIUrl":"https://doi.org/10.1145/3625295","url":null,"abstract":"Some field programmable gate arrays (FPGAs) can be partially dynamically reconfigurable with heterogeneous resources distributed on the chip. FPGA-based partially dynamically reconfigurable system (FPGA-PDRS) can be used to accelerate computing and improve computing flexibility. However, the traditional design of FPGA-PDRS is based on manual design. Implementing the automation of FPGA-PDRS needs to solve the problems of task modules partitioning, scheduling, and floorplanning on heterogeneous resources. Existing works only partly solve problems for the automation process of FPGA-PDRS or model homogeneous resource for FPGA-PDRS. To better solve the problems in the automation process of FPGA-PDRS and narrow the gap between algorithm and application, in this paper, we propose a complete workflow including three parts: pre-processing to generate the lists of task module candidate shapes according to the resource requirements, exploration process to search the solution of task modules partitioning, scheduling, and floorplanning, and post-optimization to improve the floorplan success rate. Experimental results show that, compared with state-of-the-art work, the pre-processing process can reduce the occupied area of task modules by 6% on average; the proposed complete workflow can improve performance by 9.6%, and reduce communication cost by 14.2% with improving the resources reuse rate of the heterogeneous resources on the chip. Based on the solution generated by the exploration process, the post-optimization process can improve the floorplan success rate by 11%.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134958022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rihui Sun, Pengfei Qiu, Yongqiang Lyu, Jian Dong, Haixia Wang, Dongsheng Wang, Gang Qu
Graphics Processing Units (GPUs) are widely used as deep learning accelerators because of their high performance and low power consumption. They have also remained secure against hardware-induced transient fault injection attacks, a classic type of attack developed on other computing platforms. In this work, we demonstrate that well-trained machine learning models are robust against hardware fault injection attacks when the faults are generated randomly. However, we discover that these models have components, which we refer to as sensitive targets, that are vulnerable to faults. By exploiting this vulnerability, we propose the Lightning attack, which precisely strikes a model's sensitive targets with hardware-induced transient faults based on Dynamic Voltage and Frequency Scaling (DVFS). We design a sensitive-target search algorithm to find the most critical processing units of Deep Neural Network (DNN) models that determine the inference results, and develop a genetic algorithm to automatically optimize the attack parameters for DVFS to induce faults. Experiments on three commodity Nvidia GPUs with four widely used DNN models show that the proposed Lightning attack can reduce inference accuracy by 69.1% on average for non-targeted attacks and, more interestingly, achieve a success rate of 67.9% for targeted attacks.
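A small sketch illustrating the robustness observation above: randomly flipping a few low mantissa bits in the weights of a toy (untrained, synthetic) network barely changes its outputs, which is why random fault injection is a weak attack and targeted analysis is needed instead. The model, data, and fault counts are synthetic; this is a robustness evaluation, not the paper's attack.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(64, 32)), rng.normal(size=(32, 10))
x = rng.normal(size=(100, 64))

def forward(w1, w2):
    h = np.maximum(x @ w1, 0.0)             # ReLU hidden layer
    return (h @ w2).argmax(axis=1)          # predicted class per sample

def flip_random_bits(w, n_faults, bit=10):
    """Flip a low mantissa bit of n_faults randomly chosen float32 weights."""
    flat = w.astype(np.float32).ravel().copy()
    idx = rng.choice(flat.size, n_faults, replace=False)
    as_int = flat.view(np.uint32)
    as_int[idx] ^= np.uint32(1 << bit)
    return as_int.view(np.float32).reshape(w.shape).astype(w.dtype)

baseline = forward(W1, W2)
faulty = forward(flip_random_bits(W1, n_faults=50), W2)
print("fraction of predictions changed:", np.mean(baseline != faulty))
```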
{"title":"Lightning: Leveraging DVFS-induced Transient Fault Injection to Attack Deep Learning Accelerator of GPUs","authors":"Rihui sun, Pengfei Qiu, Yongqiang Lyu, Jian Dong, Haixia Wang, Dongsheng Wang, Gang Qu","doi":"10.1145/3617893","DOIUrl":"https://doi.org/10.1145/3617893","url":null,"abstract":"Graphics Processing Units(GPU) are widely used as deep learning accelerators because of its high performance and low power consumption. Additionally, it remains secure against hardware-induced transient fault injection attacks, a classic type of attacks that have been developed on other computing platforms. In this work, we demonstrate that well-trained machine learning models are robust against hardware fault injection attacks when the faults are generated randomly. However, we discover that these models have components, which we refer to as sensitive targets, that are vulnerable to faults. By exploiting this vulnerability, we propose the Lightning attack, which precisely strikes the model’s sensitive targets with hardware-induced transient faults based on the Dynamic Voltage and Frequency Scaling (DVFS). We design a sensitive targets search algorithm to find the most critical processing units of Deep Neural Network(DNN) models determining the inference results, and develop a genetic algorithm to automatically optimize the attack parameters for DVFS to induce faults. Experiments on three commodity Nvidia GPUs for four widely-used DNN models show that the proposed Lightning attack can reduce the inference accuracy by 69.1% on average for non-targeted attacks, and, more interestingly, achieve a success rate of 67.9% for targeted attacks.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136308878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}