Mathematical Framework for Optimizing Crossbar Allocation for ReRAM-based CNN Accelerators
Wanqian Li, Yinhe Han, Xiaoming Chen
ACM Transactions on Design Automation of Electronic Systems. Published: 2023-11-02. DOI: https://doi.org/10.1145/3631523

Resistive random-access memory (ReRAM) has been widely used to accelerate convolutional neural networks (CNNs) thanks to its analog in-memory computing capability. ReRAM crossbars not only store layers’ weights but also perform in-situ matrix-vector multiplications, the core operations of CNNs. To boost the performance of ReRAM-based CNN accelerators, crossbars can be duplicated to exploit more intra-layer parallelism. The crossbar allocation scheme can significantly influence both the computing throughput and the bandwidth requirements of ReRAM-based CNN accelerators. Under resource constraints (i.e., limited crossbars and memory bandwidth), how to find the optimal number of crossbars for each layer to maximize the inference performance of an entire CNN is an unsolved problem. In this work, we find the optimal crossbar allocation scheme by mathematically modeling the problem as a constrained optimization problem and solving it with a dynamic-programming-based solver. Experiments demonstrate that our model for CNN inference time is highly accurate, and the proposed framework obtains solutions with near-optimal inference time. We also emphasize that communication (i.e., data access) is an important factor that must be considered when determining the optimal crossbar allocation scheme.

Optimal Model Partitioning with Low-Overhead Profiling on the PIM-based Platform for Deep Learning Inference
Seok Young Kim, Jaewook Lee, Yoonah Paik, Chang Hyun Kim, Won Jun Lee, Seon Wook Kim
ACM Transactions on Design Automation of Electronic Systems. Published: 2023-11-01. DOI: https://doi.org/10.1145/3628599

Recently, Processing-in-Memory (PIM) has become a promising solution for energy-efficient computation in data-intensive applications by placing computation near or inside memory. In most Deep Learning (DL) frameworks, a user manually partitions a model’s computational graph (CG) onto the computing devices by considering the devices’ capabilities and the data transfers. Deep Neural Network (DNN) models have become increasingly complex to improve accuracy; thus, it is exceptionally challenging to partition the execution for the best performance, especially on a PIM-based platform that requires frequent offloading of large amounts of data. This paper proposes two novel algorithms for DL inference to resolve this challenge: low-overhead profiling and optimal model partitioning. First, we reconstruct the CG by considering the devices’ capabilities so that it represents all possible scheduling paths. Second, we develop a profiling algorithm that finds the minimum set of profiling paths required to measure all node and edge costs of the reconstructed CG. Finally, we devise a model partitioning algorithm that obtains the minimum execution time via dynamic programming with the profiled data. We evaluated our work by executing the BERT, RoBERTa, and GPT-2 models with various sequence lengths on ARM multicores with a PIM-modeled FPGA platform. For the platform’s three computing devices, i.e., CPU serial, CPU parallel, and PIM execution, we could obtain all costs in only four profiling runs: three for node costs and one for edge costs. Also, our model partitioning algorithm achieved the highest performance in all experiments, outperforming execution with manually assigned device priorities and a state-of-the-art greedy approach.

Security of Electrical, Optical and Wireless On-Chip Interconnects: A Survey
Hansika Weerasena, Prabhat Mishra
ACM Transactions on Design Automation of Electronic Systems. Published: 2023-10-30. DOI: https://doi.org/10.1145/3631117

The advancement of manufacturing technologies has enabled the integration of more intellectual property (IP) cores on the same system-on-chip (SoC). Scalable and high-throughput on-chip communication architectures have become a vital component of today’s SoCs. Diverse technologies such as electrical, wireless, optical, and hybrid are available for on-chip communication, with different architectures supporting them. The on-chip communication subsystem is shared across all IPs and used continuously throughout the lifetime of the SoC. Therefore, the security of on-chip communication is crucial, because exploiting any vulnerability in it would be a goldmine for an attacker. In this survey, we provide a comprehensive review of threat models, attacks, and countermeasures across diverse on-chip communication technologies as well as sophisticated architectures.

BOOM-Explorer: RISC-V BOOM Microarchitecture Design Space Exploration
Chen Bai, Qi Sun, Jianwang Zhai, Yuzhe Ma, Bei Yu, Martin D.F. Wong
ACM Transactions on Design Automation of Electronic Systems. Published: 2023-10-26. DOI: https://doi.org/10.1145/3630013

Microarchitecture parameter tuning is critical in the microprocessor design cycle. It is a non-trivial design space exploration (DSE) problem due to the large solution space, cycle-accurate simulators’ modeling inaccuracy, and the high simulation runtime of performance evaluations. Previous methods require either massive expert effort to construct interpretable equations or high computing resources to train black-box prediction models. This paper follows the black-box methods because they generally achieve better solution quality than analytical methods. We summarize two learned lessons and propose BOOM-Explorer accordingly. First, embedding microarchitecture domain knowledge in the DSE improves the solution quality. Second, BOOM-Explorer makes microarchitecture DSE for register-transfer-level designs feasible within a limited time budget. We further enhance BOOM-Explorer with diversity guidance, improving the algorithm’s performance. Experimental results with the RISC-V Berkeley Out-of-Order Machine under a 7-nm technology show that our proposed methodology achieves, on average, 18.75% higher Pareto hypervolume, 35.47% less average distance to the reference set, and 65.38% less overall running time compared to previous approaches.

NeuroCool: Dynamic Thermal Management of 3D DRAM for Deep Neural Networks through Customized Prefetching
Shailja Pandey, Lokesh Siddhu, Preeti Ranjan Panda
ACM Transactions on Design Automation of Electronic Systems. Published: 2023-10-23. DOI: https://doi.org/10.1145/3630012

Deep neural network (DNN) implementations are typically characterized by huge data sets and concurrent computation, resulting in a demand for high memory bandwidth due to intensive data movement between processors and off-chip memory. Performing DNN inference on general-purpose cores/edge devices is gaining traction to enhance user experience and reduce latency. The mismatch between CPU and conventional DRAM speeds leads to underutilization of compute capabilities, increasing inference time. 3D DRAM is a promising solution to effectively fulfill the bandwidth requirements of high-throughput DNNs. However, due to the high power density of stacked architectures, 3D DRAMs need dynamic thermal management (DTM), which incurs performance overhead due to memory-induced CPU throttling. We study the thermal impact of DNN applications running on a 3D DRAM system and make a case for a memory temperature-aware customized prefetch mechanism that reduces DTM overheads and significantly improves performance. In our proposed NeuroCool DTM policy, we intelligently place either DRAM ranks or tiers in a low-power state based on the DNN layer characteristics and access rates. We establish the generalization of our approach through training and test data sets comprising diverse data points from widely used DNN applications. Experimental results on popular DNNs show that NeuroCool yields an average performance gain of 44% (as high as 52%) and a memory energy improvement of 43% (as high as 69%) over general-purpose DTM policies.

Construction of All Multilayer Monolithic RSMTs and Its Application to Monolithic 3D IC Routing
Monzurul Islam Dewan, Sheng-En David Lin, Dae Hyun Kim
ACM Transactions on Design Automation of Electronic Systems. Published: 2023-10-11. DOI: https://doi.org/10.1145/3626958

Monolithic three-dimensional (3D) integration allows ultra-thin silicon tiers to be stacked in a single package. This high-density stacking is gaining interest and becoming more popular because it offers smaller footprint areas, shorter wirelength, higher performance, and lower power consumption than conventional planar fabrication technologies. The physical design of monolithic 3D (M3D) integrated circuits (ICs) involves several steps such as 3D placement, 3D clock-tree synthesis, 3D routing, and 3D optimization. Among these, 3D routing is particularly time-consuming due to countless routing blockages. Therefore, 3D routers proposed in the literature insert monolithic inter-layer vias (MIVs) and perform tier-by-tier routing in two sub-steps. In this paper, we propose an algorithm to build a routing topology database (DB) used to construct all multilayer monolithic rectilinear Steiner minimum trees (MMRSMTs) on the 3D Hanan grid. To demonstrate the effectiveness of the DB in various applications, we use it to construct timing-driven 3D routing topologies and perform congestion-aware global routing on 3D designs. We anticipate that the algorithm and the DB will help 3D routers reduce the runtime of the MIV insertion step and improve the quality of 3D routing.

A Machine Learning Approach to Improving Timing Consistency between Global Route and Detailed Route
Vidya A. Chhabria, Wenjing Jiang, Andrew B. Kahng, Sachin S. Sapatnekar
ACM Transactions on Design Automation of Electronic Systems. Published: 2023-10-10. DOI: https://doi.org/10.1145/3626959

Because routing information is unavailable in design stages prior to detailed routing (DR), timing prediction and optimization pose major challenges. Inaccurate timing prediction wastes design effort, hurts circuit performance, and may lead to design failure. This work focuses on timing prediction after clock-tree synthesis and placement legalization, which is the earliest opportunity to time and optimize a “complete” netlist. The paper first documents that having “oracle knowledge” of the final post-DR parasitics enables post-global-routing (GR) optimization to produce improved final timing outcomes. To bridge the gap between GR-based parasitic and timing estimation and post-DR results during post-GR optimization, machine learning (ML)-based models are proposed, including features for macro blockages that yield accurate predictions for designs with macros. A set of experimental evaluations demonstrates that these models are more accurate than GR-based timing estimation. When used during post-GR optimization, the ML-based models show demonstrable improvements in post-DR circuit performance. The methodology is applied to two different tool flows – OpenROAD and a commercial tool flow – and results on an open-source 45nm bulk and a commercial 12nm FinFET enablement show improvements in post-DR timing slack metrics without increasing congestion. The models generalize to designs generated under different clock period constraints and are robust to training data with small levels of noise.

Yield Optimization for Analog Circuits over Multiple Corners via Bayesian Neural Network: Enhancing Circuit Reliability under Environmental Variation
Nanlin Guo, Fulin Peng, Jiahe Shi, Fan Yang, Jun Tao, Xuan Zeng
ACM Transactions on Design Automation of Electronic Systems. Published: 2023-10-06. DOI: https://doi.org/10.1145/3626321

The reliability of circuits is significantly affected by process variations during manufacturing and environmental variations during operation. Current yield optimization algorithms take process variations into consideration to improve circuit reliability. However, the influence of environmental variations (e.g., voltage and temperature variations) is often ignored because of its high computational cost. In this paper, a novel and efficient approach named BNN-BYO is proposed to optimize the yield of analog circuits over multiple environmental corners. First, we use a Bayesian Neural Network (BNN) to efficiently model the yields and POIs in multiple corners simultaneously. Next, multi-corner yield optimization is performed by embedding the BNN into a Bayesian optimization framework. Since the correlation among yields and POIs in different corners is implicitly encoded in the BNN model, it provides great modeling capability for yields and their uncertainties, improving the efficiency of yield optimization. Our experimental results demonstrate that the proposed method saves up to 45.3% of simulation cost compared to baseline methods in achieving the same target yield. In addition, for the same simulation cost, our method finds better design points with 3.2% yield improvement.

Heterogeneous Integration Supply Chain Integrity through Blockchain and CHSM
Paul E. Calzada, Md Sami Ul Islam Sami, Kimia Zamiri Azar, Fahim Rahman, Farimah Farahmandi, Mark Tehranipoor
ACM Transactions on Design Automation of Electronic Systems. Published: 2023-10-06. DOI: https://doi.org/10.1145/3625823

Over the past few decades, electronics have become commonplace in government, commercial, and social domains. These devices have developed rapidly, as seen in the prevalent use of Systems on Chip (SoCs) rather than separate integrated circuits on a single circuit board. As the semiconductor community begins conversations about the end of Moore’s Law, an approach that further increases both functionality per area and yield by placing dies with segregated functionality on a common interposer die, termed a System in Package (SiP), is gaining attention. The chiplet and SiP space has grown to meet this demand, creating a new packaging paradigm, Advanced Packaging, and a new supply chain. This distributed supply chain, with multiple chiplet developers and foundries, has increased counterfeit vulnerabilities. Chiplets are currently available on an open market, and their origin and authenticity are consequently difficult to ascertain. With this lack of control over the stages of the supply chain, counterfeit threats manifest at the chiplet, interposer, and SiP levels. In this paper, we identify counterfeit threats in the SiP domain and propose a mitigating framework that utilizes blockchain for effective traceability of SiPs to establish provenance. Our framework utilizes a Chiplet Hardware Security Module (CHSM) to authenticate a SiP throughout its life. To accomplish this, we leverage SiP information including Electronic Chip IDs (ECIDs) of chiplets, Combating Die and IC Recycling (CDIR) sensor readings, documentation, test patterns and/or electrical measurements, and the grade and part number of the SiP. We detail the structure of the blockchain and establish protocols both for enrolling trusted information into the blockchain network and for authenticating the SiP. Our framework mitigates SiP counterfeit threats including recycled, remarked, and cloned SiPs, overproduced interposers, forged documentation, and substituted chiplets, while also detecting out-of-spec and defective SiPs.

NPU-Accelerated Imitation Learning for Thermal Optimization of QoS-Constrained Heterogeneous Multi-Cores
Martin Rapp, Heba Khdr, Nikita Krohmer, Jörg Henkel
ACM Transactions on Design Automation of Electronic Systems. Published: 2023-10-05. DOI: https://doi.org/10.1145/3626320

Thermal optimization of a heterogeneous clustered multi-core processor under user-defined quality-of-service (QoS) targets requires application migration and dynamic voltage and frequency scaling (DVFS). However, selecting the core to execute each application and the voltage/frequency (V/f) level of each cluster is a complex problem, because 1) the diverse characteristics and QoS targets of applications require different optimizations, and 2) per-cluster DVFS requires a global optimization considering all running applications. State-of-the-art resource management for power or temperature minimization either relies on measurements that are commonly unavailable (such as power) or fails to consider all dimensions of the optimization (e.g., by using simplified analytical models). Machine learning (ML) methods can solve this. In particular, imitation learning (IL) leverages the optimality of an oracle policy, yet at low run-time overhead, by training a model from oracle demonstrations. We are the first to employ IL for temperature minimization under QoS targets. We tackle the complexity by training a neural network (NN) at design time and accelerating run-time NN inference using a neural processing unit (NPU). While such NN accelerators are becoming increasingly widespread, they have so far only been used to accelerate user applications. In contrast, we use an existing accelerator on a real platform, for the first time, to accelerate NN-based resource management. To show the superiority of IL over reinforcement learning (RL) for our target problem, we also develop a multi-agent RL-based management policy. Our evaluation on a HiKey 970 board with an Arm big.LITTLE CPU and an NPU shows that our technique, TOP-IL, achieves significant temperature reductions at negligible run-time overhead. Compared to the ondemand Linux governor, TOP-IL reduces the average temperature by up to 17 °C with minimal QoS violations. Compared to the RL policy, TOP-IL achieves 63% to 89% fewer QoS violations while yielding similar average temperatures. Moreover, TOP-IL outperforms the RL policy in terms of stability. We additionally show that our IL-based technique generalizes to software (unseen applications) and even hardware (different cooling) that differ from those used for training.