Squeezing Accumulators in Binary Neural Networks for Extremely Resource-Constrained Applications
Azat Azamat, Jaewoo Park, Jongeun Lee
DOI: 10.1145/3508352.3549418

The cost and power consumption of BNN (Binarized Neural Network) hardware are dominated by additions. In particular, accumulators account for a large fraction of the hardware overhead, which can be reduced effectively by using reduced-width accumulators. However, finding the optimal accumulator width is not straightforward due to the complex interplay between width, scale, and the effect of training. In this paper we present algorithmic and hardware-level methods to find the optimal accumulator size for BNN hardware with minimal impact on the quality of results. First, we present partial sum scaling, a top-down approach to minimizing the BNN accumulator size based on advanced quantization techniques. We also present an efficient, zero-overhead hardware design for partial sum scaling. Second, we evaluate a bottom-up approach that uses a saturating accumulator, which is more robust against overflow. Our experimental results on the CIFAR-10 dataset demonstrate that partial sum scaling, along with our optimized accumulator architecture, can reduce the area and power consumption of the datapath by 15.50% and 27.03%, respectively, with little impact on inference performance (less than 2%), compared to a 16-bit accumulator.
{"title":"Squeezing Accumulators in Binary Neural Networks for Extremely Resource-Constrained Applications","authors":"Azat Azamat, Jaewoo Park, Jongeun Lee","doi":"10.1145/3508352.3549418","DOIUrl":"https://doi.org/10.1145/3508352.3549418","url":null,"abstract":"The cost and power consumption of BNN (Binarized Neural Network) hardware is dominated by additions. In particular, accumulators account for a large fraction of hardware overhead, which could be effectively reduced by using reduced-width accumulators. However, it is not straightforward to find the optimal accumulator width due to the complex interplay between width, scale, and the effect of training. In this paper we present algorithmic and hardware-level methods to find the optimal accumulator size for BNN hardware with minimal impact on the quality of result. First, we present partial sum scaling, a top-down approach to minimize the BNN accumulator size based on advanced quantization techniques. We also present an efficient, zero-overhead hardware design for partial sum scaling. Second, we evaluate a bottom-up approach that is to use saturating accumulator, which is more robust against overflows. Our experimental results using CIFAR-10 dataset demonstrate that our partial sum scaling along with our optimized accumulator architecture can reduce the area and power consumption of datapath by 15.50% and 27.03%, respectively, with little impact on inference performance (less than 2%), compared to using 16-bit accumulator.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"62 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114034486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
All-in-One: A Highly Representative DNN Pruning Framework for Edge Devices with Dynamic Power Management
Yifan Gong, Zheng Zhan, Pu Zhao, Yushu Wu, Chaoan Wu, Caiwen Ding, Weiwen Jiang, Minghai Qin, Yanzhi Wang
DOI: 10.1145/3508352.3549379
During the deployment of deep neural networks (DNNs) on edge devices, many research efforts have been devoted to limited hardware resources. However, little attention is paid to the influence of dynamic power management. As edge devices typically run on a limited battery energy budget (rather than the nearly unlimited power supply of servers or workstations), their dynamic power management often changes the execution frequency, as in the widely used dynamic voltage and frequency scaling (DVFS) technique. This leads to highly unstable inference speed, especially for computation-intensive DNN models, which can harm the user experience and waste hardware resources. We first identify this problem and then propose All-in-One, a highly representative pruning framework that works with dynamic power management using DVFS. The framework uses only one set of model weights and soft masks (together with other auxiliary parameters of negligible storage) to represent multiple models of various pruning ratios. By re-configuring the model to the pruning ratio corresponding to a specific execution frequency (and voltage), we achieve stable inference speed, i.e., we keep the difference in speed across execution frequencies as small as possible. Our experiments demonstrate that our method not only achieves high accuracy for multiple models of different pruning ratios, but also reduces the variance of their inference latency across frequencies, with the minimal memory consumption of only one model and one soft mask.
{"title":"All-in-One: A Highly Representative DNN Pruning Framework for Edge Devices with Dynamic Power Management","authors":"Yifan Gong, Zheng Zhan, Pu Zhao, Yushu Wu, Chaoan Wu, Caiwen Ding, Weiwen Jiang, Minghai Qin, Yanzhi Wang","doi":"10.1145/3508352.3549379","DOIUrl":"https://doi.org/10.1145/3508352.3549379","url":null,"abstract":"During the deployment of deep neural networks (DNNs) on edge devices, many research efforts are devoted to the limited hardware resource. However, little attention is paid to the influence of dynamic power management. As edge devices typically only have a budget of energy with batteries (rather than almost unlimited energy support on servers or workstations), their dynamic power management often changes the execution frequency as in the widely-used dynamic voltage and frequency scaling (DVFS) technique. This leads to highly unstable inference speed performance, especially for computation-intensive DNN models, which can harm user experience and waste hardware resources. We firstly identify this problem and then propose All-in-One, a highly representative pruning framework to work with dynamic power management using DVFS. The framework can use only one set of model weights and soft masks (together with other auxiliary parameters of negligible storage) to represent multiple models of various pruning ratios. By re-configuring the model to the corresponding pruning ratio for a specific execution frequency (and voltage), we are able to achieve stable inference speed, i.e., keeping the difference in speed performance under various execution frequencies as small as possible. Our experiments demonstrate that our method not only achieves high accuracy for multiple models of different pruning ratios, but also reduces their variance of inference latency for various frequencies, with minimal memory consumption of only one model and one soft mask.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128261691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Superfast Full-Scale GPU-Accelerated Global Routing
Shiju Lin, Martin D. F. Wong
DOI: 10.1145/3508352.3549474

Global routing is an essential step in physical design. Recent works have accelerated global routers using GPUs, but they focus only on certain stages of global routing and achieve limited overall speedup. In this paper, we present a superfast full-scale GPU-accelerated global router and introduce useful parallelization techniques for routing. Experiments show that our 3D router achieves both good quality and short runtime compared to other state-of-the-art academic global routers.
{"title":"Superfast Full-Scale GPU-Accelerated Global Routing","authors":"Shiju Lin, Martin D. F. Wong","doi":"10.1145/3508352.3549474","DOIUrl":"https://doi.org/10.1145/3508352.3549474","url":null,"abstract":"Global routing is an essential step in physical design. Recently there are works on accelerating global routers using GPU. However, they only focus on certain stages of global routing, and have limited overall speedup. In this paper, we present a superfast full-scale GPU-accelerated global router and introduce useful parallelization techniques for routing. Experiments show that our 3D router achieves both good quality and short runtime compared to other state-of-the-art academic global routers.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128652064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sound Source Localization using Stochastic Computing
Peter Schober, Seyedeh Newsha Estiri, Sercan Aygün, N. Taherinejad, M. Najafi
DOI: 10.1145/3508352.3549373
Stochastic computing (SC) is an alternative computing paradigm that processes data in the form of long uniform bit-streams rather than conventional compact weighted binary numbers. SC is fault-tolerant and can compute on small, efficient circuits, promising advantages over conventional arithmetic for smaller computer chips. To date, SC has been used primarily in scientific research rather than in practical applications. Digital sound source localization (SSL) is a useful signal processing technique that locates speakers using multiple microphones in cell phones, laptops, and other voice-controlled devices. SC has not previously been applied to SSL, in practice or in theory. In this work, for the first time to the best of our knowledge, we implement an SSL algorithm in the stochastic domain and develop a functional SC-based sound source localizer that can replace the conventional implementation of the algorithm. The practical part of this work shows that the proposed stochastic circuit does not rely on conventional analog-to-digital conversion and can process data in the form of pulse-width-modulated (PWM) signals. The proposed SC design consumes up to 39% less area than the conventional baseline design, and it can consume less power depending on the computational accuracy, for example, 6% less power for 3-bit inputs. The presented stochastic circuit is not limited to SSL and is readily applicable to other practical applications such as radar ranging, wireless location, sonar direction finding, beamforming, and sensor calibration.
{"title":"Sound Source Localization using Stochastic Computing","authors":"Peter Schober, Seyedeh Newsha Estiri, Sercan Aygün, N. Taherinejad, M. Najafi","doi":"10.1145/3508352.3549373","DOIUrl":"https://doi.org/10.1145/3508352.3549373","url":null,"abstract":"Stochastic computing (SC) is an alternative computing paradigm that processes data in the form of long uniform bit-streams rather than conventional compact weighted binary numbers. SC is fault-tolerant and can compute on small, efficient circuits, promising advantages over conventional arithmetic for smaller computer chips. SC has been primarily used in scientific research, not in practical applications. Digital sound source localization (SSL) is a useful signal processing technique that locates speakers using multiple microphones in cell phones, laptops, and other voice-controlled devices. SC has not been integrated into SSL in practice or theory. In this work, for the first time to the best of our knowledge, we implement an SSL algorithm in the stochastic domain and develop a functional SC-based sound source localizer. The developed design can replace the conventional design of the algorithm. The practical part of this work shows that the proposed stochastic circuit does not rely on conventional analog-to-digital conversion and can process data in the form of pulse-width-modulated (PWM) signals. The proposed SC design consumes up to 39% less area than the conventional baseline design. The SC-based design can consume less power depending on the computational accuracy, for example, 6% less power consumption for 3-bit inputs. The presented stochastic circuit is not limited to SSL and is readily applicable to other practical applications such as radar ranging, wireless location, sonar direction finding, beamforming, and sensor calibration.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130591209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speculative Load Forwarding Attack on Modern Processors
Hasini Witharana, P. Mishra
DOI: 10.1145/3508352.3549417

Modern processors deliver high performance by utilizing advanced features such as out-of-order execution, branch prediction, speculative execution, and sophisticated buffer management. Unfortunately, these techniques have introduced diverse vulnerabilities, including Spectre, Meltdown, and microarchitectural data sampling (MDS). While Spectre and Meltdown leak data via memory side channels, MDS has been shown to leak data from internal CPU buffers in Intel architectures. AMD has reported that its processors are not vulnerable to MDS/Meltdown-type attacks. In this paper, we present a Meltdown/MDS-type attack that leaks data from the load queue in AMD Zen family architectures. To the best of our knowledge, our approach is the first attempt to develop an attack on AMD architectures that uses speculative load forwarding to leak data through the load queue. Experimental evaluation demonstrates that our proposed attack is successful on multiple machines with AMD processors. We also explore a lightweight mitigation to defend against speculative load forwarding attacks on modern processors.
{"title":"Speculative Load Forwarding Attack on Modern Processors","authors":"Hasini Witharana, P. Mishra","doi":"10.1145/3508352.3549417","DOIUrl":"https://doi.org/10.1145/3508352.3549417","url":null,"abstract":"Modern processors deliver high performance by utilizing advanced features such as out-of-order execution, branch prediction, speculative execution, and sophisticated buffer management. Unfortunately, these techniques have introduced diverse vulnerabilities including Spectre, Meltdown, and microarchitectural data sampling (MDS). Although Spectre and Meltdown can leak data via memory side channels, MDS has shown to leak data from the CPU internal buffers in Intel architectures. AMD has reported that its processors are not vulnerable to MDS/Meltdown type attacks. In this paper, we present a Meltdown/MDS type of attack to leak data from the load queue in AMD Zen family architectures. To the best of our knowledge, our approach is the first attempt in developing an attack on AMD architectures using speculative load forwarding to leak data through the load queue. Experimental evaluation demonstrates that our proposed attack is successful on multiple machines with AMD processors. We also explore a lightweight mitigation to defend against speculative load forwarding attack on modern processors.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116022054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How Good Is Your Verilog RTL Code? A Quick Answer from Machine Learning
Prianka Sengupta, Aakash Tyagi, Yiran Chen, Jiangkun Hu
DOI: 10.1145/3508352.3549375
A hardware description language (HDL) is the common entry point for designing digital circuits. Differences in HDL coding styles and design choices may lead to considerably different design quality and performance-power tradeoffs. In general, the impact of HDL coding is not clear until logic synthesis or even layout is completed. However, running synthesis merely as feedback for HDL code is not computationally economical, especially in early design phases when the code needs to be modified frequently. Furthermore, in late stages of design convergence burdened with high-impact engineering change orders (ECOs), design iterations become prohibitively expensive. To this end, we propose a machine learning approach to Verilog-based Register-Transfer Level (RTL) design assessment that skips the synthesis process entirely, allowing designers to quickly evaluate the performance-power tradeoff among different RTL design options. Experimental results show that our proposed technique achieves an average of 95% prediction accuracy with respect to post-placement analysis and is six orders of magnitude faster than evaluation by running logic synthesis and placement.
{"title":"How Good Is Your Verilog RTL Code? A Quick Answer from Machine Learning","authors":"Prianka Sengupta, Aakash Tyagi, Yiran Chen, Jiangkun Hu","doi":"10.1145/3508352.3549375","DOIUrl":"https://doi.org/10.1145/3508352.3549375","url":null,"abstract":"Hardware Description Language (HDL) is a common entry point for designing digital circuits. Differences in HDL coding styles and design choices may lead to considerably different design quality and performance-power tradeoff. In general, the impact of HDL coding is not clear until logic synthesis or even layout is completed. However, running synthesis merely as a feedback for HDL code is computationally not economical especially in early design phases when the code needs to be frequently modified. Furthermore, in late stages of design convergence burdened with high-impact engineering change orders (ECO’s), design iterations become prohibitively expensive. To this end, we propose a machine learning approach to Verilog-based Register-Transfer Level (RTL) design assessment without going through the synthesis process. It would allow designers to quickly evaluate the performance-power tradeoff among different options of RTL designs. Experimental results show that our proposed technique achieves an average of 95% prediction accuracy in terms of post-placement analysis, and is 6 orders of magnitude faster than evaluation by running logic synthesis and placement.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"549 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116559244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hidden-ROM: A Compute-in-ROM Architecture to Deploy Large-Scale Neural Networks on Chip with Flexible and Scalable Post-Fabrication Task Transfer Capability
Yiming Chen, Guodong Yin, Ming-En Lee, Wenjun Tang, Zekun Yang, Yongpan Liu, Huazhong Yang, Xueqing Li
DOI: 10.1145/3508352.3549335

Motivated by reducing the data transfer activity in data-intensive neural network computing, SRAM-based compute-in-memory (CiM) has made significant progress. Unfortunately, SRAM has low density and limited on-chip capacity, which makes the deployment of large models inefficient due to the frequent DRAM accesses needed to update the weights in SRAM. Recently, a ROM-based CiM design, YOLoC, revealed the unique opportunity of deploying a large-scale neural network in CMOS by exploiting the intriguing high density of ROM. However, even though YOLoC adopts assisting SRAM for task transfer within the same domain, overcoming the read-only limitation of ROM to enable more flexibility remains a major challenge. It is therefore of paramount importance to develop new ROM-based CiM architectures that provide a broader task space and model expansion capability for more complex tasks. This paper presents Hidden-ROM, a highly flexible ROM-based CiM architecture offering several novel ideas beyond YOLoC. First, it adopts a one-SRAM-many-ROM method that "hides" ROM cells to support various datasets from different domains, including CIFAR-10/100, FER2013, and ImageNet. Second, Hidden-ROM provides model expansion capability after chip fabrication, updating the model for more complex tasks when needed. Experiments show that a Hidden-ROM design for ResNet-18 pretrained on CIFAR-100 (item classification) achieves <0.5% accuracy loss on FER2013 (facial expression recognition), while YOLoC degrades by >40%. After expanding to ResNet-50/101, Hidden-ROM achieves 68.6%/72.3% accuracy on ImageNet, close to the 74.9%/76.4% achieved by software. Such expansion costs only 7.6%/12.7% in energy efficiency while improving accuracy by 12%/16%.
{"title":"Hidden-ROM: A Compute-in-ROM Architecture to Deploy Large-Scale Neural Networks on Chip with Flexible and Scalable Post-Fabrication Task Transfer Capability","authors":"Yiming Chen, Guodong Yin, Ming-En Lee, Wenjun Tang, Zekun Yang, Yongpan Liu, Huazhong Yang, Xueqing Li","doi":"10.1145/3508352.3549335","DOIUrl":"https://doi.org/10.1145/3508352.3549335","url":null,"abstract":"Motivated by reducing the data transfer activities in dataintensive neural network computing, SRAM-based compute-inmemory (CiM) has made significant progress. Unfortunately, SRAM has low density and limited on-chip capacity. This makes the deployment of large models inefficient due to the frequent DRAM access to update the weight in SRAM. Recently, a ROM-based CiM design, YOLoC, reveals the unique opportunity of deploying a large-scale neural network in CMOS by exploring the intriguing high density of ROM. However, even though assisting SRAM has been adopted in YOLoC for task transfer within the same domain, it is still a big challenge to overcome the read-only limitation in ROM and enable more flexibility. Therefore, it is of paramount significance to develop new ROM-based CiM architectures and provide broader task space and model expansion capability for more complex tasks.This paper presents Hidden-ROM for high flexibility of ROM-based CiM. Hidden-ROM provides several novel ideas beyond YOLoC. First, it adopts a one-SRAM-many-ROM method that \"hides\" ROM cells to support various datasets of different domains, including CIFAR10/100, FER2013, and ImageNet. Second, HiddenROM provides the model expansion capability after chip fabrication to update the model for more complex tasks when needed. Experiments show that Hidden-ROM designed for ResNet-18 pretrained on CIFAR100 (item classification) can achieve <0.5% accuracy loss in FER2013 (facial expression recognition), while YOLoC degrades by >40%. After expanding to ResNet-50/101, Hidden-ROM even achieves 68.6%/72.3% accuracy in ImageNet, close to 74.9%/76.4% by software. Such expansion costs only 7.6%/12.7% energy efficiency overhead while providing 12%/16% accuracy improvement after expansion.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127707885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Robust Global Routing Engine with High-accuracy Cell Movement under Advanced Constraints
Ziran Zhu, Fuheng Shen, Yangjie Mei, Zhipeng Huang, Jianli Chen, Jun-Zhi Yang
DOI: 10.1145/3508352.3549421

Placement and routing are typically defined as two separate problems to reduce design complexity. However, such a divide-and-conquer approach inevitably degrades solution quality because the objectives of placement and routing are not entirely consistent. Moreover, with the various constraints (e.g., timing, R/C characteristics, and voltage areas) imposed by advanced circuit designs, bridging the gap between placement and routing while satisfying these constraints has become more challenging. In this paper, we develop a robust global routing engine with high-accuracy cell movement under advanced constraints to narrow the gap and improve the routing solution. We first present a routing refinement technique to obtain a convergent routing result on the fixed placement, which provides more accurate information for subsequent cell movement. To achieve fast and high-accuracy position prediction for cell movement, we construct a lookup table (LUT) considering complex constraints/objectives (e.g., routing direction and layer-based power consumption) and generate a timing-driven gain map for each cell based on the LUT. Finally, based on the prediction, we propose an alternating cell movement and cluster movement scheme, followed by partial rip-up and reroute, to optimize the routing solution. Experimental results on the ICCAD 2020 contest benchmarks show that our algorithm achieves the best total scores among all published works. On the ICCAD 2021 contest benchmarks, our algorithm achieves better solution quality in shorter runtime than the champion of the ICCAD 2021 contest.
{"title":"A Robust Global Routing Engine with High-accuracy Cell Movement under Advanced Constraints","authors":"Ziran Zhu, Fuheng Shen, Yangjie Mei, Zhipeng Huang, Jianli Chen, Jun-Zhi Yang","doi":"10.1145/3508352.3549421","DOIUrl":"https://doi.org/10.1145/3508352.3549421","url":null,"abstract":"Placement and routing are typically defined as two separate problems to reduce the design complexity. However, such a divide-and-conquer approach inevitably incurs the degradation of solution quality due to the correlation/objectives of placement and routing are not entirely consistent. Besides, with various constraints (e.g., timing, R/C characteristic, voltage area, etc.) imposed by advanced circuit designs, bridging the gap between placement and routing while satisfying the advanced constraints has become more challenging. In this paper, we develop a robust global routing engine with high-accuracy cell movement under advanced constraints to narrow the gap and improve the routing solution. We first present a routing refinement technique to obtain the convergent routing result based on fixed placement, which provides more accurate information for subsequent cell movement. To achieve fast and high-accuracy position prediction for cell movement, we construct a lookup table (LUT) considering complex constraints/objectives (e.g., routing direction and layer-based power consumption), and generate a timing-driven gain map for each cell based on the LUT. Finally, based on the prediction, we propose an alternating cell movement and cluster movement scheme followed by partial rip-up and reroute to optimize the routing solution. Experimental results on the ICCAD 2020 contest benchmarks show that our algorithm achieves the best total scores among all published works. Compared with the champion of the ICCAD 2021 contest, experimental results on the ICCAD 2021 contest benchmarks show that our algorithm achieves better solution quality in shorter runtime.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132437142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pin Accessibility and Routing Congestion Aware DRC Hotspot Prediction using Graph Neural Network and U-Net
Kyeonghyeon Baek, Hyunbum Park, Suwan Kim, Kyumyung Choi, Taewhan Kim
DOI: 10.1145/3508352.3549346
Accurate DRC (design rule check) hotspot prediction at the placement stage is essential to reduce the substantial design time spent on iterations of placement and routing. It is known that for chips implemented in advanced technology nodes, (1) pin accessibility and (2) routing congestion are the two major causes of DRVs (design rule violations). Though many ML (machine learning) techniques have been proposed to address this prediction problem, it has not been easy to assemble the data on items (1) and (2) in a unified fashion for training ML models, resulting in considerable accuracy loss in DRC hotspot prediction. This work overcomes this limitation with a novel ML-based DRC hotspot prediction technique that accurately captures the combined impact of items (1) and (2) on DRC hotspots. Specifically, we devise a graph, called the pin proximity graph, that effectively models the spatial information of cell I/O pins and pin-to-pin disturbance relations. We then propose a new ML model, called PGNN, which tightly combines a GNN (graph neural network) and U-Net: the GNN embeds pin accessibility information abstracted from our pin proximity graph, while the U-Net extracts routing congestion information from grid-based features. In experiments on a set of benchmark designs in the Nangate 15nm library, our PGNN outperforms the existing ML models on all benchmarks, achieving on average 7.8% to 12.5% improvement in F1-score together with 5.5× faster inference time compared with the state-of-the-art techniques.
{"title":"Pin Accessibility and Routing Congestion Aware DRC Hotspot Prediction using Graph Neural Network and U-Net","authors":"Kyeonghyeon Baek, Hyunbum Park, Suwan Kim, Kyumyung Choi, Taewhan Kim","doi":"10.1145/3508352.3549346","DOIUrl":"https://doi.org/10.1145/3508352.3549346","url":null,"abstract":"An accurate DRC (design rule check) hotspot prediction at the placement stage is essential in order to reduce a substantial amount of design time required for the iterations of placement and routing. It is known that for implementing chips with advanced technology nodes, (1) pin accessibility and (2) routing congestion are two major causes of DRVs (design rule violations). Though many ML (machine learning) techniques have been proposed to address this prediction problem, it was not easy to assemble the aggregate data on items 1 and 2 in a unified fashion for training ML models, resulting in a considerable accuracy loss in DRC hotspot prediction. This work overcomes this limitation by proposing a novel ML based DRC hotspot prediction technique, which is able to accurately capture the combined impact of items 1 and 2 on DRC hotspots. Precisely, we devise a graph, called pin proximity graph, that effectively models the spatial information on cell I/O pins and the information on pin-to-pin disturbance relation. Then, we propose a new ML model, called PGNN, which tightly combines GNN (graph neural network) and U-net in a way that GNN is used to embed pin accessibility information abstracted from our pin proximity graph while U-net is used to extract routing congestion information from grid-based features. Through experiments with a set of benchmark designs using Nangate 15nm library, our PGNN outperforms the existing ML models on all benchmark designs, achieving on average 7.8~12.5% improvements on F1-score while taking 5.5× fast inference time in comparison with that of the state-of-the-art techniques.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130532575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On Minimizing the Read Latency of Flash Memory to Preserve Inter-tree Locality in Random Forest
Yu-Cheng Lin, Yu-Pei Liang, Tseng-Yi Chen, Yuan-Hao Chang, Shuo-Han Chen, W. Shih
DOI: 10.1145/3508352.3549365

Many prior works have discussed how to bring machine learning algorithms to embedded systems. Because of resource constraints, embedded platforms for machine learning applications play the role of a predictor: an inference model is constructed on a personal computer or a server platform and then integrated into the embedded system for just-in-time inference. Given the limited main memory space of embedded systems, an important problem for embedded machine learning systems is how to efficiently move an inference model between main memory and secondary storage (e.g., flash memory). To tackle this problem, we need to consider how to preserve the locality inside the inference model during model construction. We therefore propose a solution, namely locality-aware random forest (LaRF), that preserves the inter-tree locality of all decision trees within a random forest model during the model construction process. Owing to this locality preservation, LaRF improves read latency by at least 81.5% compared to the original random forest library.
{"title":"On Minimizing the Read Latency of Flash Memory to Preserve Inter-tree Locality in Random Forest","authors":"Yu-Cheng Lin, Yu-Pei Liang, Tseng-Yi Chen, Yuan-Hao Chang, Shuo-Han Chen, W. Shih","doi":"10.1145/3508352.3549365","DOIUrl":"https://doi.org/10.1145/3508352.3549365","url":null,"abstract":"Many prior research works have been widely discussed how to bring machine learning algorithms to embedded systems. Because of resource constraints, embedded platforms for machine learning applications play the role of a predictor. That is, an inference model will be constructed on a personal computer or a server platform, and then integrated into embedded systems for just-in-time inference. With the consideration of the limited main memory space in embedded systems, an important problem for embedded machine learning systems is how to efficiently move inference model between the main memory and a secondary storage (e.g., flash memory). For tackling this problem, we need to consider how to preserve the locality inside the inference model during model construction. Therefore, we have proposed a solution, namely locality-aware random forest (LaRF), to preserve the inter-locality of all decision trees within a random forest model during the model construction process. Owing to the locality preservation, LaRF can improve the read latency by 81.5% at least, compared to the original random forest library.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125364534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}