Pub Date: 2021-11-01; DOI: 10.1109/ICCAD51958.2021.9643462
Emmanuel Pescosta, Georg Weissenbacher, Florian Zuleger
Spectre, a hardware vulnerability that breaks the isolation between applications, has received ample attention in recent years. Spectre-style attacks exploit speculative execution to leak information through micro-architectural side-channels, breaking down abstractions software developers have relied on for decades. As these attacks are based on fundamental optimization techniques present in most modern micro-processors, salvation seems to lie in software-based countermeasures for now. Comprehensive software mitigation, however, has proved to be an exceptionally challenging task with ample room for failure. To support the automated analysis of mitigation attempts, we present a technique that relies on Bounded Model Checking to detect violations of non-interference in speculative executions. Since off-the-shelf software model checking tools are unaware of micro-architectural state, we base our effort on an operational semantics of speculative executions of micro-assembly code. Our semantics is parameterized with micro-architectural components (such as the cache or the branch predictor), allowing for precise models of various side-channels. We evaluate our approach on widely used benchmark instances, report the detection of a zero-day vulnerability in the Linux kernel, and demonstrate that our approach is more exhaustive than symbolic simulation (with comparable computational effort).
Title: Bounded Model Checking of Speculative Non-Interference. In: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD).
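The non-interference property named in this abstract can be illustrated with a toy check. The sketch below is not the paper's Bounded Model Checking encoding: it replaces the solver with exhaustive enumeration over a tiny input space, and the memory layout, the `window` bound, and the "set of touched cache lines" observation model are all assumptions made for illustration. It looks for two runs that differ only in the secret yet produce different attacker-visible traces.

```python
from itertools import product

ARRAY_SIZE = 4      # public array occupies memory[0:4]
LINE_STRIDE = 16    # hypothetical probe-array stride, one cache line per value


def run(index, secret, mispredict, window=2):
    """Return the probe-array cache lines touched by a Spectre-v1-style gadget.

    Modelled gadget: if index < ARRAY_SIZE: probe[array[index] * LINE_STRIDE]
    """
    memory = [1, 2, 3, 0, secret]  # assumed layout: secret sits right after the array
    touched = set()
    in_bounds = index < ARRAY_SIZE
    # The branch body runs architecturally when in bounds, or transiently when
    # the predictor guesses wrong and the speculation window covers both loads.
    if in_bounds or (mispredict and window >= 2):
        value = memory[index]             # first load
        touched.add(value * LINE_STRIDE)  # second, value-dependent load
    return frozenset(touched)


def find_violation(secrets=range(4), indices=range(5), window=2):
    """Search for a public input whose cache trace depends on the secret."""
    for index, mispredict in product(indices, (False, True)):
        traces = {run(index, s, mispredict, window) for s in secrets}
        if len(traces) > 1:
            return index, mispredict
    return None


print(find_violation())          # out-of-bounds index under misprediction
print(find_violation(window=1))  # window too short for the second load
```

With the assumed window of two loads the search reports the out-of-bounds index under misprediction; shrinking the window to one load removes the leak in this toy model, which is the kind of effect a mitigation would aim for.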
Pub Date: 2021-11-01; DOI: 10.1109/ICCAD51958.2021.9643485
Hassan Nassar, Hanna AlZughbi, Dennis R. E. Gnad, L. Bauer, M. Tahoori, J. Henkel
FPGAs are being offered in the cloud as accelerator resources that can be shared among multiple users (i.e. tenants). Recently, various approaches have shown that fault attacks launched from one tenant region to another are possible, leading to timing faults or crashes of the FPGA. It is, therefore, important that malicious tenants are limited in their ability to cause such security problems. So far, the existing countermeasures against such attacks check the configuration bitstreams before they are loaded. Such offline approaches have various practical limitations, e.g. they may force the tenants to unveil their design secrets. In this paper, we present LoopBreaker, a novel runtime solution that can disable the entire activity of a malicious tenant region, in order to rapidly stop a potential attack before it results in a crash (i.e. Denial-of-Service). We implemented and tested multiple attack types and found that realistic attacks need at least 12–26 µs to be successful. A partial reconfiguration to overwrite the malicious tenant region takes 200 µs in our real-world implementation, which is too slow to prevent the attack from leading to a crash. Instead, our proposed LoopBreaker method only needs 1.5 µs to stop a malicious tenant, which makes it the first online approach that can successfully stop challenging voltage drop-based attacks from causing a crash.
Title: LoopBreaker: Disabling Interconnects to Mitigate Voltage-Based Attacks in Multi-Tenant FPGAs. In: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD).
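The timing argument in the LoopBreaker abstract reduces to comparing each countermeasure's response time with the fastest successful attack. A minimal sketch of that comparison follows; the three figures are taken from the abstract, while the function names are hypothetical and the model assumes, as a simplification, that detection latency is zero.

```python
ATTACK_MIN_US = 12.0  # fastest successful attack reported in the abstract
RESPONSE_US = {
    "partial_reconfiguration": 200.0,  # overwrite the malicious tenant region
    "loopbreaker": 1.5,                # disable the tenant region at runtime
}


def stops_in_time(response_us, attack_min_us=ATTACK_MIN_US):
    """True if the countermeasure completes before the attack can succeed."""
    return response_us < attack_min_us


def headroom_us(response_us, attack_min_us=ATTACK_MIN_US):
    """Time left for detection and other overheads (negative means too slow)."""
    return attack_min_us - response_us


for name, response in RESPONSE_US.items():
    print(name, stops_in_time(response), headroom_us(response))
```

Under these assumptions only the 1.5 µs response leaves positive headroom, and that headroom is the budget any real detection mechanism would have to fit into.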
To meet the ever-increasing requirements of on-chip communication, the trend is towards wavelength-routed optical networks-on-chip (WRONoCs), which support high-speed communication with low power. A typical WRONoC design flow consists of two consecutive steps: topological design and physical design. Current physical design tools interpret the input topology as a pure logic scheme and perform placement and routing for all network components from scratch. Due to the large design complexity and the layout constraints, additional waveguide crossings in the synthesized layouts are hardly avoidable, which results in an increase in insertion loss and crosstalk noise and thus degrades the network performance. In this work, we propose a physical design tool, ToPro, which retains the interconnection among the optical switching elements by projecting the structure of a WRONoC topology onto the physical plane, and focuses on the waveguide routing to the IP-cores. To avoid the increase in insertion loss and crosstalk noise, ToPro removes the extra crossings and long detours of waveguides by changing the routing order of nets. The experimental results demonstrate the superiority of ToPro in time- and energy-efficiency. For example, compared to a state-of-the-art design automation tool, ToPro synthesizes a network with 16 IP-cores with a 17% reduction in the worst-case insertion loss and decreases the synthesis time from more than six days to less than one second.
Title: ToPro: A Topology Projector and Waveguide Router for Wavelength-Routed Optical Networks-on-Chip. In: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD).
Zhidan Zheng, Mengchu Li, Tsun-Ming Tseng, Ulf Schlichtmann
Pub Date: 2021-11-01; DOI: 10.1109/ICCAD51958.2021.9643451
Pub Date: 2021-11-01; DOI: 10.1109/ICCAD51958.2021.9643502
Azat Azamat, Faaiz Asim, Jongeun Lee
ReRAM (Resistive Random-Access Memory) crossbar arrays have the potential to provide extremely fast and low-cost DNN (Deep Neural Network) acceleration. However, peripheral circuits, in particular ADCs (Analog-to-Digital Converters), can be a large overhead and/or slow down the operation considerably. In this paper we propose to use advanced quantization techniques to reduce the ADC overhead of ReRAM crossbar arrays. Our method does not require any hardware change but can greatly reduce the ADC overhead. Our methodology is also general, having no restriction in terms of DNN type (binarized or multi-bit) or ReRAM crossbar array size. Our experimental results using ResNet on the ImageNet dataset demonstrate that our method can reduce the size of the ADC by 32× compared with ISAAC at a very small accuracy loss of 0.24 percentage points.
Title: Quarry: Quantization-based ADC Reduction for ReRAM-based Deep Neural Network Accelerators. In: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD).
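To see why quantization bears on ADC cost, consider the partial sum a crossbar column produces. The sketch below is an assumption-laden illustration, not the Quarry method: it assumes one-bit inputs and one-bit cells, so a column with `rows` rows yields an integer sum between 0 and `rows`, and it uses plain uniform quantization as a stand-in for the "advanced quantization techniques" the abstract mentions without detailing. All function names are hypothetical.

```python
def exact_adc_bits(rows):
    """Bits needed to tell apart every column sum in 0..rows (one-bit products)."""
    return rows.bit_length()


def quantize_sum(column_sum, rows, bits):
    """Map a column sum in [0, rows] to one of 2**bits uniformly spaced codes."""
    levels = 2 ** bits - 1
    return round(column_sum * levels / rows)


def dequantize(code, rows, bits):
    """Recover the approximate column sum represented by an ADC code."""
    levels = 2 ** bits - 1
    return code * rows / levels


rows, bits = 128, 3
worst = max(abs(dequantize(quantize_sum(s, rows, bits), rows, bits) - s)
            for s in range(rows + 1))
print(exact_adc_bits(rows), bits, worst)
```

In this toy setting a 128-row column needs 8 bits for an exact readout, while a 3-bit readout bounds the error at half a quantization step; whether a network tolerates that error is exactly what a quantization-aware method has to establish.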
Pub Date: 2021-11-01; DOI: 10.1109/ICCAD51958.2021.9643440
Daniel Manu, Yi Sheng, Junhuan Yang, Jieren Deng, Tong Geng, Ang Li, Caiwen Ding, Weiwen Jiang, Lei Yang
The outbreak of the global COVID-19 pandemic emphasizes the importance of collaborative drug discovery for high effectiveness; however, due to stringent data regulation, data privacy becomes a pressing issue that must be addressed to enable collaborative drug discovery. In addition to the data privacy issue, the efficiency of drug discovery is another key objective, since infectious diseases spread exponentially and conducting drug discovery efficiently could save lives. Advanced Artificial Intelligence (AI) techniques are promising for solving these problems: (1) Federated Learning (FL) is designed to preserve data privacy while learning from data held by distributed clients; (2) a graph neural network (GNN) can extract structural properties of molecules, whose underlying architecture is the connected atoms; and (3) a generative adversarial network (GAN) can generate novel molecules while retaining the properties learned from the training data. In this work, we make the first attempt to build a holistic collaborative and privacy-preserving FL framework, namely FL-DISCO, which integrates GAN and GNN to generate molecular graphs. Experimental results demonstrate the effectiveness of FL-DISCO on: (1) IID data for ESOL and QM9, where FL-DISCO can generate highly novel compounds with high drug-likeness, uniqueness and LogP scores compared to the baseline; (2) non-IID data for ESOL and QM9, where FL-DISCO generates 100% novel compounds with high validity and LogP scores compared to the baseline. We also demonstrate how different fractions of clients and different generator and discriminator architectures affect our evaluation scores.
Title: FL-DISCO: Federated Generative Adversarial Network for Graph-based Molecule Drug Discovery: Special Session Paper. In: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD).
Pub Date: 2021-11-01; DOI: 10.1109/ICCAD51958.2021.9643585
Necati Uysal, Rickard Ewetz
Closing timing after clock tree synthesis (CTS) is very challenging in the presence of on-chip variations (OCVs). State-of-the-art design flows first synthesize an initial clock tree that contains timing violations introduced by OCVs. Next, aggressive clock tree optimization (CTO) is applied to eliminate the timing violations. Unfortunately, it may be impossible to eliminate all violations given the structure of the initial clock tree. In this paper, we propose an OCV-aware clock tree synthesis methodology that aims to rethink how to account for OCVs. The key idea is to predict the impact of OCVs early in the synthesis process, which allows the variations to be compensated for using non-uniform safety margins. This results in a synthesis flow that is almost correct-by-design. In contrast, state-of-the-art design flows often have an unpredictable success rate because the OCVs are considered too late in the synthesis process. Concretely, this is achieved by constructing a virtual clock tree top-down and refining it bottom-up into a real clock tree implementation. To balance the quality of results (QoR) and runtime, multiple top-level tree topologies are enumerated and pruned in the synthesis process. Compared with the CTO-based approach, the experimental results demonstrate that the proposed methodology reduces the total negative slack (TNS) and worst negative slack (WNS) by 90% and 75%, respectively.
Title: An OCV-Aware Clock Tree Synthesis Methodology. In: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD).
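The two metrics quoted in this abstract have standard definitions: TNS sums all negative endpoint slacks and WNS is the single most negative one. The sketch below computes both and the relative reduction between two runs. The slack values are made up purely to show what reductions of 90% and 75% look like; they are not data from the paper, and the function names are hypothetical.

```python
def wns(slacks):
    """Worst negative slack: the most negative slack, or 0.0 if timing is met."""
    return min(min(slacks, default=0.0), 0.0)


def tns(slacks):
    """Total negative slack: the sum of all negative endpoint slacks."""
    return sum(s for s in slacks if s < 0)


def reduction(before, after):
    """Fractional reduction in magnitude between two negative-slack values."""
    return 1 - abs(after) / abs(before)


# Hypothetical endpoint slacks in picoseconds, before and after optimization.
baseline = [-40.0, -30.0, -20.0, -10.0, 5.0]
improved = [-10.0, 2.0, 3.0, 4.0, 5.0]

print(tns(baseline), wns(baseline))
print(reduction(tns(baseline), tns(improved)),
      reduction(wns(baseline), wns(improved)))
```

Because TNS aggregates over all failing endpoints while WNS tracks only the worst one, the two can improve by different amounts, as in the abstract's reported figures.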
Pub Date: 2021-11-01; DOI: 10.1109/ICCAD51958.2021.9643445
Juzheng Liu, Shiyu Su, Meghna Madhusudan, Mohsen Hassanpourghadi, Samuel Saunders, Qiaochu Zhang, Rezwan A. Rasul, Yaguang Li, Jiang Hu, A. Sharma, S. Sapatnekar, R. Harjani, Anthony Levi, S. Gupta, M. Chen
We propose a complete analog mixed-signal circuit design flow from specification to silicon with minimum human-in-the-loop interaction, and verify the flow in a 12nm FinFET CMOS process. The flow consists of three key elements: neural network (NN) modeling of the parameterized circuit component, a search algorithm based on NN models to determine its sizing, and layout automation. To reduce the required training data for NN model creation, we utilize transfer learning to improve the NN accuracy from a relatively small amount of post-layout/silicon data. To prove the concept, we use a voltage-controlled oscillator (VCO) as a test vehicle and demonstrate that our design methodology can accurately model the circuit and generate designs with a wide range of specifications. We show that circuit sizing based on the transfer learned NN model from silicon measurement data yields the most accurate results.
Title: From Specification to Silicon: Towards Analog/Mixed-Signal Design Automation using Surrogate NN Models with Transfer Learning. In: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD).
Pub Date: 2021-11-01; DOI: 10.1109/ICCAD51958.2021.9643564
Rongjian Liang, Jinwook Jung, Hua Xiang, L. Reddy, Alexey Lvov, Jiang Hu, Gi-Joon Nam
EDA tools provide a large spectrum of parameters to help designers achieve the best power, performance, and area (PPA) for their designs. The corresponding enormous solution space, however, hinders designers from navigating towards optimal solutions. In this paper, we propose a multi-stage automatic flow tuning tool, named FlowTuner, for efficient and effective parameter tuning of VLSI design flows. It combines exploitation, using parameter knowledge transferred from archival design data, with exploration via a multi-stage cooperative co-evolutionary framework. Furthermore, novel flow jump-start and early-stop techniques are developed to reduce the overall runtime for tuning. Experiments on a set of IWLS 2005 benchmark circuits through a commercial tool flow demonstrate that FlowTuner produces considerably better design outcomes in 50% shorter turnaround time compared to the state-of-the-art flow tuning techniques.
Title: FlowTuner: A Multi-Stage EDA Flow Tuner Exploiting Parameter Knowledge Transfer. In: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD).
Pub Date: 2021-11-01; DOI: 10.1109/ICCAD51958.2021.9643499
N. Pinckney, Rangharajan Venkatesan, Ben Keller, Brucek Khailany
High-level synthesis (HLS) has recently been used to improve design productivity for many units in today's complex SoCs. HLS tools and flows improve chip design productivity by enabling prototyping and automated implementation of RTL from a single codebase. Although interconnect design is a critical part of today's highly complex SoCs, HLS has not historically been used for SoC-level interconnect. One reason for this is that interconnect architecture and physical floorplan are tightly coupled, and can be difficult to estimate early in the design process. To address this gap, we propose IPA (Interconnect Prototyping Assistant), a framework for interconnect prototyping and implementation in HLS-based SoC flows. IPA includes an application programming interface (API) and accompanying tools that automate interconnect modeling and generation for SystemC-based designs. Our framework is used during early architectural prototyping by abstracting specifics of interconnect implementation. IPA then generates interconnect models, including interfaces, for SystemC cycle-accurate simulations. If the design requires long wires between communication units, IPA automatically inserts retiming stages to meet clock frequency targets. IPA's SystemC code is fully HLS-compatible for RTL creation, and thus can be used within a full-chip HLS flow for pushbutton interconnect generation once a design point is selected. IPA provides accurate architectural performance feedback in minutes and can generate high-quality RTL implementations for SoC interconnect in hours. We demonstrate IPA by exploring the design space for an on-chip interconnect on a micro-benchmark and a deep learning accelerator.
Title: IPA: Floorplan-Aware SystemC Interconnect Performance Modeling and Generation for HLS-based SoCs. In: 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD).
Pub Date : 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643489
Zhiqiang Liu, Wenjian Yu
Due to the rapid advance of integrated circuit technology, power grid analysis poses a severe computational challenge: linear systems with millions or even billions of unknowns must be solved. Recent graph spectral sparsification techniques have shown promising performance in accelerating power grid analysis. However, previous graph-sparsification-based iterative solvers are difficult to parallelize. Existing graph sparsification algorithms are implemented under the assumption of serial computing, and factorization and backward/forward substitution with the sparsifier's Laplacian matrix are also hard to parallelize. On the other hand, partition-based iterative methods, which are easily parallelized, lack direct control of the relative condition number of the preconditioner and consume more memory. In this work, we propose a novel parallel iterative solver for scalable power grid analysis that integrates graph sparsification techniques with partition-based methods. We first propose a practically efficient parallel graph sparsification algorithm. Then, a domain decomposition method is leveraged to solve the linear system with the sparsifier's Laplacian matrix. The result is an efficient graph-sparsification-based parallel preconditioner that not only leads to fast convergence but is also easy to parallelize. Extensive experiments demonstrate the superior efficiency of the proposed solver for large-scale power grid analysis, showing an average 5.2X speedup over the state-of-the-art parallel iterative solver. Moreover, it solves a real-world power grid matrix with 0.36 billion nodes and 8.7 billion nonzeros within 23 minutes on a 16-core machine, which is 9.5X faster than the best result of a sequential graph-sparsification-based solver.
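The general idea of using a sparsified graph as a preconditioner can be sketched on a toy problem. The code below is a serial illustration and not the pGRASS-Solver algorithm. It assumes a 4x4 resistor grid with made-up conductances, uses a maximum-weight spanning tree as a crude stand-in for a spectral sparsifier, and factorizes the sparsifier with a dense Cholesky routine instead of domain decomposition. All function names are hypothetical.

```python
import math


def grid_edges(rows, cols):
    """Edges (u, v, conductance) of a rows x cols grid; weights are made up."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            u = r * cols + c
            if c + 1 < cols:
                edges.append((u, u + 1, 1.0 + (u % 3)))
            if r + 1 < rows:
                edges.append((u, u + cols, 1.0 + ((u + 1) % 3)))
    return edges


def grounded_laplacian(n, edges, ground):
    """Dense SPD matrix: graph Laplacian plus diagonal conductances to ground."""
    A = [[0.0] * n for _ in range(n)]
    for u, v, w in edges:
        A[u][u] += w
        A[v][v] += w
        A[u][v] -= w
        A[v][u] -= w
    for u, g in ground.items():
        A[u][u] += g
    return A


def max_spanning_tree(n, edges):
    """Kruskal with union-find; a stand-in for a real spectral sparsifier."""
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tree = []
    for u, v, w in sorted(edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v, w))
    return tree


def cholesky(A):
    """Lower-triangular L with A = L L^T (dense, for small examples only)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def chol_solve(L, b):
    """Forward then backward substitution with the Cholesky factor."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def pcg(A, b, precond, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradient; precond(r) applies M^{-1} to r."""
    x = [0.0] * len(b)
    r = list(b)
    z = precond(r)
    p = list(z)
    rz = dot(r, z)
    bnorm = math.sqrt(dot(b, b))
    for it in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if math.sqrt(dot(r, r)) <= tol * bnorm:
            return x, it
        z = precond(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


n = 16
edges = grid_edges(4, 4)
ground = {0: 1.0}  # node 0 tied to ground through conductance 1.0 (assumed)
A = grounded_laplacian(n, edges, ground)
M = cholesky(grounded_laplacian(n, max_spanning_tree(n, edges), ground))
b = [1.0 if i == n - 1 else 0.0 for i in range(n)]  # unit current at node 15
x, iters = pcg(A, b, lambda r: chol_solve(M, r))
residual = max(abs(ai - bi) for ai, bi in zip(matvec(A, x), b))
print(iters, residual)
```

In this sketch the factorization and the two triangular solves are the serial steps that the abstract identifies as hard to parallelize. A closer sparsifier (more off-tree edges retained) generally reduces the iteration count at the cost of a more expensive factorization.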
{"title":"pGRASS-Solver: A Parallel Iterative Solver for Scalable Power Grid Analysis Based on Graph Spectral Sparsification","authors":"Zhiqiang Liu, Wenjian Yu","doi":"10.1109/ICCAD51958.2021.9643489","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643489","journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"81 1","pages":"0"},"publicationDate":"2021-11-01","platform":"Semanticscholar","paperid":"126373963"}