Online and Offline Machine Learning for Industrial Design Flow Tuning: (Invited - ICCAD Special Session Paper)
Pub Date: 2021-11-01, DOI: 10.1109/ICCAD51958.2021.9643577
M. Ziegler, Jihye Kwon, Hung-Yi Liu, L. Carloni
Modern logic and physical synthesis tools provide numerous options and parameters that can drastically affect design quality; however, the large number of options leads to a complex design space that is difficult for human designers to navigate. Fortunately, machine learning approaches and cloud computing environments are well suited to tackling complex parameter-tuning problems like those seen in VLSI design flows. This paper proposes a holistic approach in which online and offline machine learning approaches work together to tune industrial design flows. We describe a system called SynTunSys (STS) that has been used to optimize multiple industrial high-performance processors. STS consists of an online system that optimizes designs and generates data for a recommender system that performs offline training and recommendation. Experimental results show that the collaboration between the STS online and offline machine learning systems, together with insight from human designers, provides best-of-breed results. Finally, we discuss potential new directions for research on design flow tuning.
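As a rough illustration of the online tuning loop described above, the sketch below runs a generation-based search: primitive parameter scenarios are evaluated first, and the best survivors are combined in later generations. The knob names, the pairwise combination rule, and the stand-in cost function are all assumptions made for illustration; they are not SynTunSys's actual interface.

```python
import random

# Primitive knob settings ("scenarios"); the knob names are invented.
PRIMITIVES = [
    {"logic_effort": "high"},
    {"restructure": True},
    {"useful_skew": True},
    {"vt_mix": 0.3},
]

def run_flow(scenario):
    """Stand-in for a full synthesis run returning a scalar cost
    (e.g., a weighted sum of timing, power, and congestion)."""
    random.seed(repr(sorted(scenario.items())))
    return random.random()

def tune(primitives, generations=3, survivors=2):
    population = [dict(p) for p in primitives]
    best = min(population, key=run_flow)
    for _ in range(generations):
        # Keep the best scenarios of this generation...
        ranked = sorted(population, key=run_flow)[:survivors]
        # ...and combine them pairwise to form the next generation.
        population = [dict(a, **b) for i, a in enumerate(ranked)
                      for b in ranked[i + 1:]]
        if not population:
            break
        best = min(population + [best], key=run_flow)
    return best, run_flow(best)

print(tune(PRIMITIVES))
```

An offline recommender, in this framing, would be trained on the archive of (scenario, cost) pairs that runs like this one produce, and would seed the initial population for new macros.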
{"title":"Online and Offline Machine Learning for Industrial Design Flow Tuning: (Invited - ICCAD Special Session Paper)","authors":"M. Ziegler, Jihye Kwon, Hung-Yi Liu, L. Carloni","doi":"10.1109/ICCAD51958.2021.9643577","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643577","url":null,"abstract":"Modern logic and physical synthesis tools provide numerous options and parameters that can drastically affect design quality; however, the large number of options leads to a complex design space difficult for human designers to navigate. Fortunately, machine learning approaches and cloud computing environments are well suited for tackling complex parameter tuning problems like those seen in VLSI design flows. This paper proposes a holistic approach where online and offline machine learning approaches work together for tuning industrial design flows. We describe a system called SynTunSys (STS) that has been used to optimize multiple industrial high-performance processors. STS consists of an online system that optimizes designs and generates data for a recommender system that performs offline training and recommendation. Experimental results show the collaboration between STS online and offline machine learning systems as well as insight from human designers provide best-of-breed results. Finally, we discuss potential new directions for research on design flow tuning.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125358481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AutoMap: Automated Mapping of Security Properties Between Different Levels of Abstraction in Design Flow
Pub Date: 2021-11-01, DOI: 10.1109/ICCAD51958.2021.9643467
Bulbul Ahmed, Fahim Rahman, Nick Hooten, Farimah Farahmandi, M. Tehranipoor
The security of system-on-chip (SoC) designs is threatened by many vulnerabilities introduced by untrusted third-party IPs and by the lack of security awareness among designers and CAD tools. Ensuring the security of an SoC has become highly challenging due to diverse threat models, high design complexity, and a lack of effective security-aware verification solutions. Moreover, new security vulnerabilities are introduced during the design transformation from higher to lower abstraction levels. As a result, security verification becomes a major bottleneck that must be performed at every level of design abstraction. Reducing the verification effort by mapping the security properties across design stages could efficiently lower the total verification time, provided the new vulnerabilities introduced at different abstraction levels are addressed properly. To address this challenge, we introduce AutoMap, which, in addition to the mapping, extends and expands the security properties to identify new vulnerabilities introduced as the design moves from higher to lower levels of abstraction. Starting at the higher abstraction level with a defined set of security properties for the target threat models, AutoMap automatically maps the properties to the lower levels of abstraction to reduce the verification effort. Furthermore, it extends and expands the properties to cover new vulnerabilities introduced by design transformations and updates at the lower abstraction level. We demonstrate AutoMap's efficacy by applying it to AES, RSA, and SHA256 at the C++, RTL, and gate levels. We show that AutoMap effectively facilitates the detection of security vulnerabilities from different sources during design transformation.
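A toy illustration of the mapping-and-expansion idea: a higher-level property is rewritten in terms of lower-level signal names, and expanded to also cover registers introduced by the design transformation. The name map and property format below are invented for this sketch; the paper's actual machinery is far richer.

```python
# Hypothetical correspondence between C++-level variables and RTL signals,
# as might be extracted from synthesis reports (invented names).
HLS_TO_RTL = {
    "key":   ["key_reg"],
    "state": ["round_state_q", "round_state_shadow_q"],  # copy added by retiming
}

def map_property(prop, name_map):
    """prop: (property_kind, high_level_signal). Returns one lower-level
    property per mapped signal, so transformation-added copies (e.g.,
    shadow or pipeline registers) are checked as well."""
    kind, signal = prop
    return [(kind, rtl_sig) for rtl_sig in name_map.get(signal, [])]

print(map_property(("no_leak", "state"), HLS_TO_RTL))
# [('no_leak', 'round_state_q'), ('no_leak', 'round_state_shadow_q')]
```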
{"title":"AutoMap: Automated Mapping of Security Properties Between Different Levels of Abstraction in Design Flow","authors":"Bulbul Ahmed, Fahim Rahman, Nick Hooten, Farimah Farahmandi, M. Tehranipoor","doi":"10.1109/ICCAD51958.2021.9643467","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643467","url":null,"abstract":"The security of system-on-chip (SoC) designs is threatened by many vulnerabilities introduced by untrusted third-party IPs, and designers and CAD tools' lack of awareness of security requirements. Ensuring the security of an SoC has become highly challenging due to the diverse threat models, high design complexity, and lack of effective security-aware verification solutions. Moreover, new security vulnerabilities are introduced during the design transformation from higher to lower abstraction levels. As a result, security verification becomes a major bottleneck that should be performed at every level of design abstraction. Reducing the verification effort by mapping the security properties at different design stages could be an efficient solution to lower the total verification time if the new vulnerabilities introduced at different abstraction levels are addressed properly. To address this challenge, we introduce AutoMap that, in addition to the mapping, extends and expands the security properties to identify new vulnerabilities introduced when the design moves from higher-to lower-level abstraction. Starting at the higher abstraction level with a defined set of security properties for the target threat models, AutoMap automatically maps the properties to the lower levels of abstraction to reduce the verification effort. Furthermore, it extends and expands the properties to cover new vulnerabilities introduced by design transformations and updates to the lower abstraction level. We demonstrate AutoMap's efficacy by applying it to AES, RSA, and SHA256 at C++, RTL, and gate-level. We show that AutoMap effectively facilitates the detection of security vulnerabilities from different sources during the design transformation.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129594154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Optimal Algorithm for Splitter and Buffer Insertion in Adiabatic Quantum-Flux-Parametron Circuits
Pub Date: 2021-11-01, DOI: 10.1109/ICCAD51958.2021.9643456
Chao-Yuan Huang, Yi-Chen Chang, Ming-Jer Tsai, Tsung-Yi Ho
The Adiabatic Quantum-Flux-Parametron (AQFP), which benefits from low power consumption and rapid switching, is an emerging superconducting logic family. Because of the rapid switching, the arrival times at the inputs of an AQFP gate are strictly specified, so additional buffers are needed to synchronize them. Meanwhile, to maintain the symmetric layout of gates and reduce undesired parasitic magnetic coupling, the AQFP cell library adopts a minimalist design method in which splitters are employed for gates with multiple fan-outs. Thus, an AQFP circuit may demand numerous splitters and buffers, resulting in considerable power consumption and delay. This motivates an effective splitter and buffer insertion algorithm for AQFP circuits. In this paper, we propose a dynamic programming-based algorithm that provides an optimal splitter and buffer insertion for each wire of the input circuit. Experimental results show that our method is fast and reduces the number of additional Josephson junctions (JJs) by 10% on complex circuits compared with the state-of-the-art method.
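The paper's dynamic program is not reproduced here, but the sketch below conveys the cost structure it optimizes, under simplifying assumptions of my own: fanout-2 splitters, one clock phase per cell, and sinks at known phase depths relative to the driver. It sweeps from the deepest phase upward and counts the buffer/splitter cells a single multi-fanout net requires.

```python
import math

def tree_cells(sink_depths):
    """Count buffer/splitter cells for one net, assuming each cell occupies
    one phase, a buffer drives 1 wire, and a splitter drives 2 (illustrative
    assumptions, not the paper's cell library)."""
    sinks_at = {}
    for d in sink_depths:
        sinks_at[d] = sinks_at.get(d, 0) + 1
    cells, wires_below = 0, 0
    for depth in range(max(sink_depths), 0, -1):
        inserted = math.ceil(wires_below / 2)  # cells placed at this phase
        cells += inserted
        # Wires that must arrive at this phase: sinks here plus the
        # inputs of the cells just inserted.
        wires_below = sinks_at.get(depth, 0) + inserted
    assert wires_below == 1, "driver (fanout 1) must feed exactly one wire"
    return cells

# Two sinks three phases away and one sink two phases away:
print(tree_cells([3, 3, 2]))  # 2: one splitter at phase 2, one at phase 1
```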
{"title":"An Optimal Algorithm for Splitter and Buffer Insertion in Adiabatic Quantum-Flux-Parametron Circuits","authors":"Chao-Yuan Huang, Yi-Chen Chang, Ming-Jer Tsai, Tsung-Yi Ho","doi":"10.1109/ICCAD51958.2021.9643456","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643456","url":null,"abstract":"The Adiabatic Quantum-Flux-Parametron (AQFP), which benefits from low power consumption and rapid switching, is one of the rising superconducting logics. Due to the rapid switching, the delay of the inputs of an AQFP gate is strictly specified so that additional buffers are needed to synchronize the delay. Meanwhile, to maintain the symmetry layout of gates and reduce the undesired parasitic magnetic coupling, the AQFP cell library adopts the minimalist design method in which splitters are employed for the gates with multiple fan-outs. Thus, an AQFP circuit may demand numerous splitters and buffers, resulting in a considerable amount of power consumption and delay. This provides a motivation for proposing an effective splitter and buffer insertion algorithm for the AQFP circuits. In this paper, we propose a dynamic programming-based algorithm that provides an optimal splitter and buffer insertion for each wire of the input circuit. Experimental results show that our method is fast, and has a 10% reduction of additional Josephson Junctions (JJs) in the complicated circuits compared with the state-of-the-art method.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128569035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Circuit-Based SAT Solver for Logic Synthesis
Pub Date: 2021-11-01, DOI: 10.1109/ICCAD51958.2021.9643505
He-Teng Zhang, Jie-Hong R. Jiang, A. Mishchenko
In recent years, SAT solving has been widely used to implement various circuit transformations in logic synthesis. However, off-the-shelf CNF-based SAT solvers often have suboptimal performance on these challenging optimization problems. This paper describes an application-specific circuit-based SAT solver for logic synthesis. The solver is based on Glucose, a state-of-the-art CNF-based solver, and adds a number of novel features that make it run faster on the multiple incremental SAT problems arising in redundancy removal and logic restructuring, among others. In particular, the circuit structure of the problem instance is leveraged in a new way to guide variable decisions and to converge to a solution faster for both satisfiable and unsatisfiable instances. Experimental results indicate that the proposed solver leads to a 2-4x speedup compared to the original Glucose.
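To make the circuit-based idea concrete, here is a deliberately naive justification-style search on an AND/NOT graph: required values propagate backward through gates, and branching happens only where the circuit offers a choice (an AND gate justified to 0). This toy is complete only for tree-shaped circuits; the paper's solver instead builds circuit-guided decisions, conflict-driven learning, and incremental interfaces on top of Glucose.

```python
# Gates: node -> ('AND', a, b) | ('NOT', a) | ('VAR',); a VAR is named
# by its own key. Returns a satisfying assignment dict, or None.
def justify(node, want, assign, graph):
    kind = graph[node][0]
    if kind == 'VAR':
        if node in assign:
            return assign if assign[node] == want else None
        return dict(assign, **{node: want})
    if kind == 'NOT':
        return justify(graph[node][1], not want, assign, graph)
    a, b = graph[node][1], graph[node][2]
    if want:  # AND = 1: both inputs are forced to 1 (no decision)
        res = justify(a, True, assign, graph)
        return justify(b, True, res, graph) if res is not None else None
    for side in (a, b):  # AND = 0: decide which input to set to 0
        res = justify(side, False, assign, graph)
        if res is not None:
            return res
    return None

# Is (x AND NOT y) satisfiable with output 1?
g = {'x': ('VAR',), 'y': ('VAR',),
     'ny': ('NOT', 'y'), 'out': ('AND', 'x', 'ny')}
print(justify('out', True, {}, g))  # {'x': True, 'y': False}
```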
{"title":"A Circuit-Based SAT Solver for Logic Synthesis","authors":"He-Teng Zhang, Jie-Hong R. Jiang, A. Mishchenko","doi":"10.1109/ICCAD51958.2021.9643505","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643505","url":null,"abstract":"In recent years SAT solving has been widely used to implement various circuit transformations in logic synthesis. However, off-the-shelf CNF-based SAT solvers often have suboptimal performance on these challenging optimization problems. This paper describes an application-specific circuit-based SAT solver for logic synthesis. The solver is based on Glucose, a state-of-the-art CNF-based solver and adds a number of novel features, which make it run faster on multiple incremental SAT problems arising in redundancy removal and logic restructuring among others. In particular, the circuit structure of the problem instance is leveraged in a new way to guide variable decisions and to converge to a solution faster for both satisfiable and unsatisfiable instances. Experimental results indicate that the proposed solver leads to a 2-4x speedup, compared to the original Glucose.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124566525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generating Architecture-Level Abstractions from RTL Designs for Processors and Accelerators Part I: Determining Architectural State Variables
Pub Date: 2021-11-01, DOI: 10.1109/ICCAD51958.2021.9643584
Yu Zeng, Bo-Yuan Huang, Hongce Zhang, Aarti Gupta, S. Malik
Today's Systems-on-Chips (SoCs) comprise general- and special-purpose programmable processors and specialized hardware modules referred to as accelerators. These accelerators serve as co-processors and are invoked through software or firmware. Thus, verifying SoCs requires co-verification of hardware with software/firmware. Co-verification using cycle-accurate hardware models is often not scalable and requires hardware abstractions. Among various abstractions, architecture-level abstractions are very effective as they retain only the software-visible state. An Instruction-Set Architecture (ISA) serves this role for processors, and such ISA-like abstractions are also desirable for accelerators. Manually creating such abstractions for accelerators is tedious and error-prone, and there is a growing need for automation in deriving them from existing Register-Transfer Level (RTL) implementations. An important part of this automation is determining which state variables to retain in the abstract model. For processors and accelerators, this set of variables is naturally the Architectural State Variables (ASVs): variables that are persistent across instructions. This paper presents the first work to automatically determine the ASVs of processors and accelerators from their RTL implementations. We propose three novel algorithms based on different characteristics of ASVs. Each algorithm provides a sound abstraction, i.e., an over-approximate set of ASVs. The quality of the abstraction is measured by the size of the computed set of ASVs. Experiments on several processors and accelerators demonstrate that the algorithms each perform best in different cases, and that by combining them a high-quality set of ASVs can be found in reasonable time.
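A toy version of one plausible ASV criterion suggested by the paper's framing: a register whose next-state function can hold its current value across an instruction boundary is a candidate architectural state variable, whereas purely transient pipeline state is overwritten every cycle. The next-state table below is invented for illustration and is not the paper's actual analysis.

```python
# reg -> set of signals its next value can come from; containing itself
# means the register has a "hold" path (all names invented).
NEXT_STATE = {
    "pc":       {"pc", "branch_target"},
    "regfile":  {"regfile", "alu_out"},
    "if_id_ir": {"imem_out"},              # pipeline latch: always rewritten
    "alu_tmp":  {"operand_a", "operand_b"},
}

def candidate_asvs(next_state):
    """Over-approximate ASV set: keep registers that can retain their own
    value, i.e., persist when not architecturally updated."""
    return {r for r, sources in next_state.items() if r in sources}

print(sorted(candidate_asvs(NEXT_STATE)))  # ['pc', 'regfile']
```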
{"title":"Generating Architecture-Level Abstractions from RTL Designs for Processors and Accelerators Part I: Determining Architectural State Variables","authors":"Yu Zeng, Bo-Yuan Huang, Hongce Zhang, Aarti Gupta, S. Malik","doi":"10.1109/ICCAD51958.2021.9643584","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643584","url":null,"abstract":"Today's Systems-on-Chips (SoCs) comprise general/special purpose programmable processors and specialized hardware modules referred to as accelerators. These accelerators serve as co-processors and are invoked through software or firmware. Thus, verifying SoCs requires co-verification of hardware with software/firmware. Co-verification using cycle-accurate hardware models is often not scalable, and requires hardware abstractions. Among various abstractions, architecture-level abstractions are very effective as they retain only the software visible state. An Instruction-Set Architecture (ISA) serves this role for processors and such ISA-like abstractions are also desirable for accelerators. Manually creating such abstractions for accelerators is tedious and error-prone, and there is a growing need for automation in deriving them from existing Register-Transfer Level (RTL) implementations. An important part of this automation is determining which state variables to retain in the abstract model. For processors and accelerators, this set of variables is naturally the Architectural State Variables (ASVs) - variables that are persistent across instructions. This paper presents the first work to automatically determine ASVs of processors and accelerators from their RTL implementations. We propose three novel algorithms based on different characteristics of ASVs. Each algorithm provides a sound abstraction, i.e., an over-approximate set of ASVs. The quality of the abstraction is measured by the size of the set of ASVs computed. Experiments on several processors and accelerators demonstrate that these algorithms perform best in different cases, and by combining them a high quality set of ASVs can be found in reasonable time.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127134926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rerec: In-ReRAM Acceleration with Access-Aware Mapping for Personalized Recommendation
Pub Date: 2021-11-01, DOI: 10.1109/ICCAD51958.2021.9643573
Yitu Wang, Zhenhua Zhu, Fan Chen, Mingyuan Ma, Guohao Dai, Yu Wang, Hai Helen Li, Yiran Chen
Personalized recommendation systems are widely used in many Internet services. The sparse embedding lookup in recommendation models dominates the computational cost of inference due to its intensive irregular memory accesses. Applying a resistive random access memory (ReRAM) based processing-in-memory (PIM) architecture to accelerate recommendation processing can avoid the data movement caused by off-chip memory accesses. However, naïve adoption of ReRAM-based DNN accelerators leads to low computation parallelism and severe under-utilization of computing resources, caused by the fine-grained inner-products in feature interaction. In this paper, we propose Rerec, an architecture-algorithm co-designed accelerator that specializes in fine-grained ReRAM-based inner-product engines with an access-aware mapping algorithm for recommendation inference. At the architecture level, we reduce the size and increase the number of crossbars. The crossbars are fully connected by Analog-to-Digital Converters (ADCs) within one inner-product engine, which can adapt to the fine-grained and irregular computational patterns and improve processing parallelism. We further explore trade-offs of (i) crossbar size vs. hardware utilization and (ii) ADC implementation vs. area/energy efficiency to optimize the design. At the algorithm level, we propose a novel access-aware mapping (AAM) algorithm to optimize resource allocation. Our AAM algorithm tackles (i) the workload imbalance and (ii) the long recommendation inference latency induced by the large variance in the access frequency of embedding vectors. Experimental results show that Rerec achieves a 7.69x speedup compared with a ReRAM-based baseline design. Compared to a CPU and the state-of-the-art recommendation accelerator, Rerec demonstrates 29.26x and 3.48x performance improvements, respectively.
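A minimal sketch of the access-aware flavor of such a mapping: spread hot embedding vectors across inner-product engines so no single engine serializes most lookups. This is a plain greedy longest-processing-time assignment under assumptions of my own (engine count, frequency table); the paper's AAM algorithm also accounts for crossbar geometry and other constraints not modeled here.

```python
import heapq

def access_aware_map(freqs, n_engines):
    """freqs: {vector_id: expected lookups}. Returns engine -> vector list,
    assigning the hottest remaining vector to the least-loaded engine."""
    heap = [(0, e, []) for e in range(n_engines)]  # (load, engine, vectors)
    heapq.heapify(heap)
    for vec, f in sorted(freqs.items(), key=lambda kv: -kv[1]):
        load, e, vecs = heapq.heappop(heap)
        vecs.append(vec)
        heapq.heappush(heap, (load + f, e, vecs))
    return {e: vecs for _, e, vecs in heap}

# Invented access profile: two hot vectors, four cold ones, two engines.
freqs = {"v0": 900, "v1": 850, "v2": 40, "v3": 30, "v4": 20, "v5": 10}
print(access_aware_map(freqs, 2))  # hot vectors land on different engines
```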
{"title":"Rerec: In-ReRAM Acceleration with Access-Aware Mapping for Personalized Recommendation","authors":"Yitu Wang, Zhenhua Zhu, Fan Chen, Mingyuan Ma, Guohao Dai, Yu Wang, Hai Helen Li, Yiran Chen","doi":"10.1109/ICCAD51958.2021.9643573","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643573","url":null,"abstract":"Personalized recommendation systems are widely used in many Internet services. The sparse embedding lookup in recommendation models dominates the computational cost of inference due to its intensive irregular memory accesses. Applying resistive random access memory (ReRAM) based process-in-memory (PIM) architecture to accelerate recommendation processing can avoid data movements caused by off-chip memory accesses. However, naïve adoption of ReRAM-based DNN accelerators leads to low computation parallelism and severe under-utilization of computing resources, which is caused by the fine-grained inner-product in feature interaction. In this paper, we propose Rerec, an architecture-algorithm co-designed accelerator, which specializes in fine-grained ReRAM-based inner-product engines with access-aware mapping algorithm for recommendation inference. At the architecture level, we reduce the size and increase the amount of crossbars. The crossbars are fully-connected by Analog-to-Digital Converters (ADCs) in one inner-product engine, which can adapt to the fine-grained and irregular computational patterns and improve the processing parallelism. We further explore trade-offs of (i) crossbar size vs. hardware utilization, and (ii) ADC implementation vs. area/energy efficiency to optimize the design. At the algorithm level, we propose a novel access-aware mapping (AAM) algorithm to optimize resource allocations. Our AAM algorithm tackles the problems of (i) the workload imbalance and (ii) the long recommendation inference latency induced by the great variance of access frequency of embedding vectors. Experimental results show that Rerecachieves 7.69x speedup compared with a ReRAM-based baseline design. Compared to CPU and the state-of-the-art recommendation accelerator, Rerecdemonstrates 29.26x and 3.48x performance improvement, respectively.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130459483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AMF-Placer: High-Performance Analytical Mixed-size Placer for FPGA
Pub Date: 2021-11-01, DOI: 10.1109/ICCAD51958.2021.9643574
Tingyuan Liang, Gengjie Chen, Jieru Zhao, Sharad Sinha, Wei Zhang
To enable performance optimization of application mapping on modern field-programmable gate arrays (FPGAs), certain critical-path portions of a design may be prearranged into many multi-cell macros during synthesis. These movable macros, with their shape and resource constraints, lead to a challenging mixed-size placement problem for FPGA designs that previous analytical placers cannot address. In this work, we propose AMF-Placer, an open-source analytical mixed-size FPGA placer with an interface to Xilinx Vivado. To speed up convergence and improve placement quality, AMF-Placer is equipped with a series of new techniques for wirelength optimization, cell spreading, packing, and legalization. On a set of the latest large open-source benchmarks from various domains for Xilinx UltraScale FPGAs, experimental results indicate that AMF-Placer can improve HPWL by 20.4%-89.3% and reduce runtime by 8.0%-84.2% compared to the baseline. Furthermore, by exploiting the parallelism of the proposed algorithms, the placement procedure can be accelerated by 2.41x on average with 8 threads.
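For reference, the HPWL figure quoted above is the standard half-perimeter wirelength: for each net, the half-perimeter of the bounding box of its pin positions, summed over all nets. A minimal version, with an invented toy netlist:

```python
def hpwl(nets, pos):
    """nets: {net: [cell, ...]}, pos: {cell: (x, y)}. Sum of bounding-box
    half-perimeters over all nets."""
    total = 0.0
    for cells in nets.values():
        xs = [pos[c][0] for c in cells]
        ys = [pos[c][1] for c in cells]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

pos = {"a": (0, 0), "b": (3, 4), "c": (1, 2)}
print(hpwl({"n1": ["a", "b", "c"], "n2": ["b", "c"]}, pos))  # 7.0 + 4.0 = 11.0
```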
{"title":"AMF-Placer: High-Performance Analytical Mixed-size Placer for FPGA","authors":"Tingyuan Liang, Gengjie Chen, Jieru Zhao, Sharad Sinha, Wei Zhang","doi":"10.1109/ICCAD51958.2021.9643574","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643574","url":null,"abstract":"To enable the performance optimization of application mapping on modern field-programmable gate arrays (FPGAs), certain critical path portions of the designs might be prearranged into many multi-cell macros during synthesis. These movable macros with constraints of shape and resources lead to challenging mixed-size placement for FPGA designs which cannot be addressed by previous works of analytical placers. In this work, we propose AMF-Placer, an open-source Analytical Mixed-size FPGA placer supporting mixed-size placement on FPGA, with an interface to Xilinx Vivado. To speed up the convergence and improve the quality of the placement, AMF-Placer is equipped with a series of new techniques for wirelength optimization, cell spreading, packing, and legalization. Based on a set of the latest large open-source benchmarks from various domains for Xilinx Ultrascale FPGAs, experimental results indicate that AMF-Placer can improve HPWL by 20.4%-89.3% and reduce runtime by 8.0%-84.2%, compared to the baseline. Furthermore, utilizing the parallelism of the proposed algorithms, with 8 threads, the placement procedure can be accelerated by 2.41x on average.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116930343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Heterogeneous Manycore Architectures Enabled by Processing-in-Memory for Deep Learning: From CNNs to GNNs: (ICCAD Special Session Paper)
Pub Date: 2021-11-01, DOI: 10.1109/ICCAD51958.2021.9643559
Biresh Kumar Joardar, Aqeeb Iqbal Arka, J. Doppa, P. Pande, Hai Helen Li, K. Chakrabarty
Resistive random-access memory (ReRAM)-based processing-in-memory (PIM) architectures have recently become a popular architectural choice for deep-learning applications. ReRAM-based architectures can accelerate the inference and training of deep learning algorithms and are more energy-efficient than traditional GPUs. However, these architectures have various limitations that affect model accuracy and performance. Moreover, the choice of deep-learning application also imposes new design challenges that must be addressed to achieve high performance. In this paper, we present the advantages and challenges associated with ReRAM-based PIM architectures by considering Convolutional Neural Networks (CNNs) and Graph Neural Networks (GNNs) as important application domains. We also outline methods that can be used to address these challenges.
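A compact sketch of the core ReRAM-PIM operation both CNN and GNN accelerators build on: a matrix-vector multiply performed in place by a crossbar, with weights stored as conductances and inputs applied as voltages (Ohm's law plus Kirchhoff current summation). The quantization step models the limited number of conductance levels in real devices; the sizes and level count are made up for illustration.

```python
import numpy as np

def crossbar_mvm(weights, x, levels=16):
    """Approximate W.T @ x as a crossbar would: quantize weights to a few
    discrete conductance states, then sum column currents."""
    w_max = np.abs(weights).max()
    g = np.round(weights / w_max * (levels - 1)) / (levels - 1) * w_max
    return g.T @ x  # column current = sum over rows of conductance * voltage

rng = np.random.default_rng(0)
W, x = rng.standard_normal((8, 4)), rng.standard_normal(8)
print(crossbar_mvm(W, x))  # quantized in-crossbar result
print(W.T @ x)             # ideal digital result, for comparison
```

The gap between the two printed vectors is one source of the accuracy limitations the paper discusses.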
{"title":"Heterogeneous Manycore Architectures Enabled by Processing-in-Memory for Deep Learning: From CNNs to GNNs: (ICCAD Special Session Paper)","authors":"Biresh Kumar Joardar, Aqeeb Iqbal Arka, J. Doppa, P. Pande, Hai Helen Li, K. Chakrabarty","doi":"10.1109/ICCAD51958.2021.9643559","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643559","url":null,"abstract":"Resistive random-access memory (ReRAM)-based processing-in-memory (PIM) architectures have recently become a popular architectural choice for deep-learning applications. ReRAM-based architectures can accelerate inferencing and training of deep learning algorithms and are more energy efficient compared to traditional GPUs. However, these architectures have various limitations that affect the model accuracy and performance. Moreover, the choice of the deep-learning application also imposes new design challenges that must be addressed to achieve high performance. In this paper, we present the advantages and challenges associated with ReRAM-based PIM architectures by considering Convolutional Neural Networks (CNNs) and Graph Neural Networks (GNNs) as important application domains. We also outline methods that can be used to address these challenges.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"194 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132640181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Framework for Area-efficient Multi-task BERT Execution on ReRAM-based Accelerators
Pub Date: 2021-11-01, DOI: 10.1109/ICCAD51958.2021.9643471
Myeonggu Kang, Hyein Shin, Jaekang Shin, L. Kim
Owing to its superior algorithmic performance, BERT has become the de facto standard model for various NLP tasks. Accordingly, multiple BERT models are often deployed on a single system, which is called multi-task BERT. Although ReRAM-based accelerators show sufficient potential to execute a single BERT model by adopting in-memory computation, processing multi-task BERT on a ReRAM-based accelerator drastically increases the overall area due to the multiple fine-tuned models. In this paper, we propose a framework for area-efficient multi-task BERT execution on ReRAM-based accelerators. First, we decompose the fine-tuned model of each task with respect to the base model. After that, we propose a two-stage weight compressor, which shrinks the decomposed models by analyzing the properties of the ReRAM-based accelerator. We also present a profiler to generate hyper-parameters for the proposed compressor. By sharing the base model and compressing the decomposed models, the proposed framework successfully reduces the total area of the ReRAM-based accelerator without an additional training procedure. It achieves 0.26x the area of the baseline while maintaining the algorithmic performance.
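A small sketch of the decomposition step described above: store the shared base-model weights once, and for each task keep only a sparse delta between the fine-tuned and base weights. The magnitude-pruning threshold here is a stand-in for the paper's two-stage compressor, whose actual criteria (and ReRAM-aware choices) are not reproduced.

```python
import numpy as np

def decompose(base, finetuned, keep=0.10):
    """Return a sparse per-task delta keeping only the largest `keep`
    fraction of |finetuned - base| entries (illustrative heuristic)."""
    delta = finetuned - base
    thresh = np.quantile(np.abs(delta), 1.0 - keep)
    return np.where(np.abs(delta) >= thresh, delta, 0.0)

rng = np.random.default_rng(1)
base = rng.standard_normal((64, 64))
task = base + 0.05 * rng.standard_normal((64, 64))  # fine-tuning drift
sparse_delta = decompose(base, task)
approx_task = base + sparse_delta                   # reconstructed task model
print(np.count_nonzero(sparse_delta) / sparse_delta.size)  # ~0.10
```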
{"title":"A Framework for Area-efficient Multi-task BERT Execution on ReRAM-based Accelerators","authors":"Myeonggu Kang, Hyein Shin, Jaekang Shin, L. Kim","doi":"10.1109/ICCAD51958.2021.9643471","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643471","url":null,"abstract":"With the superior algorithmic performances, BERT has become the de-facto standard model for various NLP tasks. Accordingly, multiple BERT models have been adopted on a single system, which is also called multi-task BERT. Although the ReRAM-based accelerator shows the sufficient potential to execute a single BERT model by adopting in-memory computation, processing multi-task BERT on the ReRAM-based accelerator extremely increases the overall area due to multiple fine-tuned models. In this paper, we propose a framework for area-efficient multi-task BERT execution on the ReRAM-based accelerator. Firstly, we decompose the fine-tuned model of each task by utilizing the base-model. After that, we propose a two-stage weight compressor, which shrinks the decomposed models by analyzing the properties of the ReRAM-based accelerator. We also present a profiler to generate hyper-parameters for the proposed compressor. By sharing the base-model and compressing the decomposed models, the proposed framework successfully reduces the total area of the ReRAM-based accelerator without an additional training procedure. It achieves a 0.26 x area than baseline while maintaining the algorithmic performances.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"23 9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116407809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HeteroCPPR: Accelerating Common Path Pessimism Removal with Heterogeneous CPU-GPU Parallelism
Pub Date: 2021-11-01, DOI: 10.1109/ICCAD51958.2021.9643457
Zizheng Guo, Tsung-Wei Huang, Yibo Lin
Common path pessimism removal (CPPR) is a key step in eliminating unwanted pessimism during static timing analysis (STA). Unwanted pessimism forces designers and optimization algorithms to waste a significant yet unnecessary amount of effort fixing paths that already meet the intended timing constraints. However, CPPR is extremely time-consuming and can incur 10–100× runtime overheads to complete. Existing solutions for speeding up CPPR are architecturally constrained by CPU-only parallelism, and their runtimes do not scale beyond 8–16 cores. In this paper, we introduce HeteroCPPR, a new algorithm that accelerates CPPR by harnessing the power of heterogeneous CPU-GPU parallelism. We devise an efficient CPU-GPU task decomposition strategy and highly optimized GPU kernels to handle CPPR for large numbers of paths. HeteroCPPR can also scale to multiple GPUs. As an example, HeteroCPPR is up to 16× faster than a state-of-the-art CPU-parallel CPPR algorithm for completing the analysis of 10K post-CPPR critical paths in a million-gate design on a machine with 40 CPUs and 4 GPUs.
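For readers unfamiliar with the pessimism being removed: when the launch and capture clock paths of a path pair share a prefix in the clock tree, that prefix cannot simultaneously be at its early and late corner, so the shared (late - early) delay is credited back to the slack. A minimal per-path-pair version, with invented node names and delays (the paper's contribution is computing this at scale on CPUs and GPUs, not the credit formula itself):

```python
def cppr_credit(launch_path, capture_path, early, late):
    """Paths are node lists from the clock root; early/late map each
    clock-tree node to its min/max delay. Credit = late-early slack of
    the common prefix."""
    credit = 0.0
    for a, b in zip(launch_path, capture_path):
        if a != b:                    # first divergence: stop crediting
            break
        credit += late[a] - early[a]  # shared segment sees one corner only
    return credit

early = {"root": 0.0, "buf1": 0.9, "buf2": 0.8}
late  = {"root": 0.0, "buf1": 1.1, "buf2": 1.0}
launch  = ["root", "buf1", "buf2", "ff_launch"]
capture = ["root", "buf1", "ff_capture"]
print(cppr_credit(launch, capture, early, late))  # 0.0 + 0.2 = 0.2
```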
{"title":"HeteroCPPR: Accelerating Common Path Pessimism Removal with Heterogeneous CPU-GPU Parallelism","authors":"Zizheng Guo, Tsung-Wei Huang, Yibo Lin","doi":"10.1109/ICCAD51958.2021.9643457","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643457","url":null,"abstract":"Common path pessimism removal (CPPR) is a key step to eliminating unwanted pessimism during static timing analysis (STA). Unwanted pessimism will force designers and optimization algorithms to waste a significant yet unnecessary amount of effort on fixing paths that meet the intended timing constraints. However, CPPR is extremely time-consuming and can incur 10–100× runtime overheads to complete. Existing solutions for speeding up CPPR are architecturally constrained by CPU-only parallelism, and their runtimes do not scale beyond 8–16 cores. In this paper, we introduce HeteroCPPR, a new algorithm to accelerate CPPR by harnessing the power of heterogeneous CPU-GPU parallelism. We devise an efficient CPU-GPU task decomposition strategy and highly optimized GPU kernels to handle CPPR that scales to large numbers of paths. Also, HeteroCPPR can scale to multiple GPUs. As an example, HeteroCPPR is up to 16×faster than a state-of-the-art CPU-parallel CPPR algorithm for completing the analysis of 10K post-CPPR critical paths in a million-gate design under a machine of 40 CPUs and 4 GPUs.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117307705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}