Pub Date: 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643487
Yang Bai, Xufeng Yao, Qi Sun, Bei Yu
Performance optimization is the art of continuously seeking an effective mapping between algorithm and hardware. Existing deep learning compilers and frameworks optimize the computation graph by applying transformations designed manually by experts. We argue that these methods miss some possible graph-level optimizations and are therefore difficult to generalize to emerging deep learning models or new operators. In this work, we propose AutoGTCO, a tensor program generation system for vision tasks with the transformer architecture on GPU. In contrast to existing fusion strategies, AutoGTCO explores operator fusion in the transformer model through a novel dynamic programming algorithm. Specifically, to construct an effective search space of sampled programs, we propose new sketch generation rules and a search policy for the batch matrix multiplication and softmax operators in each subgraph. These fuse the operators into large computation units, which are then mapped and transformed into efficient CUDA kernels. Overall, our evaluation on three real-world transformer-based vision tasks shows that AutoGTCO improves execution performance relative to the deep learning engine TensorRT by up to 1.38×.
{"title":"AutoGTCO: Graph and Tensor Co-Optimize for Image Recognition with Transformers on GPU","authors":"Yang Bai, Xufeng Yao, Qi Sun, Bei Yu","doi":"10.1109/ICCAD51958.2021.9643487","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643487","journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)"},"publicationDate":"2021-11-01","platform":"Semanticscholar","paperid":"130852835"}
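The AutoGTCO abstract mentions a dynamic programming algorithm for operator fusion but gives no details. As a rough illustration of the general idea only, the sketch below partitions a linear chain of operators into contiguous fused groups that minimize a toy cost. The cost model (a fixed per-kernel launch overhead plus the sum of per-operator costs), the group-size cap, and all names are assumptions made for this example and are not taken from the paper.

```python
def best_fusion(op_costs, launch_overhead, max_group):
    """Partition a chain of operators into contiguous fused groups.

    Hypothetical cost model: each fused group pays one launch overhead
    plus the sum of its operators' costs, and may hold at most
    `max_group` operators (a stand-in for a resource limit).
    Returns (total_cost, groups), where groups are half-open index ranges.
    """
    n = len(op_costs)
    best = [0.0] + [float("inf")] * n   # best[i]: min cost of the first i ops
    cut = [0] * (n + 1)                 # cut[i]: start index of the last group
    for i in range(1, n + 1):
        for j in range(max(0, i - max_group), i):
            cost = best[j] + launch_overhead + sum(op_costs[j:i])
            if cost < best[i]:
                best[i], cut[i] = cost, j
    # Walk the cut points backwards to recover the chosen groups.
    groups = []
    i = n
    while i > 0:
        groups.append((cut[i], i))
        i = cut[i]
    return best[n], groups[::-1]


# Four operators with made-up costs; at most two operators per fused kernel.
total, groups = best_fusion([4, 1, 4, 2], launch_overhead=3, max_group=2)
print(total, groups)
```

A real system would replace the toy cost with measured or learned kernel costs and would operate on a graph rather than a chain.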
A popular way to implement an arithmetic function is through a lookup table (LUT), which stores the pre-computed outputs for all inputs. However, its size grows exponentially with the number of input bits. In this work, targeting the computing kernels of error-tolerant applications, we propose DALTA, a reconfigurable decomposition-based approximate lookup table architecture that implements those kernels approximately at a dramatically reduced size. We also propose integer linear programming-based approximate decomposition methods to map a given function to the architecture. Our architecture features low energy consumption and high speed. Experimental results show that it reduces energy and latency by 56.5% and 92.4%, respectively, compared with the state-of-the-art approximate LUT architecture.
{"title":"DALTA: A Decomposition-based Approximate Lookup Table Architecture","authors":"Chang Meng, Z. Xiang, Niyiqiu Liu, Yixuan Hu, Jiahao Song, Runsheng Wang, Ru Huang, Weikang Qian","doi":"10.1109/ICCAD51958.2021.9643562","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643562","journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)"},"publicationDate":"2021-11-01","platform":"Semanticscholar","paperid":"131200984"}
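To make the size argument in the DALTA abstract concrete, the following sketch compares the storage of a flat LUT with that of a simple disjoint decomposition of the form f(x) ≈ F(g(x_bound), x_free), where g has a single output bit. This textbook form is an assumption chosen for illustration; the abstract does not specify the exact decomposition DALTA uses, and the function names are hypothetical.

```python
def lut_bits(n_inputs, n_outputs=1):
    """Storage (in bits) of a flat LUT: one entry per input combination."""
    return (2 ** n_inputs) * n_outputs


def decomposed_bits(n_inputs, bound_size):
    """Storage (in bits) for one output bit of f(x) ~ F(g(x_bound), x_free).

    Assumed structure: g is a 1-bit LUT over `bound_size` inputs, and F is a
    1-bit LUT over the remaining free inputs plus the output of g.
    """
    free_size = n_inputs - bound_size
    g_bits = lut_bits(bound_size)
    f_bits = lut_bits(free_size + 1)
    return g_bits + f_bits


# One output bit of a 16-input function, split into 8 bound and 8 free inputs.
print(lut_bits(16))            # flat table: 65536 bits
print(decomposed_bits(16, 8))  # decomposed: 256 + 512 = 768 bits
```

The saving comes at the price of exactness: most functions cannot be written in this form without error, which is why the approach targets error-tolerant kernels.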
Pub Date: 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643519
Shouvanik Chakrabarti, Xuchen You, Xiaodi Wu
Quantum variational methods are promising near-term applications of quantum machines, both because of their potential advantages in solving certain computational tasks and understanding quantum physics and because of their feasibility on near-term hardware. However, many challenges remain before their full potential can be realized, especially in designing efficient training methods for domain-specific quantum variational ansatzes. This paper proposes a theory-guided principle for tackling the training problem of quantum variational methods and highlights some successful examples.
{"title":"ICCAD Special Session Paper: Quantum Variational Methods for Quantum Applications","authors":"Shouvanik Chakrabarti, Xuchen You, Xiaodi Wu","doi":"10.1109/ICCAD51958.2021.9643519","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643519","journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)"},"publicationDate":"2021-11-01","platform":"Semanticscholar","paperid":"115522640"}
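For readers unfamiliar with how a quantum variational method is trained, here is a minimal classical simulation of the smallest possible case: a one-qubit ansatz RY(θ)|0⟩ whose energy ⟨Z⟩ is minimized by gradient descent, with the gradient obtained from the standard parameter-shift rule. This is a generic textbook illustration, not the theory-guided principle the paper proposes; the starting angle, learning rate, and step count are arbitrary choices.

```python
import math


def energy(theta):
    """<psi|Z|psi> for |psi> = RY(theta)|0> = cos(theta/2)|0> + sin(theta/2)|1>."""
    return math.cos(theta / 2) ** 2 - math.sin(theta / 2) ** 2


def parameter_shift_grad(theta):
    """Gradient of the energy from two shifted evaluations of the same circuit."""
    return 0.5 * (energy(theta + math.pi / 2) - energy(theta - math.pi / 2))


def train(theta=0.3, lr=0.4, steps=100):
    """Plain gradient descent on the single ansatz parameter."""
    for _ in range(steps):
        theta -= lr * parameter_shift_grad(theta)
    return theta


theta_opt = train()
print(theta_opt, energy(theta_opt))  # theta approaches pi, energy approaches -1
```

On hardware, each call to `energy` would be an estimate from repeated measurements, which is one reason training efficiency matters.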
Pub Date: 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643513
Sanmitra Banerjee, Arjun Chaudhuri, Jinwoo Kim, Gauthaman Murali, M. Nelson, S. Lim, K. Chakrabarty
Carbon nanotube FETs (CNFETs) are emerging as an alternative to silicon devices for next-generation computing systems. However, imperfect carbon nanotube deposition during CNFET fabrication can lead to the formation of difficult-to-etch CNT aggregates in the active layer. These CNT aggregates can form parasitic CNFETs (para-FETs) that are modulated by adjoining gate contacts or back-end-of-line metal layers, thereby forming conditional shorts and stuck-at faults. We show that even weak (parametric) para-FETs can lead to a degraded static noise margin in CNFET-based design. We propose ParaMitE, a layout optimization method that horizontally flips selected standard cells in situ to minimize the number of para-FETs that can arise due to unetched CNTs. As we modify only the cell orientation (and not the cell placement), the impact on the power, timing, and wire length of the CNFET-based design is negligible. Simulation results for several benchmarks show that the proposed method can mitigate up to 60% of the possible para-FET locations (90% of the most critical locations) with only a 3% increase in the total wire length. ParaMitE can enable yield ramp-up at the foundry by providing guidance on which para-FETs can be avoided by design, and conversely, which CNT aggregates must be removed through processing steps.
{"title":"ParaMitE: Mitigating Parasitic CNFETs in the Presence of Unetched CNTs","authors":"Sanmitra Banerjee, Arjun Chaudhuri, Jinwoo Kim, Gauthaman Murali, M. Nelson, S. Lim, K. Chakrabarty","doi":"10.1109/ICCAD51958.2021.9643513","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643513","journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)"},"publicationDate":"2021-11-01","platform":"Semanticscholar","paperid":"117042537"}
Pub Date: 2021-11-01 DOI: 10.1109/iccad51958.2021.9643560
Fu-Yu Chuang, Yao-Wen Chang
Photonic integrated circuits (PICs), which introduce optical interconnections for on-chip communication, have become one of the most promising solutions to the growing demand for high bandwidth and low power consumption. Routing techniques for optical interconnections have been proposed to deal with various routing issues in PICs, including transmission losses and thermal reliability. However, in some emerging applications, different optical paths must be closely matched (in terms of path length, number of bends, radius of curvature of bends, and crossing count) to operate correctly. To the best of our knowledge, no previous work deals with these matching constraints in optical routing. This paper proposes a complete algorithm flow, based on optimal Steiner tree construction and integer linear programming with a hexagonal routing style, that handles the matching constraints while minimizing the total transmission loss in a design. Experimental results on optical netlists from a state-of-the-art work show that, compared with A*-search-based net-matching routing, our optical router routes all nets without violating any matching constraint while achieving lower total and maximum transmission loss.
{"title":"On-chip Optical Routing with Waveguide Matching Constraints","authors":"Fu-Yu Chuang, Yao-Wen Chang","doi":"10.1109/iccad51958.2021.9643560","DOIUrl":"https://doi.org/10.1109/iccad51958.2021.9643560","journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)"},"publicationDate":"2021-11-01","platform":"Semanticscholar","paperid":"116015182"}
Pub Date: 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643507
Ching-Cheng Wang, Wai-Kei Mak
In a single-flux-quantum (SFQ) circuit, almost all cells need to receive the clock signal, which incurs a high clock routing overhead. Moreover, the clock tree of an SFQ circuit requires a clock splitter cell at every tree branching point, which makes the conventional design flow of placement followed by clock tree synthesis ineffective at producing a high-quality clock tree with low clock skew. To address these issues, we propose a two-stage global placement methodology and a placement refinement algorithm applied after placement legalization. The two-stage global placement methodology first applies a conventional global placement algorithm to spread the cells of the given SFQ circuit evenly, follows this with clock tree synthesis and clock splitter insertion, and then performs a second stage of global placement that re-places the original cells and the clock splitters at the same time. In the second stage, the look-ahead legalization technique is used to spread out the original cells and the clock splitters, and the clock tree is re-synthesized several times to obtain an optimized clock tree topology in which the clock splitters overlap little with the original circuit cells. In addition, the total wirelength of the data signals and the clock signal is optimized concurrently. After the placement of all cells has been legalized, our placement refinement method can be run to further reduce the clock skew. Compared with the previous state-of-the-art work, we reduce the total half-perimeter wirelength and the clock skew by 9% and 31% on average, respectively.
{"title":"A Novel Clock Tree Aware Placement Methodology for Single Flux Quantum (SFQ) Logic Circuits","authors":"Ching-Cheng Wang, Wai-Kei Mak","doi":"10.1109/ICCAD51958.2021.9643507","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643507","journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)"},"publicationDate":"2021-11-01","platform":"Semanticscholar","paperid":"125998549"}
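The SFQ placement abstract reports results in terms of half-perimeter wirelength (HPWL). As a reminder of how this standard placement metric is computed, the sketch below takes each net as a list of pin coordinates and sums the half-perimeters of the nets' bounding boxes. The coordinates are made up for illustration.

```python
def hpwl(net):
    """Half-perimeter of the bounding box around one net's pins."""
    xs = [x for x, _ in net]
    ys = [y for _, y in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_hpwl(nets):
    """Sum of half-perimeter wirelengths over all nets."""
    return sum(hpwl(net) for net in nets)


# Two example nets: a two-pin net and a three-pin net.
nets = [
    [(0, 0), (3, 4)],
    [(1, 1), (2, 5), (4, 2)],
]
print(total_hpwl(nets))  # 7 + 7 = 14
```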
Pub Date: 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643447
J. Lienig, Susann Rothe, Matthias Thiele, N. Rangarajan, M. Ashraf, M. Nabeel, H. Amrouch, O. Sinanoglu, J. Knechtel
The reliable operation of ICs is subject to physical effects like electromigration, thermal and stress migration, negative bias temperature instability, hot-carrier injection, etc. While these effects have been studied thoroughly for IC design, threats of their subtle exploitation are not captured well yet. In this paper, we open up a path for security closure of physical layouts in the face of reliability effects. Toward that end, we first review migration effects in interconnects and aging effects in transistors, along with established and emerging means for handling these effects during IC design. Next, we study security threats arising from these effects; in particular, we cover migration effects-based, disruptive Trojans and aging-exacerbated side-channel leakage. Finally, we outline corresponding strategies for security closure of physical layouts, along with an outline for CAD frameworks.
{"title":"Toward Security Closure in the Face of Reliability Effects ICCAD Special Session Paper","authors":"J. Lienig, Susann Rothe, Matthias Thiele, N. Rangarajan, M. Ashraf, M. Nabeel, H. Amrouch, O. Sinanoglu, J. Knechtel","doi":"10.1109/ICCAD51958.2021.9643447","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643447","journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)"},"publicationDate":"2021-11-01","platform":"Semanticscholar","paperid":"124741942"}
The Resistive Random-Access-Memory (ReRAM) crossbar is one of the most promising neural network accelerators, thanks to its in-memory and in-situ analog computing abilities for Matrix Multiplication-and-Accumulations (MACs). Nevertheless, the number of rows and columns of ReRAM cells available for concurrent execution of MACs is constrained, which limits in-memory computing throughput. Moreover, it is challenging to deploy large Deep Neural Network (DNN) models in the crossbar, since the sparsity of DNNs cannot be effectively exploited in the crossbar structure. As a countermeasure, we develop a novel ReRAM-based DNN accelerator, named Bit-Transformer, which focuses on the correlation between bit-level sparsity and the performance of the ReRAM-based crossbar. We propose a bit-flip scheme combined with exponent-based quantization, which adaptively flips the bits of the mapped DNNs to release redundant space without sacrificing much accuracy or incurring much hardware overhead. We also design an architecture that integrates these techniques to massively shrink the crossbar footprint required. In this way, it efficiently leverages bit-level sparsity for performance gains while reducing the energy consumption of computation. Comprehensive experiments indicate that Bit-Transformer outperforms prior state-of-the-art designs by up to 13×, 35×, and 67× in terms of energy efficiency, area efficiency, and throughput, respectively. Code will be open-source in the camera-ready version.
{"title":"Bit-Transformer: Transforming Bit-level Sparsity into Higher Preformance in ReRAM-based Accelerator","authors":"Fangxin Liu, Wenbo Zhao, Zhezhi He, Zongwu Wang, Yilong Zhao, Yongbiao Chen, Li Jiang","doi":"10.1109/ICCAD51958.2021.9643569","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643569","journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)"},"publicationDate":"2021-11-01","platform":"Semanticscholar","paperid":"126901899"}
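To clarify what "bit-level sparsity" means in the Bit-Transformer abstract, the sketch below measures the fraction of zero bits in a set of unsigned quantized weights and lists the bit positions that are zero in every weight. The assumption that each bit position maps to its own crossbar column group, so that an all-zero position could be released, is an illustrative simplification; the paper's bit-flip scheme and exponent-based quantization are not reproduced here, and the weights are made up.

```python
def bit_sparsity(weights, bits=8):
    """Fraction of zero bits across all weights (unsigned, `bits` wide)."""
    mask = (1 << bits) - 1
    ones = sum(bin(w & mask).count("1") for w in weights)
    return 1 - ones / (len(weights) * bits)


def empty_bit_planes(weights, bits=8):
    """Bit positions (0 = least significant) that are zero in every weight."""
    return [b for b in range(bits) if all((w >> b) & 1 == 0 for w in weights)]


# Four made-up 8-bit weights.
weights = [0b00000001, 0b00010000, 0b00000000, 0b11110000]
print(bit_sparsity(weights))      # 26 of 32 bits are zero -> 0.8125
print(empty_bit_planes(weights))  # positions 1, 2 and 3 hold no set bit
```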
Pub Date: 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643589
Haoxing Ren, Saad Godil, Brucek Khailany, Robert Kirby, Haiguang Liao, S. Nath, Jonathan Raiman, Rajarshi Roy
Reinforcement learning (RL) has recently gained attention as an optimization algorithm for chip design. This method treats many chip design problems as Markov decision problems (MDPs), where design optimization objectives are converted into rewards given by the environment and design variables are converted into actions provided to the environment. Recent examples include applications of RL to macro placement and standard cell layout routing. We believe RL can be applied to nearly all aspects of VLSI implementation flows, since many VLSI implementation problems are NP-complete and state-of-the-art algorithms cannot be guaranteed to be optimal. With enough training data, it is possible to achieve better results with RL. In this paper, we review recent advances in applying RL to VLSI implementation problems such as cell layout, synthesis, placement, routing, and parameter tuning. We discuss the challenges of applying RL to VLSI implementation flows and propose future research directions for overcoming these challenges.
{"title":"Optimizing VLSI Implementation with Reinforcement Learning - ICCAD Special Session Paper","authors":"Haoxing Ren, Saad Godil, Brucek Khailany, Robert Kirby, Haiguang Liao, S. Nath, Jonathan Raiman, Rajarshi Roy","doi":"10.1109/ICCAD51958.2021.9643589","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643589","journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)"},"publicationDate":"2021-11-01","platform":"Semanticscholar","paperid":"126347718"}
Pub Date: 2021-11-01 DOI: 10.1109/ICCAD51958.2021.9643506
E. Trommer, Bernd Waschneck, Akash Kumar
Reducing the memory footprint of neural networks is a crucial prerequisite for deploying them in small and low-cost embedded devices. Network parameters can often be reduced significantly through pruning. We discuss how to best represent the indexing overhead of sparse networks for the coming generation of Single Instruction, Multiple Data (SIMD)-capable microcontrollers. From this, we develop Delta-Compressed Storage Row (dCSR), a storage format for sparse matrices that allows for both low overhead storage and fast inference on embedded systems with wide SIMD units. We demonstrate our method on an ARM Cortex-M55 MCU prototype with M-Profile Vector Extension (MVE). A comparison of memory consumption and throughput shows that our method achieves competitive compression ratios and increases throughput over dense methods by up to $2.9\times$ for sparse matrix-vector multiplication (SpMV)-based kernels and $1.06\times$ for sparse matrix-matrix multiplication (SpMM). This is accomplished through handling the generation of index information directly in the SIMD unit, leading to an increase in effective memory bandwidth.
{"title":"dCSR: A Memory-Efficient Sparse Matrix Representation for Parallel Neural Network Inference","authors":"E. Trommer, Bernd Waschneck, Akash Kumar","doi":"10.1109/ICCAD51958.2021.9643506","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643506","journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)"},"publicationDate":"2021-11-01","platform":"Semanticscholar","paperid":"122838414"}
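The dCSR abstract describes a row-compressed sparse format with delta-compressed index information. The sketch below shows the underlying idea in its simplest form: a CSR-like layout in which each stored column index is replaced by its distance from the previous nonzero in the same row, with column indices rebuilt by a running sum during SpMV. This is an assumed, simplified reading of "delta-compressed"; the actual dCSR bit layout and its SIMD index generation are not described in the abstract and are not modeled here.

```python
def to_delta_csr(dense):
    """Convert a dense matrix (list of rows) to a delta-encoded CSR layout.

    Returns (values, deltas, row_ptr). Within each row, deltas[k] is the
    column distance from the previous nonzero (or from column 0 for the
    first nonzero of the row).
    """
    values, deltas, row_ptr = [], [], [0]
    for row in dense:
        prev = 0
        for col, v in enumerate(row):
            if v != 0:
                values.append(v)
                deltas.append(col - prev)
                prev = col
        row_ptr.append(len(values))
    return values, deltas, row_ptr


def spmv(values, deltas, row_ptr, x):
    """Sparse matrix-vector product; column indices are rebuilt on the fly."""
    y = []
    for r in range(len(row_ptr) - 1):
        col, acc = 0, 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            col += deltas[k]          # running sum recovers the column index
            acc += values[k] * x[col]
        y.append(acc)
    return y


dense = [
    [0, 2, 0, 0, 3],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 4, 0],
]
values, deltas, row_ptr = to_delta_csr(dense)
print(deltas)                                         # [1, 3, 0, 3]
print(spmv(values, deltas, row_ptr, [1, 2, 3, 4, 5]))  # [19, 0, 17]
```

Because deltas are typically much smaller than absolute column indices, they can be stored in fewer bits, which is where the memory saving of such a scheme comes from.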