State-of-the-art convolutional neural network (CNN) accelerators are typically communication-dominated architectures. To reduce the energy consumption of data accesses while maintaining high performance, researchers have adopted large amounts of on-chip register resources and proposed various methods to concentrate communication on on-chip register accesses. As a result, the on-chip register accesses become the energy bottleneck. To further reduce the energy consumption, in this work we propose an in-SRAM accumulation architecture to replace the conventional register files and digital accumulators in the processing elements of CNN accelerators. Compared with existing in-SRAM computing approaches (which may not be targeted at CNN accelerators), the presented in-SRAM computing architecture not only realizes in-memory accumulation, but also solves the structural contention problem that occurs frequently when embedding in-memory architectures into CNN accelerators. HSPICE simulation results based on a 45nm technology demonstrate that with the proposed in-SRAM accumulator, the overall energy efficiency of a state-of-the-art communication-optimal CNN accelerator is increased by 29% on average.
{"title":"Energy-Efficient In-SRAM Accumulation for CMOS-based CNN Accelerators","authors":"Wanqian Li, Yinhe Han, Xiaoming Chen","doi":"10.1145/3526241.3530319","DOIUrl":"https://doi.org/10.1145/3526241.3530319","url":null,"abstract":"State-of-the-art convolutional neural network (CNN) accelerators are typically communication-dominated architectures. To reduce the energy consumption of data accesses while maintaining high performance, researchers have adopted large amounts of on-chip register resources and proposed various methods to concentrate communication on on-chip register accesses. As a result, the on-chip register accesses become the energy bottleneck. To further reduce the energy consumption, in this work we propose an in-SRAM accumulation architecture to replace the conventional register files and digital accumulators in the processing elements of CNN accelerators. Compared with existing in-SRAM computing approaches (which may not be targeted at CNN accelerators), the presented in-SRAM computing architecture not only realizes in-memory accumulation, but also solves the structural contention problem that occurs frequently when embedding in-memory architectures into CNN accelerators. 
HSPICE simulation results based on a 45nm technology demonstrate that with the proposed in-SRAM accumulator, the overall energy efficiency of a state-of-the-art communication-optimal CNN accelerator is increased by 29% on average.","PeriodicalId":188228,"journal":{"name":"Proceedings of the Great Lakes Symposium on VLSI 2022","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124096202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
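The register-access bottleneck described above can be made concrete with a back-of-the-envelope access-count model. The 2-accesses-per-MAC (read-modify-write of the partial sum) versus 1 in-place in-SRAM update ratio below is an illustrative assumption for this sketch, not a figure from the paper:

```python
def pe_access_counts(num_macs):
    """Toy access model for one processing element (PE).

    A conventional PE accumulator reads the partial sum from a register
    file and writes it back on every MAC (2 accesses), while an in-SRAM
    accumulator updates the stored partial sum in place (modeled here as
    1 access). The counts are illustrative, not measured energy values.
    """
    conventional = 2 * num_macs  # read partial sum + write it back
    in_sram = 1 * num_macs       # single in-place accumulate
    return conventional, in_sram

# A 3x3 convolution window needs 9 MACs per output element.
conv3x3 = pe_access_counts(9)
```

Under this simple model the in-SRAM design halves accumulator accesses; the paper's 29% system-level figure is smaller because other accesses (e.g., operand delivery) are unaffected.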
Kevin Immanuel Gubbi, Sayed Aresh Beheshti-Shirazi, T. Sheaves, Soheil Salehi, Sai Manoj Pudukotai Dinakarrao, S. Rafatirad, Avesta Sasan, H. Homayoun
An increase in demand for semiconductor ICs, recent advancements in machine learning, and the slowing down of Moore's law have all contributed to the increased interest in using Machine Learning (ML) to enhance Electronic Design Automation (EDA) and Computer-Aided Design (CAD) tools and processes. This paper provides a comprehensive survey of available EDA and CAD tools, methods, processes, and techniques for Integrated Circuits (ICs) that use machine learning algorithms. The ML-based EDA/CAD tools are classified based on the IC design steps. They are utilized in Synthesis, Physical Design (Floorplanning, Placement, Clock Tree Synthesis, Routing), IR drop analysis, Static Timing Analysis (STA), Design for Test (DFT), Power Delivery Network analysis, and Sign-off. The current landscape of ML-based VLSI-CAD tools, current trends, and future perspectives of ML in VLSI-CAD are also discussed.
{"title":"Survey of Machine Learning for Electronic Design Automation","authors":"Kevin Immanuel Gubbi, Sayed Aresh Beheshti-Shirazi, T. Sheaves, Soheil Salehi, Sai Manoj Pudukotai Dinakarrao, S. Rafatirad, Avesta Sasan, H. Homayoun","doi":"10.1145/3526241.3530834","DOIUrl":"https://doi.org/10.1145/3526241.3530834","url":null,"abstract":"An increase in demand for semiconductor ICs, recent advancements in machine learning, and the slowing down of Moore's law have all contributed to the increased interest in using Machine Learning (ML) to enhance Electronic Design Automation (EDA) and Computer-Aided Design (CAD) tools and processes. This paper provides a comprehensive survey of available EDA and CAD tools, methods, processes, and techniques for Integrated Circuits (ICs) that use machine learning algorithms. The ML-based EDA/CAD tools are classified based on the IC design steps. They are utilized in Synthesis, Physical Design (Floorplanning, Placement, Clock Tree Synthesis, Routing), IR drop analysis, Static Timing Analysis (STA), Design for Test (DFT), Power Delivery Network analysis, and Sign-off. The current landscape of ML-based VLSI-CAD tools, current trends, and future perspectives of ML in VLSI-CAD are also discussed.","PeriodicalId":188228,"journal":{"name":"Proceedings of the Great Lakes Symposium on VLSI 2022","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122289314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sreenitha Kasarapu, Sanket Shukla, Rakibul Hassan, Avesta Sasan, H. Homayoun, Sai Manoj Pudukotai Dinakarrao
One of the pivotal security threats for embedded computing systems is malicious software, a.k.a. malware. Owing to its efficiency and efficacy, Machine Learning (ML) has been widely adopted for malware detection in recent times. Despite being efficient, the existing techniques require frequently updating the ML model with newer benign and malware samples for training an efficient malware detector. Furthermore, such constraints limit the detection of emerging malware samples due to the lack of sufficient malware samples required for efficient training. To address such concerns, we introduce CAD-FSL, a code-aware data generation-based few-shot learning technique that generates multiple mutated samples of the limitedly seen malware for efficient malware detection. Loss minimization ensures that the generated samples closely mimic the limitedly seen malware, preserve malware functionality, and avoid impractical samples. Such synthetic malware is incorporated into the training set to formulate a model that can efficiently detect emerging malware despite limited (few-shot) exposure. The experimental results demonstrate that with the proposed "Code-Aware Data Generation" technique, we detect malware with 90% accuracy, approximately 9% higher than classifiers trained with only the limited available training data.
{"title":"CAD-FSL: Code-Aware Data Generation based Few-Shot Learning for Efficient Malware Detection","authors":"Sreenitha Kasarapu, Sanket Shukla, Rakibul Hassan, Avesta Sasan, H. Homayoun, Sai Manoj Pudukotai Dinakarrao","doi":"10.1145/3526241.3530825","DOIUrl":"https://doi.org/10.1145/3526241.3530825","url":null,"abstract":"One of the pivotal security threats for embedded computing systems is malicious software, a.k.a. malware. Owing to its efficiency and efficacy, Machine Learning (ML) has been widely adopted for malware detection in recent times. Despite being efficient, the existing techniques require frequently updating the ML model with newer benign and malware samples for training an efficient malware detector. Furthermore, such constraints limit the detection of emerging malware samples due to the lack of sufficient malware samples required for efficient training. To address such concerns, we introduce CAD-FSL, a code-aware data generation-based few-shot learning technique that generates multiple mutated samples of the limitedly seen malware for efficient malware detection. Loss minimization ensures that the generated samples closely mimic the limitedly seen malware, preserve malware functionality, and avoid impractical samples. Such synthetic malware is incorporated into the training set to formulate a model that can efficiently detect emerging malware despite limited (few-shot) exposure. 
The experimental results demonstrate that with the proposed \"Code-Aware Data Generation\" technique, we detect malware with 90% accuracy, approximately 9% higher than classifiers trained with only the limited available training data.","PeriodicalId":188228,"journal":{"name":"Proceedings of the Great Lakes Symposium on VLSI 2022","volume":"131 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122960543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
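The augmentation idea behind CAD-FSL can be sketched in a few lines: mutate the few available malware samples while keeping some region untouched as a stand-in for preserving functionality, then fold the mutants into the training set. The `mutate_sample`/`augment_training_set` names, the protected-prefix rule, and the byte-XOR mutation are all illustrative assumptions, not the paper's generator (which uses loss-minimizing, code-aware generation):

```python
import random

def mutate_sample(sample, n_mutations=4, preserve=8, seed=0):
    """Mutate a few bytes of a malware byte sequence, leaving the first
    `preserve` bytes untouched as a crude stand-in for keeping
    functional code regions intact. Illustrative only."""
    rng = random.Random(seed)
    mutated = bytearray(sample)
    for _ in range(n_mutations):
        i = rng.randrange(preserve, len(mutated))
        mutated[i] ^= rng.randrange(1, 256)  # nonzero XOR flips the byte
    return bytes(mutated)

def augment_training_set(few_shot_samples, copies=5):
    """Grow a few-shot training set with mutated variants of each sample."""
    augmented = list(few_shot_samples)
    for sample in few_shot_samples:
        for k in range(copies):
            augmented.append(mutate_sample(sample, seed=k))
    return augmented
```

A real code-aware generator would validate that mutants still execute; this sketch only shows where the synthetic samples enter the training pipeline.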
In this paper, we present RVVRadar, a framework to support the programmer over the four major steps of development, verification, measurement, and evaluation during the vectorization process of an algorithm. We demonstrate the advantages of RVVRadar for vectorization on several practically relevant algorithms. This includes, in particular, the widely used libpng library, where we vectorized all filter computations, resulting in speedups of up to 5.43. We made RVVRadar as well as all benchmarks (including the RVV-based libpng) open source.
{"title":"RVVRadar: A Framework for Supporting the Programmer in Vectorization for RISC-V","authors":"Lucas Klemmer, Manfred Schlägl, Daniel Große","doi":"10.1145/3526241.3530388","DOIUrl":"https://doi.org/10.1145/3526241.3530388","url":null,"abstract":"In this paper, we present RVVRadar, a framework to support the programmer over the four major steps of development, verification, measurement, and evaluation during the vectorization process of an algorithm. We demonstrate the advantages of RVVRadar for vectorization on several practically relevant algorithms. This includes, in particular, the widely used libpng library, where we vectorized all filter computations, resulting in speedups of up to 5.43. We made RVVRadar as well as all benchmarks (including the RVV-based libpng) open source.","PeriodicalId":188228,"journal":{"name":"Proceedings of the Great Lakes Symposium on VLSI 2022","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116685170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
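The libpng filter computations mentioned above are simple per-byte recurrences, which is what makes them interesting vectorization case studies. As a reference for the kind of kernel involved, here is a scalar Python model of PNG's Sub filter reconstruction (per the PNG specification, each reconstructed byte adds the reconstructed byte `bpp` positions earlier); the actual RVVRadar work is in C with RISC-V vector code, so this sketch only shows the serial data dependency:

```python
def png_sub_unfilter(row, bpp=3):
    """Scalar reference of PNG Sub-filter reconstruction:
    Recon(x) = Filt(x) + Recon(x - bpp), modulo 256.
    The loop-carried dependency on out[i - bpp] is the obstacle a
    vectorized implementation has to work around."""
    out = bytearray(row)
    for i in range(bpp, len(out)):
        out[i] = (out[i] + out[i - bpp]) & 0xFF
    return bytes(out)
```

For example, with `bpp=3` the first pixel passes through unchanged and each later byte accumulates the value three positions back.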
{"title":"Session details: Session 2A: Hardware Security","authors":"K. Gaj","doi":"10.1145/3542684","DOIUrl":"https://doi.org/10.1145/3542684","url":null,"abstract":"","PeriodicalId":188228,"journal":{"name":"Proceedings of the Great Lakes Symposium on VLSI 2022","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125585051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving the quality of a test set without storing additional tests is important when a higher fault coverage is required but test data storage is limited. Such an improvement can be achieved by using every test in a base test set to apply several different tests. In this paper, we consider the case where a base test set for basic faults is used for detecting more complex faults. Depending on the operators used for producing different applied tests, the number of options available for applied tests can be large. In such cases it is necessary to select a subset of applied tests from all the available ones. We develop algorithms for solving this problem in two scenarios. In the first scenario additional coverage is required for a small subset of faults associated with specific gates. In the second scenario additional coverage is required for the entire circuit. Experimental results are presented for benchmark circuits and logic blocks of the OpenSPARC T1 microprocessor to demonstrate the effectiveness of these algorithms.
{"title":"Algorithms for the Selection of Applied Tests when a Stored Test Produces Many Applied Tests","authors":"Hari Addepalli, I. Pomeranz","doi":"10.1145/3526241.3530359","DOIUrl":"https://doi.org/10.1145/3526241.3530359","url":null,"abstract":"Improving the quality of a test set without storing additional tests is important when a higher fault coverage is required but test data storage is limited. Such an improvement can be achieved by using every test in a base test set to apply several different tests. In this paper, we consider the case where a base test set for basic faults is used for detecting more complex faults. Depending on the operators used for producing different applied tests, the number of options available for applied tests can be large. In such cases it is necessary to select a subset of applied tests from all the available ones. We develop algorithms for solving this problem in two scenarios. In the first scenario additional coverage is required for a small subset of faults associated with specific gates. In the second scenario additional coverage is required for the entire circuit. Experimental results are presented for benchmark circuits and logic blocks of the OpenSPARC T1 microprocessor to demonstrate the effectiveness of these algorithms.","PeriodicalId":188228,"journal":{"name":"Proceedings of the Great Lakes Symposium on VLSI 2022","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128801453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
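The selection problem described above (choose a subset of applied tests covering as many target faults as possible) has the flavor of set cover. The greedy heuristic below is a standard illustration of that framing, not the paper's exact algorithms, and the `candidates` encoding is an assumption for this sketch:

```python
def select_applied_tests(candidates, target_faults):
    """Greedy selection sketch: `candidates` maps an applied-test name
    to the set of target faults it detects; repeatedly pick the test
    covering the most still-undetected faults. Returns the selected
    tests and any faults no candidate can detect."""
    remaining = set(target_faults)
    selected = []
    while remaining:
        best = max(candidates, key=lambda t: len(candidates[t] & remaining))
        gain = candidates[best] & remaining
        if not gain:
            break  # leftover faults are undetectable by any candidate
        selected.append(best)
        remaining -= gain
    return selected, remaining
```

With candidates {"t1": {1,2,3}, "t2": {3,4}, "t3": {4,5}} and faults 1-5, the greedy pass picks t1 and then t3, covering everything with two applied tests.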
This paper proposes a novel merged-accumulation-based approximate MAC (multiply-accumulate) unit, MEGA-MAC, for accelerating error-resilient applications. MEGA-MAC utilizes a novel rearrangement and compression strategy in the multiplication stage and a novel approximate "carry predicting adder" (CPA) in the accumulation stage. Addition and multiplication operations are merged, which reduces the delay. MEGA-MAC provides knobs to exercise a tradeoff between accuracy and resource overhead. Compared to the accurate MAC unit, MEGA-MAC(8,6) (i.e., a MEGA-MAC unit with a chunk size of 6 bits, operating on 8-bit input operands) reduces the power-delay-product (PDP) by 49.4%, while incurring a mean error percentage of only 4.2%. Compared to state-of-the-art approximate MAC units, MEGA-MAC achieves a better balance between resource savings and accuracy loss. The source code is available at https://sites.google.com/view/mega-mac-approximate-mac-unit/.
{"title":"MEGA-MAC: A Merged Accumulation based Approximate MAC Unit for Error Resilient Applications","authors":"Vishesh Mishra, Sparsh Mittal, Saurabh Singh, Divy Pandey, Rekha Singhal","doi":"10.1145/3526241.3530384","DOIUrl":"https://doi.org/10.1145/3526241.3530384","url":null,"abstract":"This paper proposes a novel merged-accumulation-based approximate MAC (multiply-accumulate) unit, MEGA-MAC, for accelerating error-resilient applications. MEGA-MAC utilizes a novel rearrangement and compression strategy in the multiplication stage and a novel approximate \"carry predicting adder\" (CPA) in the accumulation stage. Addition and multiplication operations are merged, which reduces the delay. MEGA-MAC provides knobs to exercise a tradeoff between accuracy and resource overhead. Compared to the accurate MAC unit, MEGA-MAC(8,6) (i.e., a MEGA-MAC unit with a chunk size of 6 bits, operating on 8-bit input operands) reduces the power-delay-product (PDP) by 49.4%, while incurring a mean error percentage of only 4.2%. Compared to state-of-the-art approximate MAC units, MEGA-MAC achieves a better balance between resource savings and accuracy loss. The source code is available at https://sites.google.com/view/mega-mac-approximate-mac-unit/.","PeriodicalId":188228,"journal":{"name":"Proceedings of the Great Lakes Symposium on VLSI 2022","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125822538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
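The "carry predicting" idea can be illustrated functionally: add the operands chunk by chunk, but predict the carry into each chunk from the previous chunk's operand MSBs instead of rippling it. The prediction rule below (carry iff both MSBs are 1) is an illustrative stand-in, not MEGA-MAC's actual CPA circuit:

```python
def approx_chunked_add(a, b, width=16, chunk=6):
    """Chunk-wise approximate addition: the carry into each chunk is
    *predicted* from the top bits of the previous chunk's operands
    rather than propagated, breaking the carry chain (and hence the
    delay) at the cost of occasional errors."""
    mask = (1 << chunk) - 1
    result, carry_in = 0, 0
    for shift in range(0, width, chunk):
        ca, cb = (a >> shift) & mask, (b >> shift) & mask
        result |= ((ca + cb + carry_in) & mask) << shift
        # predict the next chunk's carry-in from the chunk MSBs
        carry_in = (ca >> (chunk - 1)) & (cb >> (chunk - 1))
    return result
```

When both chunk MSBs are set the prediction is right (e.g., 32+32 yields 64), but an all-ones chunk plus 1 mispredicts (63+1 yields 0 instead of 64); error-resilient workloads tolerate such occasional misses.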
Network-on-chip (NoC) architectures rely on buffers to store flits to cope with contention for router resources during packet switching. Recently, reversible multi-function channel (RMC) buffers have been proposed to simultaneously reduce power and enable adaptive NoC buffering between adjacent routers. While adaptive buffering can improve NoC performance by maximizing buffer utilization, controlling the RMC buffer allocations requires a congestion-aware, scalable, and proactive policy. In this work, we present RACE, a novel reinforcement learning (RL) framework that utilizes better awareness of network congestion and a new reward metric ("falsefulls") to help guide the RL agent towards better RMC buffer control decisions. We show that RACE reduces NoC latency by up to 48.9%, and energy consumption by up to 47.1% against state-of-the-art NoC buffer control policies.
{"title":"RACE: A Reinforcement Learning Framework for Improved Adaptive Control of NoC Channel Buffers","authors":"Kamil Khan, S. Pasricha, R. Kim","doi":"10.1145/3526241.3530335","DOIUrl":"https://doi.org/10.1145/3526241.3530335","url":null,"abstract":"Network-on-chip (NoC) architectures rely on buffers to store flits to cope with contention for router resources during packet switching. Recently, reversible multi-function channel (RMC) buffers have been proposed to simultaneously reduce power and enable adaptive NoC buffering between adjacent routers. While adaptive buffering can improve NoC performance by maximizing buffer utilization, controlling the RMC buffer allocations requires a congestion-aware, scalable, and proactive policy. In this work, we present RACE, a novel reinforcement learning (RL) framework that utilizes better awareness of network congestion and a new reward metric (\"falsefulls\") to help guide the RL agent towards better RMC buffer control decisions. We show that RACE reduces NoC latency by up to 48.9%, and energy consumption by up to 47.1% against state-of-the-art NoC buffer control policies.","PeriodicalId":188228,"journal":{"name":"Proceedings of the Great Lakes Symposium on VLSI 2022","volume":"8 13","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120844317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
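To make the RL framing concrete, here is a single tabular Q-learning update of the kind a buffer controller could run: the state might encode local congestion, the action an RMC buffer allocation, and the reward might penalize "falsefull" events. The encodings, `N_ACTIONS`, and tabular form are assumptions for illustration; RACE's actual agent and reward shaping are described in the paper:

```python
N_ACTIONS = 4  # e.g., four ways to split an RMC buffer between ports

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q(state, action) toward the
    observed reward plus the discounted best value of the next state.
    `q` is a dict keyed by (state, action) pairs, defaulting to 0."""
    best_next = max(q.get((next_state, a), 0.0) for a in range(N_ACTIONS))
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]
```

Repeated updates of this form let the controller learn which buffer split avoids congestion without an explicit traffic model.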
Embedded computing systems are pervasive in our everyday lives, imparting digital intelligence to a variety of electronic platforms used in our vehicles, smart appliances, wearables, mobile devices, and computers. The need to train the next generation of embedded systems designers and engineers with relevant skills across hardware, software, and their co-design remains pressing today. This paper describes the evolution of embedded systems education over the past two decades and challenges facing the designers and instructors of embedded systems curricula in the 2020s. Reflections from over a decade of teaching the design of embedded computing systems are presented, with insights on strategies that show promise to address these challenges. Lastly, some important future directions in embedded systems education are highlighted.
{"title":"Embedded Systems Education in the 2020s: Challenges, Reflections, and Future Directions","authors":"S. Pasricha","doi":"10.1145/3526241.3530348","DOIUrl":"https://doi.org/10.1145/3526241.3530348","url":null,"abstract":"Embedded computing systems are pervasive in our everyday lives, imparting digital intelligence to a variety of electronic platforms used in our vehicles, smart appliances, wearables, mobile devices, and computers. The need to train the next generation of embedded systems designers and engineers with relevant skills across hardware, software, and their co-design remains pressing today. This paper describes the evolution of embedded systems education over the past two decades and challenges facing the designers and instructors of embedded systems curricula in the 2020s. Reflections from over a decade of teaching the design of embedded computing systems are presented, with insights on strategies that show promise to address these challenges. Lastly, some important future directions in embedded systems education are highlighted.","PeriodicalId":188228,"journal":{"name":"Proceedings of the Great Lakes Symposium on VLSI 2022","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121822876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parameter quantization in convolutional neural networks (CNNs) can help generate efficient models with lower memory footprint and computational complexity. However, homogeneous quantization can result in significant degradation of CNN model accuracy. In contrast, heterogeneous quantization represents a promising approach to realize compact, quantized models with higher inference accuracies. In this paper, we propose HQNNA, a CNN accelerator based on non-coherent silicon photonics that can accelerate both homogeneously quantized and heterogeneously quantized CNN models. Our analyses show that HQNNA achieves up to 73.8x better energy-per-bit and 159.5x better throughput-energy efficiency than state-of-the-art photonic CNN accelerators.
{"title":"A Silicon Photonic Accelerator for Convolutional Neural Networks with Heterogeneous Quantization","authors":"Febin P. Sunny, M. Nikdast, S. Pasricha","doi":"10.1145/3526241.3530364","DOIUrl":"https://doi.org/10.1145/3526241.3530364","url":null,"abstract":"Parameter quantization in convolutional neural networks (CNNs) can help generate efficient models with lower memory footprint and computational complexity. However, homogeneous quantization can result in significant degradation of CNN model accuracy. In contrast, heterogeneous quantization represents a promising approach to realize compact, quantized models with higher inference accuracies. In this paper, we propose HQNNA, a CNN accelerator based on non-coherent silicon photonics that can accelerate both homogeneously quantized and heterogeneously quantized CNN models. Our analyses show that HQNNA achieves up to 73.8x better energy-per-bit and 159.5x better throughput-energy efficiency than state-of-the-art photonic CNN accelerators.","PeriodicalId":188228,"journal":{"name":"Proceedings of the Great Lakes Symposium on VLSI 2022","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124393681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
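The homogeneous-versus-heterogeneous distinction is easy to see in a minimal sketch: a uniform symmetric quantizer applied with one bit width per layer instead of a single width for the whole model. The function names and the per-layer pairing are assumptions for illustration, not HQNNA's quantization scheme:

```python
def quantize(weights, bits):
    """Uniform symmetric quantization of a weight list to `bits` bits:
    snap each weight to the nearest multiple of a per-tensor scale."""
    qmax = (1 << (bits - 1)) - 1
    scale = (max(abs(w) for w in weights) / qmax) or 1.0  # guard all-zero
    return [round(w / scale) * scale for w in weights]

def heterogeneous_quantize(layers, bit_widths):
    """Heterogeneous quantization sketch: each layer gets its own bit
    width (e.g., more bits for accuracy-sensitive layers), in contrast
    to one homogeneous width for the entire model."""
    return [quantize(w, b) for w, b in zip(layers, bit_widths)]
```

With, say, 8 bits for the first layer and 4 for the second, the second layer's weights land on a much coarser grid, which is exactly the accuracy/footprint tradeoff the abstract describes.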