A Scheduling Framework for Decomposable Kernels on Energy Harvesting IoT Edge Nodes
Sethu Jose, J. Sampson, N. Vijaykrishnan, M. Kandemir
DOI: 10.1145/3526241.3530350

With the growing popularity of the Internet of Things (IoT), emerging applications demand that edge nodes provide higher computational capability and long operating times while requiring minimal maintenance. Ambient energy harvesting is a promising alternative to batteries, but only if the hardware and software are optimized for the intermittent nature of the power source. At the same time, many compute tasks in IoT workloads involve executing decomposable kernels that may have application-dependent accuracy requirements. In this work, we introduce a hardware-software co-optimization framework for such kernels that aims to achieve maximum forward progress while running on energy-harvesting Non-Volatile Processors (NVPs). Using this framework, we develop an FFT accelerator and a convolution accelerator that compute up to 3.2x faster, while consuming 5.4x less energy, than a baseline energy-harvesting system. With our accuracy-aware scheduling strategy, the approximate computing enabled by this framework delivers, on average, a 6.2x energy reduction and a 3.2x speedup while sacrificing at most 6.9% accuracy.

HDnn-PIM: Efficient in Memory Design of Hyperdimensional Computing with Feature Extraction
Arpan Dutta, Saransh Gupta, Behnam Khaleghi, Rishikanth Chandrasekaran, Weihong Xu, T. Simunic
DOI: 10.1145/3526241.3530331

Brain-inspired Hyperdimensional (HD) computing is a new machine learning approach that leverages simple and highly parallelizable operations. Unfortunately, none of the HD computing algorithms published to date has been able to accurately classify more complex image datasets, such as CIFAR100. In this work, we propose HDnn-PIM, which implements both feature extraction and HD-based classification for complex images using processing-in-memory. We compare HDnn-PIM with HD-only and CNN implementations on various image datasets. HDnn-PIM achieves 52.4% higher accuracy than pure HD computing. It also gains a 1.2% accuracy improvement over state-of-the-art CNNs, with a 3.63x smaller memory footprint and 1.53x fewer MAC operations. Furthermore, HDnn-PIM is 3.6x-223x faster than an RTX 3090 GPU and 3.7x more energy efficient than the state-of-the-art FloatPIM.

A Scalable, Deterministic Approach to Stochastic Computing
Y. Kiran, Marc D. Riedel
DOI: 10.1145/3526241.3530344

Stochastic computing is a paradigm in which logical operations are performed on randomly generated bit streams. Complex arithmetic operations can be performed by simple logic circuits with a much smaller area footprint than their conventional binary counterparts. However, the random or pseudorandom sources required to generate the bit streams are costly in terms of area and offset these gains. Also, due to the randomness, the computation is not precise, which limits the applicability of the paradigm. Most importantly, high latency is required to achieve reasonable accuracy. Recently, deterministic approaches to stochastic computing have been proposed, demonstrating that randomness is not a requirement: by structuring the computation deterministically, the result is exact and the latency is greatly reduced. However, despite being an improvement over conventional stochastic techniques, the latency still increases quadratically with each level of logic; beyond a few levels, it becomes unmanageable. In this paper, we present a method for approximating the results of the deterministic method with latency that increases only linearly with each level. The improvement comes at the cost of additional logic, but we demonstrate that the increase in area scales with √n, where n is the equivalent number of binary bits of precision. The new approach is general, efficient, composable, and applicable to all arithmetic operations performed with stochastic logic.
{"title":"Session details: Session 7A: Special Session - 3: Machine Learning-Aided Computer-Aided Design","authors":"Sai Manoj Pudukotai Dinakarrao","doi":"10.1145/3542694","DOIUrl":"https://doi.org/10.1145/3542694","url":null,"abstract":"","PeriodicalId":188228,"journal":{"name":"Proceedings of the Great Lakes Symposium on VLSI 2022","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125136345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 3A: VLSI Design + VLSI Circuits and Power Aware Design 1","authors":"S. Mohanty","doi":"10.1145/3542686","DOIUrl":"https://doi.org/10.1145/3542686","url":null,"abstract":"","PeriodicalId":188228,"journal":{"name":"Proceedings of the Great Lakes Symposium on VLSI 2022","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121750945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

On Attacking Locking SIB based IJTAG Architecture
G. Kumar, Anjum Riaz, Yamuna Prasad, Satyadev Ahlawat
DOI: 10.1145/3526241.3530370

The IEEE 1687 standard, which is commonly used for efficient access to on-chip instruments, can be exploited by an intruder and thus needs to be secured. One technique to alleviate the vulnerability of the 1687 network is a secure access protocol based on licensed access software, a chip ID, and locking SIBs. The licensed access software is used to gain control of the embedded instruments and operate them as required. In this paper, a successful attack on this secure access protocol scheme is mounted using various machine learning algorithms. It is demonstrated that machine learning algorithms can breach the secure communication between the access software and the board and thereby access the sensitive instruments. Furthermore, Random Forest significantly outperforms the other models at breaking the security.

Ran$Net: An Anti-Ransomware Methodology based on Cache Monitoring and Deep Learning
Xiang Zhang, Ziyue Zhang, Ruyi Ding, Gongye Cheng, A. Ding, Yunsi Fei
DOI: 10.1145/3526241.3530830

Ransomware has become a serious threat in cyberspace. Existing software pattern-based malware detectors are specific to certain ransomware and may not capture new variants. Recognizing a common essential behavior of ransomware - employing local cryptographic software for malicious encryption and therefore leaving footprints in the victim machine's caches - this work proposes an anti-ransomware methodology, Ran$Net, based on hardware activities. It consists of a passive cache monitor that logs suspicious cache activities and a follow-on non-profiled deep learning analysis strategy that retrieves the secret cryptographic key from the timing traces generated by the monitor. We implement the first tool of its kind to combat an open-source ransomware and successfully recover the secret key.

Two 0.8 V, Highly Reliable RHBD 10T and 12T SRAM Cells for Aerospace Applications
Aibin Yan, Zhihui He, Jing Xiang, Jie Cui, Yong Zhou, Zhengfeng Huang, P. Girard, X. Wen
DOI: 10.1145/3526241.3530312

Aggressive scaling of CMOS technologies requires attention to the reliability of circuits. This paper presents two highly reliable RHBD 10T and 12T SRAM cells that can protect against single-node upsets (SNUs) and double-node upsets (DNUs). The 10T cell mainly consists of two cross-coupled input-split inverters, and it can robustly retain stored values through a feedback mechanism among its internal nodes. It also has low area and power costs, since it uses only a few transistors. Based on the 10T cell, a 12T cell is proposed that uses four parallel access transistors. The 12T cell has a reduced read/write access time with the same soft-error tolerance as the 10T cell. Simulation results demonstrate that the proposed cells can recover from SNUs and from a subset of DNUs. Moreover, compared with state-of-the-art hardened SRAM cells, the proposed 10T cell saves, on average, 28.59% write access time, 55.83% read access time, and 4.46% power dissipation at the cost of 4.04% more silicon area.

LEAD: Logarithmic Exponent Approximate Divider For Image Quantization Application
Omkar G. Ratnaparkhi, M. Rao
DOI: 10.1145/3526241.3530323

Most modern VLSI applications are moving toward energy-efficient, high-speed computing solutions. Approximate computing is considered a suitable design methodology that satisfies current hardware and performance requirements without significantly compromising the outcome. Many arithmetic operations have been realized using approximate computing techniques, and many successful implementations are reported at the system level. However, dividers are rarely realized in hardware, and they deserve more attention given the surge in hardware implementations of neural networks. In this paper, a novel approximate divider is proposed that offers better accuracy and hardware efficiency than other reported dividers. The proposed divider builds on a logarithmic divider and approximates the exponent part to achieve the desired hardware characteristics. The proposed 8-bit and 16-bit divider designs were realized in 45-nm CMOS technology for different input and output data formats, including integer, fixed-point, and floating-point. The proposed divider was characterized for error and hardware metrics and compared with other dividers. The novel divider was validated on the K-means color quantization algorithm, showcasing improved quantization results.
{"title":"Session details: Session 7B: Microelectronic Systems Education","authors":"B. Skromme","doi":"10.1145/3542695","DOIUrl":"https://doi.org/10.1145/3542695","url":null,"abstract":"","PeriodicalId":188228,"journal":{"name":"Proceedings of the Great Lakes Symposium on VLSI 2022","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122436734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}