{"title":"Session details: Session 1B: Emerging Computing and Post-CMOS Technologies","authors":"Deliang Fan","doi":"10.1145/3542683","DOIUrl":"https://doi.org/10.1145/3542683","url":null,"abstract":"","PeriodicalId":188228,"journal":{"name":"Proceedings of the Great Lakes Symposium on VLSI 2022","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124152128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Resiliency in Connected Vehicle Applications: Challenges and Approaches for Security Validation
Authors: Srivalli Boddupalli, Richard Owoputi, Chengwei Duan, T. Choudhury, Sandip Ray
DOI: https://doi.org/10.1145/3526241.3530832
Abstract: With the proliferation of connectivity and smart computing in vehicles, a new attack surface has emerged that targets subversion of vehicular applications by compromising sensors and communication. A unique feature of these attacks is that they no longer require intrusion into the hardware and software components of the victim vehicle; rather, it is possible to subvert the application by providing wrong or misleading information. We consider the problem of making vehicular systems resilient against these threats. A promising approach is to adapt resiliency solutions based on anomaly detection through machine learning. We discuss challenges in making such an approach viable. In particular, we consider the problem of validating such resiliency architectures, the factors that make the problem challenging, and our approaches to address those challenges.

Title: SRS-Mig: Selection and Run-time Scheduling of Page Migration for Improved Response Time in Hybrid PCM-DRAM Memories
Authors: N. Aswathy, Sreesiddesh Bhavanasi, A. Sarkar, H. Kapoor
DOI: https://doi.org/10.1145/3526241.3530327
Abstract: Hybrid memory systems that combine DRAM with Non-Volatile Memory (NVM) can exploit both the scalability of NVM and the performance of DRAM. Placing write-intensive pages in Phase Change Memory (PCM) incurs high write latencies, so migrating such pages from PCM to DRAM helps reduce application execution time and memory response time. Existing techniques mainly focus on selecting a migration candidate and migrating it immediately once it becomes eligible; this direct migration can hamper the response time of regular memory accesses. In this paper, we both identify migration candidates and schedule when they are migrated to DRAM. To realize this, we propose Selection and Run-time Scheduling of page Migration (SRS-Mig), a frame-based scheduling approach for migrations and read/write requests. SRS-Mig reduces migration overhead and ensures that migrated pages see future accesses, yielding improved execution time and memory response time for applications. Experimental evaluation shows a 30% improvement in execution time, a 26% improvement in memory response time, and considerable energy savings over existing baseline techniques.

Title: P3S: A High Accuracy Probabilistic Prediction Processing System for CNN Acceleration
Authors: Hang Xiao, Haobo Xu, Xiaoming Chen, Yujie Wang, Yinhe Han
DOI: https://doi.org/10.1145/3526241.3530322
Abstract: Convolutional Neural Networks (CNNs) achieve state-of-the-art performance on perception tasks at the cost of billions of computational operations. In this paper, we propose a probabilistic prediction processing system, dubbed P3S, that eliminates redundant compute-heavy convolution operations by predicting whether output activations are zero-valued. Exploiting the Gaussian-like distributions of activations and weights in CNNs, P3S computes a partial convolution over only the values whose magnitudes exceed a standard-deviation-related threshold in order to predict ineffectual output activations. If an output activation is predicted to be zero, P3S skips the remaining convolution and sets the output to zero in advance. P3S reduces computations by 67% within a 0.2% accuracy loss and requires no retraining or fine-tuning of the CNNs. We further implement a P3S-based CNN accelerator that achieves a 2.02x speedup and 2.23x higher energy efficiency on average over a traditional accelerator. Compared with a state-of-the-art prediction-based accelerator that incurs 3% accuracy degradation, P3S yields up to a 1.49x speedup and 1.69x higher energy efficiency.

Title: DAReS: Deflection Aware Rerouting between Subnetworks in Bufferless On-Chip Networks
Authors: Rose George Kunthara, Rekha K. James, Simi Zerine Sleeba, John Jose
DOI: https://doi.org/10.1145/3526241.3530332
Abstract: Network on Chip (NoC) is an effective interconnection structure used in the design of efficient Tiled Chip Multi-Processor (TCMP) systems, as it improves system performance manifold. Bufferless NoC has emerged as a popular design choice to address the area and energy concerns associated with buffered NoC systems. For low to medium injection rates, bufferless and buffered routers show similar network performance; as the network load rises, the performance of bufferless router designs deteriorates due to increased deflections. This paper proposes a subnetwork-based bufferless design, DAReS, that minimizes deflections by redirecting contending flits in one subnetwork to unoccupied productive ports of the other subnetwork without incurring any extra cycle delay. Our evaluations show that the proposed design improves network performance by reducing deflection rate and power dissipation, and achieves better throughput than a state-of-the-art bufferless router.

{"title":"Session details: Session 4B: VLSI for Machine Learning and Artifical Intelligence 2","authors":"J. Hu","doi":"10.1145/3542689","DOIUrl":"https://doi.org/10.1145/3542689","url":null,"abstract":"","PeriodicalId":188228,"journal":{"name":"Proceedings of the Great Lakes Symposium on VLSI 2022","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132294673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Reducing Power Consumption using Approximate Encoding for CNN Accelerators at the Edge
Authors: Tongxin Yang, Tomoaki Ukezono, Toshinori Sato
DOI: https://doi.org/10.1145/3526241.3530315
Abstract: Convolutional neural networks (CNNs) have demonstrated significant potential across a range of applications due to their superior accuracy. Edge inference, in which inference is performed locally in embedded systems with limited power resources, is studied for its energy efficiency. This study proposes an approximate encoder that decreases switching activity and thereby reduces power consumption in CNN accelerators at the edge. The proposed encoder performs approximate encoding based on pattern matching between a comparison pattern and the current data; software determines the value of the comparison pattern and whether the encoder is enabled. Experiments on the CIFAR-10 dataset with LeNet5 show that, depending on the comparison pattern, the proposed encoder reduces the power consumption of a CNN accelerator by 21.5% with a 1.59% degradation in inference quality.

Title: A Novel 2T2R CR-based TCAM Design for High-speed and Energy-efficient Applications
Authors: Kangqiang Pan, Amr M. S. Tosson, Ningxuan Wang, N. Zhou, Lan Wei
DOI: https://doi.org/10.1145/3526241.3530336
Abstract: A 2T2R current-race (CR) based ternary content addressable memory (TCAM) design is proposed using resistive random-access memory (RRAM) technology. The design adopts a match-line (ML) booster in the sense amplifier to improve search speed and tolerance to RRAM switching variations. An SR-latch cascading scheme is presented to further improve speed and energy efficiency for large TCAM arrays. Additionally, a same-clock-phase cascading scheme, which places the evaluation phase of all stages in the same clock phase, is proposed to reduce latency in the cascaded structure. With the ML booster, our 64-bit one-stage design matches the speed and energy consumption of the best-performing TCAM designs reported for other emerging non-volatile memories (eNVM). Our 128-bit two-stage design also offers speed and energy comparable to SRAM-based TCAM designs while being significantly more compact (90% size reduction) and non-volatile.

Title: An Oracle-Less Machine-Learning Attack against Lookup-Table-based Logic Locking
Authors: Kaveh Shamsi, Guangwei Zhao
DOI: https://doi.org/10.1145/3526241.3530377
Abstract: Replacing cuts in a circuit with configurable lookup tables (LUTs) that are securely programmed post-fabrication is a logic locking technique that can hide the complete design from an untrusted foundry. In this paper, we study the security of basic LUT-based locking against a set of oracle-less attacks, i.e., attacks that do not have access to a functional oracle of the original circuit. Specifically, we perform cut graph/truth-table prediction using deep and graph neural networks with various data encoding strategies. Overall, we observe that naive LUT-based locking with small cuts of 2 or 3 inputs may be vulnerable to oracle-less approximation, whereas such attacks become less feasible for larger cut sizes. We open-source our software for this attack.

Title: A Tutorial-style Single-cycle Fast Fourier Transform Processor
Authors: Alec Vercruysse, M. W. Miller, Joshua Brake, D. Harris
DOI: https://doi.org/10.1145/3526241.3530329
Abstract: The Fast Fourier Transform (FFT) is one of the most important algorithms of the past century. It computes the discrete Fourier transform with a computational complexity of O(N log2 N), and its structure provides an excellent example of the power of custom hardware accelerators. However, current tutorial-style papers implementing the FFT are not well suited for undergraduate students: they are either too vague on important implementation details or use a pipelined architecture that can obscure the fundamental concepts of the accelerator. This paper presents a simple, single-cycle FFT hardware accelerator that can be implemented on an FPGA, accompanied by source code for easily simulating and synthesizing the design, available at https://doi.org/10.5281/zenodo.6219524.
