A half-select (HS) disturb-free 11T (HF11T) static random access memory (SRAM) cell with low power, improved stability, and high speed is presented in this paper. The proposed SRAM cell supports bit-interleaving architectures, which enhances soft-error immunity. The proposed HF11T cell is compared with other state-of-the-art designs: a single-ended HS-free 11T (SEHF11T), a shared-pass-gate 11T (SPG11T), a data-dependent stack PMOS switching 10T (DSPS10T), a single-ended half-select robust 12T (HSR12T), and an 11T SRAM cell. It exhibits 4.85×/9.19× lower read delay (TRA) and write delay (TWA), respectively, than the other considered SRAM cells, and achieves 1.07×/1.02× better read and write stability, respectively. It shows maximum reductions of 1.68×, 4.58×, 94.72×, 9×, and 145× in leakage power, read power, write power, read power-delay product (PDP), and write PDP, respectively, relative to the considered cells. In addition, the proposed HF11T cell achieves a 10.14× higher Ion/Ioff ratio than the compared cells. These improvements come with a trade-off: 1.13× higher TRA compared to SPG11T. Simulations are performed in Cadence Virtuoso with 45 nm CMOS technology at a supply voltage (VDD) of 0.6 V.
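For readers unfamiliar with the reported figure of merit, the power-delay product is simply the product of access power and the corresponding access delay. Below is a minimal sketch of how such PDP figures and improvement ratios can be computed from simulated power/delay numbers; the values used are placeholders, not data from the paper.

```python
# Minimal sketch: computing power-delay product (PDP) and improvement ratios
# from simulated access power and delay. All numbers below are illustrative
# placeholders, not results reported in the paper.

def pdp(power_w, delay_s):
    """Power-delay product in joules: energy spent per access."""
    return power_w * delay_s

# Hypothetical read power (W) and read delay (s) for two cells.
proposed = {"read_power": 1.2e-6, "read_delay": 0.9e-9}
baseline = {"read_power": 3.0e-6, "read_delay": 2.1e-9}

pdp_proposed = pdp(proposed["read_power"], proposed["read_delay"])
pdp_baseline = pdp(baseline["read_power"], baseline["read_delay"])

# Improvement ratio, reported in the abstract as "N x better/less".
print(f"read PDP improvement: {pdp_baseline / pdp_proposed:.2f}x")
```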
{"title":"A Single Bitline Highly Stable, Low Power With High Speed Half-Select Disturb Free 11T SRAM Cell","authors":"Lokesh Soni, Neeta Pandey","doi":"10.1145/3653675","DOIUrl":"https://doi.org/10.1145/3653675","url":null,"abstract":"<p>A half-select disturb-free 11T (HF11T) static random access memory (SRAM) cell with low power, better stability and high speed is presented in this paper. The proposed SRAM cell works well with bit-interleaving design, which enhances soft-error immunity. A comparison of the proposed HF11T cell with other cutting-edge designs such as single-ended HS free 11T (SEHF11T), a shared-pass-gate 11T (SPG11T), data-dependent stack PMOS switching 10T (DSPS10T), a single-ended half-selected robust 12T (HSR12T), and 11T SRAM cells has been made. It exhibits 4.85 × /9.19 × less read delay (<i>T<sub>RA</sub></i>) and write delay (<i>T<sub>WA</sub></i>), respectively as compared to other considered SRAM cells. It achieves 1.07 × /1.02 × better read and write stability, respectively than the considered SRAM cells. It shows maximum reduction of 1.68 × /4.58 × /94.72 × /9 × /145 × leakage power, read power, write power consumption, read power delay product (PDP) and write PDP respectively, than the considered SRAM cells. In addition, the proposed HF11T cell achieves 10.14 × higher <i>I<sub>on</sub></i>/<i>I<sub>off</sub></i> ratio than the other compared cells. These improvements come with a trade-off, resulting in 1.13 × more <i>T<sub>RA</sub></i> compared to SPG11T. The simulation is performed with Cadence Virtuoso 45nm CMOS technology at supply voltage (<i>V<sub>DD</sub></i>) of 0.6 V.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"59 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141505520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiseung Kim, Hyunsei Lee, Mohsen Imani, Yeseong Kim
Hyperdimensional computing (HDC) is a computing paradigm inspired by the mechanisms of human memory, characterizing data through high-dimensional vector representations known as hypervectors. Recent advancements in HDC have explored its potential as a learning model, leveraging its straightforward arithmetic and high efficiency. Traditional HDC frameworks are hampered by two primary static elements: randomly generated encoders and fixed learning rates. These static components significantly limit model adaptability and accuracy. The static, randomly generated encoders, while ensuring high-dimensional representation, fail to adapt to evolving data relationships, thereby constraining the model's ability to accurately capture and learn from complex patterns. Similarly, the fixed learning rate does not account for the varying needs of the training process over time, hindering efficient convergence and optimal performance. This paper introduces TrainableHD, a novel HDC framework that enables dynamic training of the randomly generated encoder based on feedback from the learning data, thereby addressing the static nature of conventional HDC encoders. TrainableHD also enhances training performance by incorporating adaptive optimizer algorithms in learning the hypervectors. We further refine TrainableHD with effective quantization to enhance efficiency, allowing the inference phase to execute on low-precision accelerators. Our evaluations demonstrate that TrainableHD significantly improves HDC accuracy by up to 27.99% (averaging 7.02%) without additional computational costs during inference, achieving a performance level comparable to state-of-the-art deep learning models. Furthermore, TrainableHD is optimized for execution speed and energy efficiency. Compared to deep learning on a low-power GPU platform such as the NVIDIA Jetson Xavier, TrainableHD is 56.4 times faster and 73 times more energy efficient. This efficiency is further augmented through the use of Encoder Interval Training (EIT) and adaptive optimizer algorithms, enhancing the training process without compromising the model's accuracy.
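As a rough illustration of the idea of a trainable encoder (a minimal sketch under simplifying assumptions, not the TrainableHD implementation): a random projection matrix maps input features to a hypervector, class hypervectors are learned, and the encoder itself receives gradient-style updates instead of staying fixed. All names, dimensions, and learning rates below are illustrative.

```python
import numpy as np

# Minimal sketch of HDC with a *trainable* encoder (illustrative only; not the
# TrainableHD implementation). A random projection encodes features into a
# D-dimensional hypervector; class hypervectors and the encoder are both updated.

rng = np.random.default_rng(0)
F, D, C = 64, 4096, 10                              # feature dim, hypervector dim, #classes
W_enc = rng.standard_normal((F, D)) / np.sqrt(F)    # conventionally fixed; here trainable
class_hv = np.zeros((C, D))

def encode(x):
    return np.tanh(x @ W_enc)       # smooth nonlinearity so updates can flow to W_enc

def train_step(x, y, lr_model=0.05, lr_enc=0.005):
    global W_enc
    h = encode(x)
    scores = class_hv @ h
    pred = int(np.argmax(scores))
    if pred != y:
        # Classic HDC update: reinforce the correct class, penalize the wrong one.
        class_hv[y]    += lr_model * h
        class_hv[pred] -= lr_model * h
        # Encoder update: push the encoding toward the correct class hypervector.
        err = class_hv[y] - class_hv[pred]      # direction that fixes the mistake
        grad_h = err * (1.0 - h**2)             # back-propagate through tanh
        W_enc += lr_enc * np.outer(x, grad_h)
    return pred

x, y = rng.standard_normal(F), 3
train_step(x, y)
```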
{"title":"Advancing Hyperdimensional Computing Based on Trainable Encoding and Adaptive Training for Efficient and Accurate Learning","authors":"Jiseung Kim, Hyunsei Lee, Mohsen Imani, Yeseong Kim","doi":"10.1145/3665891","DOIUrl":"https://doi.org/10.1145/3665891","url":null,"abstract":"<p>Hyperdimensional computing (HDC) is a computing paradigm inspired by the mechanisms of human memory, characterizing data through high-dimensional vector representations, known as hypervectors. Recent advancements in HDC have explored its potential as a learning model, leveraging its straightforward arithmetic and high efficiency. The traditional HDC frameworks are hampered by two primary static elements: randomly generated encoders and fixed learning rates. These static components significantly limit model adaptability and accuracy. The static, randomly generated encoders, while ensuring high-dimensional representation, fail to adapt to evolving data relationships, thereby constraining the model’s ability to accurately capture and learn from complex patterns. Similarly, the fixed nature of the learning rate does not account for the varying needs of the training process over time, hindering efficient convergence and optimal performance. This paper introduces (mathsf {TrainableHD} ), a novel HDC framework that enables dynamic training of the randomly generated encoder depending on the feedback of the learning data, thereby addressing the static nature of conventional HDC encoders. (mathsf {TrainableHD} ) also enhances the training performance by incorporating adaptive optimizer algorithms in learning the hypervectors. We further refine (mathsf {TrainableHD} ) with effective quantization to enhance efficiency, allowing the execution of the inference phase in low-precision accelerators. Our evaluations demonstrate that (mathsf {TrainableHD} ) significantly improves HDC accuracy by up to 27.99% (averaging 7.02%) without additional computational costs during inference, achieving a performance level comparable to state-of-the-art deep learning models. Furthermore, (mathsf {TrainableHD} ) is optimized for execution speed and energy efficiency. Compared to deep learning on a low-power GPU platform like NVIDIA Jetson Xavier, (mathsf {TrainableHD} ) is 56.4 times faster and 73 times more energy efficient. This efficiency is further augmented through the use of Encoder Interval Training (EIT) and adaptive optimizer algorithms, enhancing the training process without compromising the model’s accuracy.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"21 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141253460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hadi Esmaeilzadeh, Soroush Ghodrati, Andrew Kahng, Joon Kyung Kim, Sean Kinzer, Sayak Kundu, Rohan Mahapatra, Susmita Dey Manasi, Sachin Sapatnekar, Zhiang Wang, Ziqing Zeng
Parameterizable machine learning (ML) accelerators are the product of recent breakthroughs in ML. To fully enable their design space exploration (DSE), we propose a physical-design-driven, learning-based prediction framework for hardware-accelerated deep neural network (DNN) and non-DNN ML algorithms. It adopts a unified approach that combines power, performance, and area (PPA) analysis with frontend performance simulation, thereby achieving a realistic estimation of both backend PPA and system metrics such as runtime and energy. In addition, our framework includes a fully automated DSE technique, which optimizes backend and system metrics through an automated search of architectural and backend parameters. Experimental studies show that our approach consistently predicts backend PPA and system metrics with an average 7% or less prediction error for the ASIC implementation of two deep learning accelerator platforms, VTA and VeriGOOD-ML, in both a commercial 12 nm process and a research-oriented 45 nm process.
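The general pattern behind learning-based PPA prediction plus automated design space exploration can be sketched as a surrogate-model loop: train a predictor on a handful of fully evaluated design points, then search the parameter space using predictions in place of expensive backend runs. The sketch below is generic and not the paper's framework; the parameter names (pe_rows, pe_cols, buf_kb), the synthetic cost function, and the choice of a random-forest surrogate are all illustrative assumptions.

```python
import random
from sklearn.ensemble import RandomForestRegressor

# Generic sketch of surrogate-based design-space exploration (DSE):
# train a predictor on a few labelled (parameters -> cost) samples, then
# search the space using predictions instead of full physical-design runs.
# Illustrative only; not the paper's framework.

SPACE = {"pe_rows": [4, 8, 16], "pe_cols": [4, 8, 16], "buf_kb": [32, 64, 128]}

def sample_point():
    return {k: random.choice(v) for k, v in SPACE.items()}

def run_backend_flow(cfg):
    # Stand-in for a real synthesis/place-and-route + simulation flow that
    # returns an energy-delay-like cost. Purely synthetic for this sketch.
    area = cfg["pe_rows"] * cfg["pe_cols"] * 1.0 + cfg["buf_kb"] * 0.05
    runtime = 1e4 / (cfg["pe_rows"] * cfg["pe_cols"]) + 100.0 / cfg["buf_kb"]
    return area * runtime

def to_vec(cfg):
    return [cfg["pe_rows"], cfg["pe_cols"], cfg["buf_kb"]]

train = [sample_point() for _ in range(20)]
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit([to_vec(c) for c in train], [run_backend_flow(c) for c in train])

candidates = [sample_point() for _ in range(500)]
best = min(candidates, key=lambda c: model.predict([to_vec(c)])[0])
print("predicted-best configuration:", best)
```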
{"title":"An Open-Source ML-Based Full-Stack Optimization Framework for Machine Learning Accelerators","authors":"Hadi Esmaeilzadeh, Soroush Ghodrati, Andrew Kahng, Joon Kyung Kim, Sean Kinzer, Sayak Kundu, Rohan Mahapatra, Susmita Dey Manasi, Sachin Sapatnekar, Zhiang Wang, Ziqing Zeng","doi":"10.1145/3664652","DOIUrl":"https://doi.org/10.1145/3664652","url":null,"abstract":"<p>Parameterizable machine learning (ML) accelerators are the product of recent breakthroughs in ML. To fully enable their design space exploration (DSE), we propose a physical-design-driven, learning-based prediction framework for hardware-accelerated deep neural network (DNN) and non-DNN ML algorithms. It adopts a unified approach that combines power, performance, and area (PPA) analysis with frontend performance simulation, thereby achieving a realistic estimation of both backend PPA and system metrics such as runtime and energy. In addition, our framework includes a fully automated DSE technique, which optimizes backend and system metrics through an automated search of architectural and backend parameters. Experimental studies show that our approach consistently predicts backend PPA and system metrics with an average 7% or less prediction error for the ASIC implementation of two deep learning accelerator platforms, VTA and VeriGOOD-ML, in both a commercial 12 nm process and a research-oriented 45 nm process.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"2674 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140929019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chukwufumnanya Ogbogu, Biresh K. Joardar, Krishnendu Chakrabarty, Jana Doppa, Partha Pratim Pande
Graph Neural Networks (GNNs) have achieved remarkable accuracy in cognitive tasks such as predictive analytics on graph-structured data. Hence, they have become very popular in diverse real-world applications. However, GNN training with large real-world graph datasets in edge-computing scenarios is both memory- and compute-intensive. Traditional computing platforms such as CPUs and GPUs do not provide the energy efficiency and low latency required in edge intelligence applications due to their limited memory bandwidth. Resistive random-access memory (ReRAM)-based processing-in-memory (PIM) architectures have been proposed as suitable candidates for accelerating AI applications at the edge, including GNN training. However, ReRAM-based PIM architectures suffer from low reliability due to their limited endurance, and from low performance when used for GNN training in real-world scenarios with large graphs. In this work, we propose a learning-for-data-pruning framework, which leverages a trained Binary Graph Classifier (BGC) to reduce the size of the input data graph by pruning subgraphs early in the training process, thereby accelerating GNN training on ReRAM-based architectures. The proposed light-weight BGC model reduces the amount of redundant information in input graphs to speed up the overall training process, improves the reliability of the ReRAM-based PIM accelerator, and reduces the overall training cost. This enables fast, energy-efficient, and reliable GNN training on ReRAM-based architectures. Our experimental results demonstrate that using this learning-for-data-pruning framework, we can accelerate GNN training and improve the reliability of ReRAM-based PIM architectures by up to 1.6×, and reduce the overall training cost by 100× compared to state-of-the-art data pruning techniques.
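Conceptually, the data-pruning step amounts to scoring candidate subgraphs with a light-weight binary classifier and dropping the ones judged uninformative before the expensive GNN training begins. The sketch below illustrates that flow with a stand-in scorer; the feature set, the logistic scorer, and the 0.5 threshold are illustrative assumptions, not the paper's BGC.

```python
import numpy as np

# Illustrative sketch of classifier-guided data pruning for GNN training:
# a cheap binary scorer rates each candidate subgraph, and low-scoring
# subgraphs are dropped before training. The features and the logistic
# scorer below are placeholders, not the paper's trained BGC.

rng = np.random.default_rng(1)

def subgraph_features(sub):
    """Cheap summary features of a subgraph: node count, edge count, avg degree."""
    n, e = sub["num_nodes"], sub["num_edges"]
    return np.array([n, e, 2.0 * e / max(n, 1)])

def bgc_score(sub, w, b):
    """Stand-in binary graph classifier: logistic score in [0, 1]."""
    z = subgraph_features(sub) @ w + b
    return 1.0 / (1.0 + np.exp(-z))

subgraphs = [{"num_nodes": int(rng.integers(10, 500)),
              "num_edges": int(rng.integers(20, 5000))} for _ in range(100)]

w, b = np.array([0.001, 0.0005, 0.05]), -1.0    # pretend-trained parameters
keep = [s for s in subgraphs if bgc_score(s, w, b) > 0.5]

print(f"kept {len(keep)}/{len(subgraphs)} subgraphs for GNN training")
```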
{"title":"Data Pruning-enabled High Performance and Reliable Graph Neural Network Training on ReRAM-based Processing-in-Memory Accelerators","authors":"Chukwufumnanya Ogbogu, Biresh K. Joardar, Krishnendu Chakrabarty, Jana Doppa, Partha Pratim Pande","doi":"10.1145/3656171","DOIUrl":"https://doi.org/10.1145/3656171","url":null,"abstract":"<p>Graph Neural Networks (GNNs) have achieved remarkable accuracy in cognitive tasks such as predictive analytics on graph-structured data. Hence, they have become very popular in diverse real-world applications. However, GNN training with large real-world graph datasets in edge-computing scenarios is both memory- and compute-intensive. Traditional computing platforms such as CPUs and GPUs do not provide the energy efficiency and low latency required in edge intelligence applications due to their limited memory bandwidth. Resistive random-access memory (ReRAM)-based processing-in-memory (PIM) architectures have been proposed as suitable candidates for accelerating AI applications at the edge, including GNN training. However, ReRAM-based PIM architectures suffer from low reliability due to their limited endurance, and low performance when they are used for GNN training in real-world scenarios with large graphs. In this work, we propose a learning-for-data-pruning framework, which leverages a trained Binary Graph Classifier (BGC) to reduce the size of the input data graph by pruning subgraphs early in the training process to accelerate the GNN training process on ReRAM-based architectures. The proposed light-weight BGC model reduces the amount of redundant information in input graph(s) to speed up the overall training process, improves the reliability of the ReRAM-based PIM accelerator, and reduces the overall training cost. This enables fast, energy-efficient, and reliable GNN training on ReRAM-based architectures. Our experimental results demonstrate that using this learning for data pruning framework, we can accelerate GNN training and improve the reliability of ReRAM-based PIM architectures by up to 1.6 ×, and reduce the overall training cost by 100 × compared to state-of-the-art data pruning techniques.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"6 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140827622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rijoy Mukherjee, Archisman Ghosh, Rajat Subhra Chakraborty
Modern integrated circuit (IC) design incorporates proprietary computer-aided design (CAD) software and the integration of third-party hardware intellectual property (IP) cores. Subsequently, fabrication takes place in untrusted offshore foundries, which raises concerns regarding security and reliability. Hardware Trojans (HTs) are difficult-to-detect malicious modifications to an IC that constitute a major threat; if undetected prior to deployment, they can lead to catastrophic functional failures or the unauthorized leakage of confidential information. Apart from the risks posed by rogue human agents, recent studies have shown that high-level synthesis (HLS) CAD software can serve as a potent attack vector for inserting HTs. In this paper, we introduce a novel automated attack vector, which we term "HLS-IRT", that inserts HTs into the register-transfer-level (RTL) description of circuits generated during an HLS-based IC design flow by directly modifying the compiler-generated intermediate representation (IR) corresponding to the design. We demonstrate the attack on several hardware accelerators spanning different application domains, using a design and implementation flow based on the open-source Bambu HLS software and Xilinx FPGAs. Our results show that the resulting HTs are surreptitious and effective while incurring minimal design overhead. We also propose a novel detection scheme for HLS-IRT, since existing techniques are found to be inadequate for detecting the proposed HTs.
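As a very rough illustration of why the compiler-generated IR is a sensitive trust boundary (a generic sketch, not the HLS-IRT attack nor the paper's detection scheme): any silent change to the IR between compiler stages propagates into the generated RTL, so one simple sanity measure is to fingerprint the IR as it leaves each stage and compare against an independently recomputed fingerprint. The file names below are hypothetical.

```python
import hashlib
from pathlib import Path

# Generic sketch: fingerprint the compiler-generated IR at successive HLS
# stages so that an unexpected modification between stages is detectable.
# This illustrates the trust boundary only; it is not the paper's HLS-IRT
# attack or its proposed detection scheme. File names are hypothetical.

def ir_digest(path: Path) -> str:
    """SHA-256 over the textual IR, ignoring blank lines and comments."""
    lines = []
    for line in path.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith(";"):   # ';' starts a comment in LLVM-style IR
            lines.append(line)
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()

def check_stage(before: Path, after: Path) -> bool:
    """Return True if the IR is unchanged across a stage expected to be a no-op."""
    return ir_digest(before) == ir_digest(after)

# Hypothetical usage: the same IR captured before and after a stage that
# should not alter it for this design.
# if not check_stage(Path("kernel.pre.ll"), Path("kernel.post.ll")):
#     raise RuntimeError("IR modified unexpectedly between HLS stages")
```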
{"title":"HLS-IRT: Hardware Trojan Insertion through Modification of Intermediate Representation During High-Level Synthesis","authors":"Rijoy Mukherjee, Archisman Ghosh, Rajat Subhra Chakraborty","doi":"10.1145/3663477","DOIUrl":"https://doi.org/10.1145/3663477","url":null,"abstract":"<p>Modern integrated circuit (IC) design incorporates the usage of proprietary computer-aided design (CAD) software and integration of third-party hardware intellectual property (IP) cores. Subsequently, the fabrication process for the design takes place in untrustworthy offshore foundries that raises concerns regarding security and reliability. Hardware Trojans (HTs) are difficult to detect malicious modifications to IC that constitute a major threat, which if undetected prior to deployment, can lead to catastrophic functional failures or the unauthorized leakage of confidential information. Apart from the risks posed by rogue human agents, recent studies have shown that high-level synthesis (HLS) CAD software can serve as a potent attack vector for inserting Hardware Trojans (HTs). In this paper, we introduce a novel automated attack vector, which we term “HLS-IRT”, by inserting HT in the register transfer logic (RTL) description of circuits generated during a HLS based IC design flow, by directly modifying the compiler-generated intermediate representation (IR) corresponding to the design. We demonstrate the attack using a design and implementation flow based on the open-source <i>Bambu</i> HLS software and <i>Xilinx</i> FPGA, on several hardware accelerators spanning different application domains. Our results show that the resulting HTs are surreptitious and effective, while incurring minimal design overhead. We also propose a novel detection scheme for HLS-IRT, since existing techniques are found to be inadequate to detect the proposed HTs.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"45 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140827826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-quality passive devices are becoming increasingly important for the development of mobile devices and telecommunications, but obtaining such devices through simulation and analysis of electromagnetic (EM) behavior is time-consuming. To address this challenge, artificial neural network (ANN) models have emerged as an effective tool for modeling EM behavior, with NeuroTF being a representative example. However, these models are limited by the specific form of the transfer function, leading to discontinuity issues and high sensitivities. Moreover, previous methods have overlooked the physical relationship between distributed parameters, resulting in unacceptable numeric errors in the conversion results. To overcome these limitations, we propose two different neural network architectures: DeepOTF and ComplexTF. DeepOTF is a data-driven deep operator network that automatically learns feasible transfer functions for different geometric parameters. ComplexTF utilizes complex-valued neural networks to fit feasible transfer functions for different geometric parameters in the complex domain while maintaining causality and passivity. Our approach also employs an Equations-constraint Learning scheme to ensure strict consistency of predictions and a dynamic weighting strategy to balance optimization objectives. Experimental results demonstrate that our framework outperforms baseline methods, achieving up to 1700× higher accuracy.
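For context on what a "transfer function" model of EM behavior looks like, here is a minimal sketch of the common pole-residue form; a learned model (such as the networks described above) would map geometric parameters to the poles and residues, whereas here the coefficients are just illustrative constants. This is a generic sketch, not DeepOTF or ComplexTF.

```python
import numpy as np

# Minimal sketch of a pole-residue transfer function, a common parametric form
# for fitting EM frequency responses: H(s) = sum_k r_k / (s - p_k) + d,
# evaluated at s = j*2*pi*f. Coefficients below are illustrative only.

def transfer_function(freq_hz, poles, residues, d=0.0):
    s = 1j * 2 * np.pi * np.asarray(freq_hz)
    H = np.full_like(s, d, dtype=complex)
    for p, r in zip(poles, residues):
        H += r / (s - p)
    return H

# Illustrative complex-conjugate pole/residue pair (stable: negative real parts),
# which keeps the corresponding time-domain response real-valued.
poles    = np.array([-1e9 + 2j * np.pi * 5e9, -1e9 - 2j * np.pi * 5e9])
residues = np.array([ 5e8 + 1j * 1e9,          5e8 - 1j * 1e9])

freqs = np.linspace(1e9, 10e9, 5)
H = transfer_function(freqs, poles, residues)
print(np.abs(H))    # magnitude response at the sampled frequencies
```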
{"title":"DeepOTF: Learning Equations-constrained Prediction for Electromagnetic Behavior","authors":"Peng Xu, Siyuan XU, Tinghuan Chen, Guojin Chen, Tsungyi Ho, Bei Yu","doi":"10.1145/3663476","DOIUrl":"https://doi.org/10.1145/3663476","url":null,"abstract":"<p>High-quality passive devices are becoming increasingly important for the development of mobile devices and telecommunications, but obtaining such devices through simulation and analysis of electromagnetic (EM) behavior is time-consuming. To address this challenge, artificial neural network (ANN) models have emerged as an effective tool for modeling EM behavior, with NeuroTF being a representative example. However, these models are limited by the specific form of the transfer function, leading to discontinuity issues and high sensitivities. Moreover, previous methods have overlooked the physical relationship between distributed parameters, resulting in unacceptable numeric errors in the conversion results. To overcome these limitations, we propose two different neural network architectures: DeepOTF and ComplexTF. DeepOTF is a data-driven deep operator network for automatically learning feasible transfer functions for different geometric parameters. ComplexTF utilizes complex-valued neural networks to fit feasible transfer functions for different geometric parameters in the complex domain while maintaining causality and passivity. Our approach also employs an Equations-constraint Learning scheme to ensure the strict consistency of predictions and a dynamic weighting strategy to balance optimization objectives. The experimental results demonstrate that our framework shows superior performance than baseline methods, achieving up to 1700 × higher accuracy. </p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"53 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140827670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fault attacks pose a potent threat to modern cryptographic implementations, particularly those used in physically approachable embedded devices in IoT environments. Information security in such resource-constrained devices is ensured using lightweight ciphers, where combinational circuit implementations of the SBox are preferable to look-up tables (LUTs) as they are more efficient in terms of area, power, and memory requirements. Most existing fault analysis techniques focus on fault injection in memory cells and registers. Recently, a novel fault model and analysis technique, namely Semi-Permanent Stuck-At (SPSA) fault analysis, has been proposed to evaluate the security of ciphers with combinational circuit implementations of the substitution-layer elements, the SBoxes. In this work, we propose optimized techniques to recover the key from such implementations of lightweight ciphers using a minimal number of ciphertexts. Based on the proposed techniques, a key recovery attack on Elephant AEAD, a finalist of the NIST lightweight cryptography (NIST-LWC) standardization process, is presented. The proposed key recovery attack is validated on two versions of the Elephant cipher. The proposed fault analysis approach recovered the secret key within 85–240 ciphertexts, calculated over 1000 attack instances. To the best of our knowledge, this is the first work on fault analysis attacks on the Elephant scheme. Furthermore, an optimized combinational circuit implementation of the Spongent SBox (the SBox used in the Elephant cipher) is proposed, having a smaller gate count than the optimized implementation reported in the literature. The proposed fault analysis techniques are validated on primary and optimized versions of the Spongent SBox through Verilog simulations. Further, we pinpoint SPSA hotspots in the SBox architecture of the lightweight GIFT cipher. We observe that the GIFT SBox exhibits resilience to the proposed SPSA fault analysis technique under the single-fault adversarial model. However, eight SPSA fault patterns reduce the nonlinearity of the SBox to zero, rendering it vulnerable to linear cryptanalysis. In conclusion, SPSA faults may adversely affect the cryptographic properties of an SBox, thereby leading to trivial key recovery. The GIFT cipher is used as an example to focus on two aspects: i) its SBox construction is resilient to the proposed SPSA analysis, motivating the characterization of such constructions for SPSA resilience, and ii) an SBox that is resilient to the proposed SPSA analysis may still exhibit vulnerabilities to other classical analysis techniques when subjected to SPSA faults. Our work reports new fault analysis vulnerabilities in combinational circuit implementations of cryptographic protocols.
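Since the argument hinges on the nonlinearity of a 4-bit SBox dropping to zero under certain fault patterns, a small worked check of how nonlinearity is computed may help. This is the standard Walsh-spectrum calculation; the GIFT SBox values below are taken from the public GIFT specification and should be verified against it.

```python
# Nonlinearity of a 4-bit SBox via its Walsh spectrum:
# NL(S) = 2^(n-1) - max_{a, b != 0} |W(a, b)| / 2, where
# W(a, b) = sum_x (-1)^(b.S(x) XOR a.x).
# A nonlinearity of 0 means some component function is affine, which is what
# makes a faulted SBox vulnerable to linear cryptanalysis.

GIFT_SBOX = [0x1, 0xA, 0x4, 0xC, 0x6, 0xF, 0x3, 0x9,
             0x2, 0xD, 0xB, 0x7, 0x5, 0x0, 0x8, 0xE]   # per the GIFT specification

def dot(a, b):
    """Parity of the bitwise AND, i.e. the GF(2) inner product."""
    return bin(a & b).count("1") & 1

def nonlinearity(sbox, n=4):
    worst = 0
    for a in range(2 ** n):
        for b in range(1, 2 ** n):          # b = 0 gives the trivial component
            walsh = sum((-1) ** (dot(b, sbox[x]) ^ dot(a, x)) for x in range(2 ** n))
            worst = max(worst, abs(walsh))
    return 2 ** (n - 1) - worst // 2

print(nonlinearity(GIFT_SBOX))   # expected 4 for an optimal 4-bit SBox; 0 means an affine component exists
```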
{"title":"Semi-Permanent Stuck-At Fault injection attacks on Elephant and GIFT lightweight ciphers","authors":"Priyanka Joshi, Bodhisatwa Mazumdar","doi":"10.1145/3662734","DOIUrl":"https://doi.org/10.1145/3662734","url":null,"abstract":"<p>Fault attacks pose a potent threat to modern cryptographic implementations, particularly those used in physically approachable embedded devices in IoT environments. Information security in such resource-constrained devices is ensured using lightweight ciphers, where combinational circuit implementations of SBox are preferable over look-up tables (LUT) as they are more efficient regarding area, power, and memory requirements. Most existing fault analysis techniques focus on fault injection in memory cells and registers. Recently, a novel fault model and analysis technique, namely <i>Semi-Permanent Stuck-At</i> (SPSA) fault analysis, has been proposed to evaluate the security of ciphers with combinational circuit implementation of <i>Substitution layer</i> elements, SBox. In this work, we propose optimized techniques to recover the key in a minimum number of ciphertexts in such implementations of lightweight ciphers. Based on the proposed techniques, a key recovery attack on the NIST lightweight cryptography (NIST-LWC) standardization process finalist, <monospace>Elephant</monospace> AEAD, has been proposed. The proposed key recovery attack is validated on two versions of <monospace>Elephant</monospace> cipher. The proposed fault analysis approach recovered the secret key within 85 − 240 ciphertexts, calculated over 1000 attack instances. To the best of our knowledge, this is the first work on fault analysis attacks on the <monospace>Elephant</monospace> scheme. Furthermore, an optimized combinational circuit implementation of <i>Spongent</i> SBox (SBox used in <monospace>Elephant</monospace> cipher) is proposed, having a smaller gate count than the optimized implementation reported in the literature. The proposed fault analysis techniques are validated on primary and optimized versions of <i>Spongent</i> SBox through Verilog simulations. Further, we pinpoint SPSA hotspots in the lightweight <monospace>GIFT</monospace> cipher SBox architecture. We observe that <monospace>GIFT</monospace> SBox exhibits resilience towards the proposed SPSA fault analysis technique under the single fault adversarial model. However, <i>eight</i> SPSA fault patterns reduce the nonlinearity of the SBox to zero, rendering it vulnerable to linear cryptanalysis. Conclusively, SPSA faults may adversely affect the cryptographic properties of an SBox, thereby leading to trivial key recovery. The <monospace>GIFT</monospace> cipher is used as an example to focus on two aspects: i) its SBox construction is resilient to the proposed SPSA analysis and therefore characterizing such constructions for SPSA resilience and, ii) an SBox even though resilient to the proposed SPSA analysis, may exhibit vulnerabilities towards other classical analysis techniques when subjected to SPSA faults. 
Our work reports new vulnerabilities in fault analysis in the combinational circuit implementations of cryptographic protocols.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"1 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140810874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent years have seen an increasing trend toward designing AI accelerators together with the rest of the system, including CPUs and the memory hierarchy. This trend calls for high-quality simulators or analytical models that enable such co-exploration. Currently, the majority of such exploration is supported by AI accelerator analytical models. However, such models usually overlook the non-trivial impact of shared-resource congestion, non-ideal hardware utilization, and non-zero CPU scheduler overhead, which can only be modeled by cycle-level simulators. At the same time, most simulators with full-stack toolchains are proprietary to corporations, and the few open-source simulators suffer from either weak compilers or limited modeling scope. This work resolves these issues by proposing a compilation and simulation flow to run arbitrary Caffe neural network models on the NVIDIA Deep Learning Accelerator (NVDLA) with gem5, a cycle-level simulator, and by adding further building blocks, including scratchpad allocation, multi-accelerator scheduling, tensor-level prefetching mechanisms, and a DMA-aided embedded buffer, to map workloads to multiple NVDLAs. The proposed framework has been tested and verified on a set of convolutional neural networks, showcasing its capability to model complex buffer management strategies, scheduling policies, and hardware architectures. As a case study of this framework, we demonstrate the importance of adopting different buffering strategies for activation and weight tensors in AI accelerators to achieve remarkable speedup.
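To make the buffering trade-off concrete (a back-of-the-envelope model, not the framework's simulator): with single buffering each tile's DMA load and computation serialize, while with double (ping-pong) buffering the next tile's load overlaps the current tile's computation. The per-tile latencies below are made-up numbers.

```python
# Back-of-the-envelope latency model contrasting single vs. double (ping-pong)
# buffering of tiles in an accelerator scratchpad. Per-tile latencies are
# made-up numbers; a cycle-level simulator like the proposed framework is what
# captures the real contention and scheduler effects.

def single_buffer_latency(load_cycles, compute_cycles, num_tiles):
    # Load and compute serialize for every tile.
    return num_tiles * (load_cycles + compute_cycles)

def double_buffer_latency(load_cycles, compute_cycles, num_tiles):
    # First load is exposed; afterwards load(i+1) overlaps compute(i).
    steady = (num_tiles - 1) * max(load_cycles, compute_cycles)
    return load_cycles + steady + compute_cycles

load, compute, tiles = 1200, 1500, 64
t_single = single_buffer_latency(load, compute, tiles)
t_double = double_buffer_latency(load, compute, tiles)
print(f"single: {t_single} cycles, double: {t_double} cycles, "
      f"speedup: {t_single / t_double:.2f}x")
```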
{"title":"gem5-NVDLA: A Simulation Framework for Compiling, Scheduling and Architecture Evaluation on AI System-on-Chips","authors":"Chengtao Lai, Wei Zhang","doi":"10.1145/3661997","DOIUrl":"https://doi.org/10.1145/3661997","url":null,"abstract":"<p>Recent years have seen an increasing trend in designing AI accelerators together with the rest of the system, including CPUs and memory hierarchy. This trend calls for high-quality simulators or analytical models that enable such kind of co-exploration. Currently, the majority of such exploration is supported by AI accelerator analytical models. But such models usually overlook the non-trivial impact of congestion of shared resources, non-ideal hardware utilization and non-zero CPU scheduler overhead, which could only be modeled by cycle-level simulators. However, most simulators with full-stack toolchains are proprietary to corporations, and the few open-source simulators are suffering from either weak compilers or limited space of modeling. This framework resolves these issues by proposing a compilation and simulation flow to run arbitrary Caffe neural network models on the NVIDIA Deep Learning Accelerator (NVDLA) with gem5, a cycle-level simulator, and by adding more building blocks including scratchpad allocation, multi-accelerator scheduling, tensor-level prefetching mechanisms and a DMA-aided embedded buffer to map workload to multiple NVDLAs. The proposed framework has been tested and verified on a set of convolution neural networks, showcasing the capability of modeling complex buffer management strategies, scheduling policies and hardware architectures. As a case study of this framework, we demonstrate the importance of adopting different buffering strategies for activation and weight tensors in AI accelerators to acquire remarkable speedup.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"31 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140810870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jian-De Li, Sying-Jyan Wang, Katherine Shu-Min Li, Tsung-Yi Ho
Paper-based digital microfluidic biochip (PB-DMFB) technology provides a promising solution for many biochemical applications. However, the PB-DMFB manufacturing process may suffer from security threats; for example, a Trojan insertion attack may alter the functionality of a PB-DMFB. To ensure correct functionality, we propose a watermarking scheme that hides information in the PB-DMFB layout, allowing users to check design integrity and authenticate the source of the PB-DMFB design. The proposed method therefore serves as a countermeasure against Trojan insertion attacks in addition to providing proof of authorship.
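As a generic illustration of layout watermarking (a hypothetical construction, not the paper's PB-DMFB scheme): a keyed hash of the design identity yields a bit string, each bit is embedded in an otherwise free layout decision with two functionally equivalent options, and verification recomputes the bits and compares them against the layout.

```python
import hashlib, hmac

# Generic sketch of watermarking via constrained design choices (illustrative
# only; this is not the paper's PB-DMFB watermarking scheme). A keyed hash of
# the design identity produces watermark bits, and each bit selects one of two
# functionally equivalent layout options.

def watermark_bits(design_id: str, key: bytes, n_bits: int):
    digest = hmac.new(key, design_id.encode(), hashlib.sha256).digest()
    bits = [(byte >> i) & 1 for byte in digest for i in range(8)]
    return bits[:n_bits]

def embed(free_choices, bits):
    """Each free choice is a pair of equivalent options; the bit picks one."""
    return [options[b] for options, b in zip(free_choices, bits)]

def verify(layout_choices, free_choices, design_id, key):
    expected = watermark_bits(design_id, key, len(free_choices))
    return all(layout_choices[i] == free_choices[i][b]
               for i, b in enumerate(expected))

# Hypothetical usage with abstract "option A"/"option B" layout decisions.
free = [("routeA", "routeB"), ("padL", "padR"), ("up", "down")]
key = b"author-secret-key"
layout = embed(free, watermark_bits("biochip-v1", key, len(free)))
print(verify(layout, free, "biochip-v1", key))   # True for an untampered layout
```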
{"title":"Enhanced Watermarking for Paper-Based Digital Microfluidic Biochips","authors":"Jian-De Li, Sying-Jyan Wang, Katherine Shu-Min Li, Tsung-Yi Ho","doi":"10.1145/3661309","DOIUrl":"https://doi.org/10.1145/3661309","url":null,"abstract":"<p>Paper-based digital microfluidic biochip (PB-DMFB) technology provides a promising solution to many biochemical applications. However, the PB-DMFB manufacturing process may suffer from potential security threats. For example, Trojan insertion attack may affect the functionality of PB-DMFBs. To ensure the correct functionality of PB-DMFBs, we propose a watermarking scheme to hide information in the PB-DMFB layout, which allows users to check design integrity and authenticate the source of the PB-DMFB design. As a result, the proposed method serves as a countermeasure against Trojan insertion attacks in addition to proof of authorship.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"1 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140811286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Davide Baroffio, Federico Reghenzani, William Fornaciari
Software-Implemented Hardware Fault Tolerance (SIHFT) is a modern approach for tackling random hardware faults in dependable systems using software-only solutions. This work extends ASPIS, an automatic compiler-based SIHFT hardening tool, with novel protection mechanisms and overhead-reduction techniques, and provides an extensive analysis of its compliance with the non-trivial workload of the open-source real-time operating system FreeRTOS. A thorough experimental fault-injection campaign on an STM32 board shows that the system achieves remarkably high tolerance to single-event upsets, and a comparison of the implemented SIHFT mechanisms summarises the trade-off between the overhead introduced and the detection capabilities of the various solutions.
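For readers unfamiliar with SIHFT-style hardening, the sketch below illustrates the classic duplication-with-comparison idea that such compiler passes automate; it is a conceptual sketch only, not ASPIS's actual mechanisms, and uses Python purely for readability (real hardening duplicates registers and instructions at the machine level).

```python
# Conceptual illustration of the duplication-with-comparison idea behind SIHFT
# (e.g., EDDI-style schemes) that hardening compiler passes insert
# automatically at the instruction level. Python is used only for readability;
# real hardening duplicates registers/variables so a single-event upset that
# corrupts one copy is caught when the two copies are compared.

class FaultDetected(Exception):
    pass

class DupVar:
    """A value kept in two copies that are updated and checked in lockstep."""
    def __init__(self, value):
        self.a = value          # primary copy
        self.b = value          # shadow copy

    def apply(self, fn):
        self.a = fn(self.a)
        self.b = fn(self.b)     # same operation repeated on the shadow copy

    def check(self):
        if self.a != self.b:    # a bit flip in either copy triggers detection
            raise FaultDetected(f"{self.a} != {self.b}")
        return self.a

x = DupVar(41)
x.apply(lambda v: v + 1)
print(x.check())                # 42; a corrupted copy would raise instead
```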
{"title":"Enhanced Compiler Technology for Software-based Hardware Fault Detection","authors":"Davide Baroffio, Federico Reghenzani, William Fornaciari","doi":"10.1145/3660524","DOIUrl":"https://doi.org/10.1145/3660524","url":null,"abstract":"<p>Software-Implemented Hardware Fault Tolerance (SIHFT) is a modern approach for tackling random hardware faults of dependable systems employing solely software solutions. This work extends an automatic compiler-based SIHFT hardening tool called ASPIS, enhancing it with novel protection mechanisms and overhead-reduction techniques, also providing an extensive analysis of its compliance with the non-trivial workload of the open-source Real-Time Operating System FreeRTOS. A thorough experimental fault-injection campaign on an STM32 board shows how the system achieves remarkably high tolerance to single-event upsets and a comparison between the SIHFT mechanisms implemented summarises the trade-off between the overhead introduced and the detection capabilities of the various solutions.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"211 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140634581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}