Deep learning (DL) models are enabling a significant paradigm shift in a diverse range of fields, including natural language processing, computer vision, and the design and automation of complex integrated circuits. While deep models, and optimizations based on them such as deep reinforcement learning (RL), demonstrate superior performance and a great capability for automated representation learning, earlier works have revealed the vulnerability of DL models to various attacks, including adversarial samples, model poisoning, and fault injection. On the one hand, these security threats could divert the behavior of the DL model and lead to incorrect decisions in critical tasks. On the other hand, the susceptibility of DL to potential attacks might thwart trustworthy technology transfer as well as reliable DL deployment. In this work, we investigate existing defense techniques that protect DL against the above-mentioned security threats. In particular, we review end-to-end defense schemes for robust deep learning in both centralized and federated learning settings. Our comprehensive taxonomy and horizontal comparisons reveal the important fact that defense strategies developed using DL/software/hardware co-design outperform their DL/software-only counterparts, and show how co-design achieves efficient, latency-optimized defenses for real-world applications. We believe our systemization of knowledge sheds light on the promising performance of hardware-software co-design of DL security methodologies and can guide the development of future defenses.
{"title":"Systemization of Knowledge: Robust Deep Learning using Hardware-software co-design in Centralized and Federated Settings","authors":"Ruisi Zhang, Shehzeen Samarah Hussain, Huili Chen, Mojan Javaheripi, F. Koushanfar","doi":"10.1145/3616868","DOIUrl":"https://doi.org/10.1145/3616868","url":null,"abstract":"Deep learning (DL) models are enabling a significant paradigm shift in a diverse range of fields, including natural language processing, computer vision, as well as the design and automation of complex integrated circuits. While the deep models – and optimizations-based on them, e.g., Deep Reinforcement Learning (RL) – demonstrate a superior performance and a great capability for automated representation learning, earlier works have revealed the vulnerability of DLs to various attacks. The vulnerabilities include adversarial samples, model poisoning, and fault injection attacks. On the one hand, these security threats could divert the behavior of the DL model and lead to incorrect decisions in critical tasks. On the other hand, the susceptibility of DLs to potential attacks might thwart trustworthy technology transfer as well as reliable DL deployment. In this work, we investigate the existing defense techniques to protect DLs against the above-mentioned security threats. Particularly, we review end-to-end defense schemes for robust deep learning in both centralized and federated learning settings. Our comprehensive taxonomy and horizontal comparisons reveal an important fact that defense strategies developed using DL/software/hardware co-design outperform the DL/software-only counterparts and show how they can achieve very efficient and latency-optimized defenses for real-world applications. We believe our systemization of knowledge sheds light on the promising performance of hardware-software co-design of DL security methodologies and can guide the development of future defenses.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2023-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43676970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Debabrata Senapati, Kousik Rajesh, C. Karfa, A. Sarkar
To meet application-specific performance demands, recent embedded platforms often employ intricate micro-architectural designs and very small feature sizes, leading to complex multi-million-gate chips. Such ultra-high gate densities often make these chips susceptible to unacceptable surges in core temperature. Temperature surges above a specific threshold may throttle processor performance, increase cooling costs, and reduce processor life expectancy. This work proposes a generic temperature management strategy which can easily be employed to adapt existing state-of-the-art task graph schedulers so that the schedules they generate never violate stipulated thermal bounds. The overall temperature-aware task graph scheduling problem is first formally modeled as a constraint optimization formulation, whose solution is shown to be prohibitively expensive in terms of computational overhead. Based on insights obtained through the formal model, a fast and efficient heuristic algorithm called TMDS has been designed. Experimental evaluation over diverse test case scenarios shows that TMDS delivers lower schedule lengths than the temperature-aware versions of four prominent makespan-minimizing algorithms, namely HEFT, PEFT, PPTS, and PSLS. Additionally, a case study with an adaptive cruise controller in automotive systems exhibits the applicability of TMDS in real-world settings.
{"title":"TMDS: A Temperature-aware Makespan Minimizing DAG Scheduler for Heterogeneous Distributed Systems","authors":"Debabrata Senapati, Kousik Rajesh, C. Karfa, A. Sarkar","doi":"10.1145/3616869","DOIUrl":"https://doi.org/10.1145/3616869","url":null,"abstract":"To meet application-specific performance demands, recent embedded platforms often involve the use of intricate micro-architectural designs and very small feature sizes leading to complex chips with multi-million gates. Such ultra-high gate densities often make these chips susceptible to inappropriate surges in core temperatures. Temperature surges above a specific threshold may throttle processor performance, enhance cooling costs and reduce processor life expectancy. This work proposes a generic temperature management strategy which can be easily employed to adapt existing state-of-the-art task graph schedulers so that schedules generated by them never violate stipulated thermal bounds. The overall temperature-aware task graph scheduling problem has first been formally modeled as a constraint optimization formulation whose solution is shown to be prohibitively expensive in terms of computational overheads. Based on insights obtained through the formal model, a new fast and efficient heuristic algorithm called TMDS, has been designed. Experimental evaluation over diverse test case scenarios shows that TMDS is able to deliver lower schedule lengths compared to the temperature-aware versions of four prominent makespan minimizing algorithms, namely HEFT, PEFT, PPTS, PSLS. Additionally, a case study with an adaptive cruise controller in automotive systems has been included to exhibit the applicability of TMDS in real-world settings.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2023-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47015221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yajing Chang, Yingjian Yan, Chunsheng Zhu, Yanjiang Liu
Post-quantum cryptography (PQC) has become the most promising approach to countering the threat that quantum computing poses to conventional public-key cryptographic schemes. Saber, a finalist in the third round of the PQC standardization procedure, presents an appealing option for embedded systems due to its high encryption efficiency and accessibility. However, side-channel attacks (SCAs) can easily reveal confidential information by analyzing physical leakage, and several works demonstrate that Saber is vulnerable to SCAs. In this work, a ciphertext comparison method for masking designs based on the bitslicing technique and a zero-test is proposed, which balances the trade-off between the performance and security of comparing two arrays. A mathematical description of the proposed ciphertext comparison method is provided, and its correctness and security are analyzed under the probe-isolating non-interference (PINI) notion. Moreover, a high-order masking approach based on the state of the art, including the hash functions, centered binomial sampling, masking conversions, and the proposed ciphertext comparison, is presented, using the bitslicing technique to improve throughput. As a proof of concept, the proposed implementation of Saber targets the ARM Cortex-M4. Performance results show that the run-time overhead factors of 1st-, 2nd-, and 3rd-order masking are 3.01x, 5.58x, and 8.68x, and the dynamic memory used is 17.4 kB, 24.0 kB, and 30.2 kB, respectively. The SCA-resilience evaluation illustrates that the first-order Test Vector Leakage Assessment (TVLA) reveals no exploitable leakage of the secret key with 100,000 traces.
{"title":"A High-Performance Masking Design Approach for Saber against High-order Side-channel Attack","authors":"Yajing Chang, Yingjian Yan, Chunsheng Zhu, Yanjiang Liu","doi":"10.1145/3611670","DOIUrl":"https://doi.org/10.1145/3611670","url":null,"abstract":"Post-quantum cryptography (PQC) has become the most promising cryptographic scheme against the threat of quantum computing to conventional public-key cryptographic schemes. Saber, as the finalist in the third round of the PQC standardization procedure, presents an appealing option for embedded systems due to its high encryption efficiency and accessibility. However, side-channel attack (SCA) can easily reveal confidential information by analyzing the physical manifestations, and several works demonstrate that Saber is vulnerable to SCAs. In this work, a ciphertext comparison method for masking design based on bitslicing technique and zerotest is proposed, which balances the trade-off between the performance and security of comparing two arrays. The mathematical description of the proposed ciphertext comparison method is provided, and its correctness and security metrics are analyzed under the concept of PINI. Moreover, a high-order masking approach based on the state-of-the-art, including the hash functions, centered binomial sampling, masking conversions, and proposed ciphertext comparison is presented, using the bitslicing technique to improve throughput. As a proof of concept, the proposed implementation of Saber is on the ARM Cortex-M4. The performance results show that the run-time overhead factor of 1st-, 2nd-, and 3rd-order masking is 3.01x, 5.58x, and 8.68x, and the dynamic memory used for 1st-, 2nd-, and 3rd-order masking is 17.4kB, 24.0kB, and 30.2kB, respectively. The SCA-resilience evaluation results illustrate that the first-order Test Vectors Leakage Assessment (TVLA) result fails to reveal the secret key with 100,000 traces.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2023-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46682768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Muhtadi Choudhury, Minyan Gao, Avinash L. Varna, Elad Peer, Domenic Forte
Since finite state machines (FSMs) regulate the control flow in circuits, a computing system's security might be breached by attacking the FSM. Physical attacks are especially worrisome because they can bypass software countermeasures. For example, an attacker can gain illegal access to the sensitive states of an FSM through fault injection, leading to privilege escalation and/or information leakage. Laser fault injection (LFI) provides one of the most effective attack vectors by enabling adversaries to precisely flip the states of individual flip-flops. Although conventional error correction/detection methodologies have been employed to improve FSM resiliency, their substantial overhead makes them unattractive to circuit designers. In our prior work, a novel decision-diagram-based FSM encoding scheme called PATRON was proposed to resist LFI according to attack parameters, e.g., the number of simultaneous faults. Although PATRON bested traditional encodings while keeping overhead minimal, it produced numerous candidate FSM encodings, requiring exhaustive manual effort to select the optimal one. In this article, we automatically select an optimal candidate by enhancing PATRON with linear programming (LP). First, we exploit the proportionality between dynamic power dissipation and switching activity in digital CMOS circuits: our LP objective minimizes the number of FSM bit switches per transition, thereby lowering switching activity and hence total power consumption. Second, additional LP constraints, incorporated alongside the original PATRON rules, systematically enforce bidirectional flips of at least two state elements per FSM transition. This bestows protection against different types of fault injection, which we capture with a new unidirectionality metric. Enhanced PATRON (EP) achieves superior security at lower power consumption on average compared to PATRON, error coding, and traditional FSM encoding on five popular benchmarks.
{"title":"Enhanced PATRON: Fault Injection and Power-aware FSM Encoding Through Linear Programming","authors":"Muhtadi Choudhury, Minyan Gao, Avinash L. Varna, Elad Peer, Domenic Forte","doi":"10.1145/3611669","DOIUrl":"https://doi.org/10.1145/3611669","url":null,"abstract":"Since finite state machines (FSMs) regulate the control flow in circuits, a computing system’s security might be breached by attacking the FSM. Physical attacks are especially worrisome because they can bypass software countermeasures. For example, an attacker can gain illegal access to the sensitive states of an FSM through fault injection, leading to privilege escalation and/or information leakage. Laser fault injection (LFI) provides one of the most effective attack vectors by enabling adversaries to precisely overturn single flip-flops states. Although conventional error correction/detection methodologies have been employed to improve FSM resiliency, their substantial overhead makes them unattractive to circuit designers. In our prior work, a novel decision diagram-based FSM encoding scheme called PATRON was proposed to resist LFI according to attack parameters, e.g., number of simultaneous faults. Although PATRON bested traditional encodings keeping overhead minimum, it provided numerous candidates for FSM designs requiring exhaustive and manual effort to select one optimum candidate. In this article, we automatically select an optimum candidate by enhancing PATRON using linear programming (LP). First, we exploit the proportionality between dynamic power dissipation and switching activity in digital CMOS circuits. Thus, our LP objective minimizes the number of FSM bit switches per transition, for comparatively lower switching activity and hence total power consumption. Second, additional LP constraints along with incorporating the original PATRON rules, systematically enforce bidirectionality to at least two state elements per FSM transition. This bestows protection against different types of fault injection, which we capture with a new unidirectional metric. Enhanced PATRON (EP) achieves superior security at lower power consumption in average compared to PATRON, error-coding, and traditional FSM encoding on five popular benchmarks.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2023-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45718877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A modified decoupled sense amplifier (MDSA) and a modified decoupled sense amplifier with an NMOS foot-switch (MDSANF) are proposed for improved sensing in differential SRAM under low-voltage operation at the 22 nm technology node. The MDSA and MDSANF both offer notable read-delay improvements over conventional voltage and current sense amplifiers. At an operating voltage of 0.8 V, the MDSA exhibited delay reductions of 28.6%, 41.79%, 37.74%, and 30.94% compared to the modified clamped sense amplifier (MCSA), double-tail sense amplifier (DTSA), modified hybrid sense amplifier (MHSA), and conventional latch-type sense amplifier (LSA), respectively. Similarly, the MDSANF demonstrated delay reductions of 26.13%, 39.78%, 35.58%, and 28.55% over the MCSA, DTSA, MHSA, and LSA, respectively. To validate the performance, the MDSA and MDSANF are evaluated under variations in delay and power consumption across supply voltages, process corners, input differential bit-line voltage (ΔVBL), bit-line capacitance (CBL), and the sizing of the decoupling transistors. Monte Carlo simulations were conducted to analyse the impact of threshold voltage variations on transistor mismatch, which leads to an increased occurrence of read failures and a decline in SRAM yield. A performance analysis of various voltage and current sense amplifiers is presented alongside the MDSA and MDSANF. Area is an important consideration when selecting a sensing scheme; accordingly, layouts of the MDSA and MDSANF were drawn conforming to the design rules, giving an estimated area of 0.297 μm² for the MDSA, whereas the MDSANF occupies 0.5192 μm².
{"title":"Modified Decoupled Sense Amplifier with Improved Sensing Speed for Low-Voltage Differential SRAM","authors":"Ayush, P. Mittal, Rajesh Rohilla","doi":"10.1145/3611672","DOIUrl":"https://doi.org/10.1145/3611672","url":null,"abstract":"A modified decoupled sense amplifier (MDSA) and modified decoupled sense amplifier with NMOS foot-switch is proposed for improved sensing in differential SRAM for low voltage operation at 22 nm technology node. The MDSA and MDSANF both offer notable improvements to read delay over conventional voltage and current sense amplifiers. At an operating voltage of 0.8 V, the MDSA exhibited a reduced delay of 28.6%, 41.79%, 37.74%, 30.94% compared to modified clamped sense amplifier (MCSA), double tail sense amplifier(DTSA), modified hybrid sense amplifier (MHSA) and conventional latch-type sense amplifier (LSA) respectively. Similarly, the MDSANF demonstrated a delay reduction of 26.13%, 39.78%, 35.58%, 28.55% over MCSA, DTSA, MHSA and LSA respectively. To validate the performance, the MDSA and MDSANF are evaluated using the variation in delay and power consumption across various supply voltages, process corners, input differential bit line voltage (ΔVBL), bit line capacitance CBL) and the sizing of decoupling transistors. Monte Carlo simulations were conducted to analyse the impact of voltage threshold variations on transistor mismatch which leads to an increased occurrence of read failures and a decline in SRAM yield. The performance analysis of various voltage and current sense amplifiers is presented along with MDSA and MDSANF. Area consideration for selection of sensing scheme is important and as such layout of MDSA and MDSANF was performed conforming to the design rules and estimated area for MDSA is 0.297 μm2 whereas MDSANF occupies 0.5192 μm2.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2023-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43297669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Over the past years, numerous studies have demonstrated the vulnerability of deep neural networks (DNNs) to small input noise that prevents correct classification. This motivated the formal analysis of DNNs to ensure they delineate acceptable behavior. However, when a DNN's behavior is unacceptable for the desired application, these qualitative approaches are ill-equipped to determine the precise degree to which the DNN behaves unacceptably. Towards this, we propose a novel quantitative DNN analysis framework, QuanDA, which not only checks whether the DNN delineates a certain behavior but also provides the estimated probability that the DNN delineates this particular behavior. Unlike the (few) available quantitative DNN analysis frameworks, QuanDA does not make any implicit assumptions about the probability distribution of the hidden nodes, which enables the framework to propagate close-to-real probability distributions of the hidden node values to each succeeding DNN layer. Furthermore, our framework leverages CUDA to parallelize the analysis, enabling a high-speed GPU implementation for fast analysis. The applicability of the framework is demonstrated using the ACAS Xu benchmark, providing reachability probability estimates for all network nodes. Moreover, this paper also discusses potential applications of QuanDA to the analysis of DNN safety properties.
{"title":"QuanDA: GPU Accelerated Quantitative Deep Neural Network Analysis","authors":"Mahum Naseer, Osman Hasan, Muhammad Shafique","doi":"10.1145/3611671","DOIUrl":"https://doi.org/10.1145/3611671","url":null,"abstract":"Over the past years, numerous studies demonstrated the vulnerability of deep neural networks (DNNs) to make correct classifications in the presence of small noise. This motivated the formal analysis of DNNs to ensure they delineate acceptable behavior. However, in case the DNN’s behavior is unacceptable for the desired application, these qualitative approaches are ill-equipped to determine the precise degree to which the DNN behaves unacceptably. Towards this, we propose a novel quantitative DNN analysis framework, QuanDA, which does not only check if the DNN delineates certain behavior, but also provides the estimated probability of the DNN to delineate this particular behavior. Unlike the (few) available quantitative DNN analysis frameworks, QuanDA does not use any implicit assumptions on the probability distribution of the hidden nodes, which enables the framework to propagate close to real probability distributions of the hidden node values to each proceeding DNN layer. Furthermore, our framework leverages CUDA to parallelize the analysis, enabling high-speed GPU implementation for fast analysis. The applicability of the framework is demonstrated using the ACAS Xu benchmark, to provide reachability probability estimates for all network nodes. Moreover, this paper also provides potential applications of QuanDA for the analysis of the DNN safety properties.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":"1 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42124580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stylianos I. Venieris, J. Fernández-Marqués, Nicholas D. Lane
The unprecedented accuracy of convolutional neural networks (CNNs) across a broad range of AI tasks has led to their widespread deployment in mobile and embedded settings. In the pursuit of high-performance and energy-efficient inference, significant research effort has been invested in the design of FPGA-based CNN accelerators. In this context, single computation engines constitute a popular design approach that enables the deployment of diverse models without the overhead of fabric reconfiguration. Nevertheless, this flexibility often comes with significantly degraded performance on memory-bound layers and resource underutilisation due to the suboptimal mapping of certain layers on the engine's fixed configuration. In this work, we investigate the implications for CNN engine design of a class of models that introduce a pre-convolution stage to decompress the weights at run time; we refer to these approaches as on-the-fly. This paper presents unzipFPGA, a novel CNN inference system that counteracts the limitations of existing CNN engines. The proposed framework comprises a novel CNN hardware architecture with a weights generator module that enables on-chip, on-the-fly generation of weights, alleviating the negative impact of limited bandwidth on memory-bound layers. We further enhance unzipFPGA with an automated hardware-aware methodology that tailors the weights generation mechanism to the target CNN-device pair, leading to an improved accuracy-performance balance. Finally, we introduce an input-selective processing element (PE) design that balances the load between PEs in suboptimally mapped layers. Quantitative evaluation shows that the proposed framework yields hardware designs that achieve an average 2.57× performance efficiency gain over highly optimised GPU designs under the same power constraints and up to 3.94× higher performance density than a diverse range of state-of-the-art FPGA-based CNN accelerators.
{"title":"Mitigating Memory Wall Effects in CNN Engines with On-the-Fly Weights Generation","authors":"Stylianos I. Venieris, J. Fernández-Marqués, Nicholas D. Lane","doi":"10.1145/3611673","DOIUrl":"https://doi.org/10.1145/3611673","url":null,"abstract":"The unprecedented accuracy of convolutional neural networks (CNNs) across a broad range of AI tasks has led to their widespread deployment in mobile and embedded settings. In a pursuit for high-performance and energy-efficient inference, significant research effort has been invested in the design of FPGA-based CNN accelerators. In this context, single computation engines constitute a popular design approach that enables the deployment of diverse models without the overhead of fabric reconfiguration. Nevertheless, this flexibility often comes with significantly degraded performance on memory-bound layers and resource underutilisation due to the suboptimal mapping of certain layers on the engine’s fixed configuration. In this work, we investigate the implications in terms of CNN engine design for a class of models that introduce a pre-convolution stage to decompress the weights at run time. We refer to these approaches as on-the-fly. This paper presents unzipFPGA, a novel CNN inference system that counteracts the limitations of existing CNN engines. The proposed framework comprises a novel CNN hardware architecture that introduces a weights generator module that enables the on-chip on-the-fly generation of weights, alleviating the negative impact of limited bandwidth on memory-bound layers. We further enhance unzipFPGA with an automated hardware-aware methodology that tailors the weights generation mechanism to the target CNN-device pair, leading to an improved accuracy-performance balance. Finally, we introduce an input selective processing element (PE) design that balances the load between PEs in suboptimally mapped layers. Quantitative evaluation shows that the proposed framework yields hardware designs that achieve an average of 2.57 × performance efficiency gain over highly optimised GPU designs for the same power constraints and up to 3.94 × higher performance density over a diverse range of state-of-the-art FPGA-based CNN accelerators.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2023-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42316070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the continuous shrinking of feature sizes, detection of lithography hotspots has been raised as one of the major concerns in Design for Manufacturability (DFM) of semiconductor processing. Hotspot detection, along with other DFM measures, trades turn-around time for IC manufacturing yield; thus, a simplified yet wide-coverage pattern definition is essential to the problem. Layout pattern clustering methods, which group geometrically similar layout clips into clusters, have been widely proposed to identify layout patterns efficiently. To minimize the cluster count for subsequent DFM processing, in this paper we propose a geometric-matching-based clip relocation technique to increase the opportunities for pattern clustering. In particular, we formulate the lower bound of the cluster count as a maximum-clique problem, and we prove that the clustering problem can be solved very efficiently from the maximum-clique result. Compared with the experimental results of state-of-the-art approaches on the ICCAD 2016 Contest benchmarks, the proposed method achieves optimal solutions for all benchmarks with very competitive run-time. To evaluate scalability, the ICCAD 2016 Contest benchmarks are extended and evaluated; experimental results on the extended benchmarks demonstrate that our method reduces the cluster number by 16.59% on average, while the run-time is 74.11% faster on large-scale benchmarks compared with previous works.
{"title":"A General Layout Pattern Clustering Using Geometric Matching Based Clip Relocation and Lower-Bound Aided Optimization","authors":"Xu He, Yao Wang, Zhiyong Fu, Yipei Wang, Yang Guo","doi":"10.1145/3610293","DOIUrl":"https://doi.org/10.1145/3610293","url":null,"abstract":"With the continuous shrinking of feature size, detection of lithography hotspots has been raised as one of the major concerns in Design-for-Manufacturability (DFM) of semiconductor processing. Hotspot detection, along with other DFM measures, trades off turn-around time for the yield of IC manufacturing, thus a simplified but wide-range-covered pattern definition is a key essential to the problem. Layout pattern clustering methods, which group geometrically similar layout clips into clusters, have been vastly proposed to identify layout patterns efficiently. To minimize the clustering number for subsequent DFM processing, in this paper, we propose a geometric-matching-based clip relocation technique to increase the opportunity of pattern clustering. Particularly, we formulate the lower-bound of the clustering number as a maximum-clique problem, and we have also proved that the clustering problem can be solved by the result of the maximum-clique very efficiently. Compared with the experimental results of the state-of-the-art approaches on ICCAD 2016 Contest benchmarks, the proposed method can achieve the optimal solutions for all benchmarks with very competitive run-time. To evaluate the scalability, the ICCAD 2016 Contest benchmarks are extended and evaluated. And experimental results on the extended benchmarks demonstrate that our method can reduce the cluster number by 16.59% on average, while the run-time is 74.11% faster on large-scale benchmarks compared with previous works.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2023-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42946077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In modern systems-on-chip (SoCs), several hardware protocols are used for communication and interaction among different modules. These protocols are complex and need to be implemented correctly for correct operation of the SoC; therefore, protocol verification has received significant attention. However, this verification is often limited to checking high-level properties on a protocol specification or an implementation. Verifying these properties directly on an implementation faces scalability challenges due to its size and design complexity. Further, even after some high-level properties are verified, there is no guarantee that an implementation fully complies with a given specification, even if the same properties have also been checked on the specification. We address these challenges and gaps by adding a layer of component specifications, one for each component in the protocol implementation, and by specifying and verifying the interactions at the interfaces between each pair of communicating components. We use the recently proposed formal model termed the Instruction-Level Abstraction (ILA) as a component specification, which includes an interface specification for the interactions involved in composing different components. The use of ILA models as component specifications allows us to decompose the complete verification task into two sub-tasks: checking that the composition of ILAs is sequentially equivalent to a verified formal protocol specification, and checking that the protocol implementation is a refinement of the ILA composition. The latter check requires that each component implementation is a refinement of its ILA specification, and it includes interface checks guaranteeing that components interact with each other as specified. We have applied the proposed ILA-based methodology for protocol verification to several third-party design case studies, including an AXI on-chip communication protocol, an off-chip communication protocol, and a cache coherence protocol. For each system, we successfully detected bugs in the implementation, and we show that full formal verification can be completed in reasonable time and effort.
{"title":"SoC Protocol Implementation Verification Using Instruction-Level Abstraction (ILA) Specifications","authors":"Huaixi Lu, Yue Xing, Aarti Gupta, S. Malik","doi":"10.1145/3610292","DOIUrl":"https://doi.org/10.1145/3610292","url":null,"abstract":"In modern systems-on-chips (SoCs) several hardware protocols are used for communication and interaction among different modules. These protocols are complex and need to be implemented correctly for correct operation of the SoC. Therefore, protocol verification has received significant attention. However, this verification is often limited to checking high-level properties on a protocol specification or an implementation. Verifying these properties directly on an implementation faces scalability challenges due to its size and design complexity. Further, even after some high-level properties are verified, there is no guarantee that an implementation fully complies with a given specification, even if the same properties have also been checked on the specification. We address these challenges and gaps by adding a layer of component specifications, one for each component in the protocol implementation, and specifying and verifying the interactions at the interfaces between each pair of communicating components. We use the recently proposed formal model termed the Instruction-Level-Abstraction (ILA) as a component specification, which includes an interface specification for the interactions in composing different components. The use of ILA models as component specifications allows us to decompose the complete verification task into two sub-tasks – checking that the composition of ILAs is sequentially equivalent to a verified formal protocol specification, and checking that the protocol implementation is a refinement of the ILA composition. This check requires that each component implementation is a refinement of its ILA specification, and includes interface checks guaranteeing that components interact with each other as specified. We have applied the proposed ILA-based methodology for protocol verification to several third-party design case studies. These include an AXI on-chip communication protocol, an off-chip communication protocol, and a cache coherence protocol. For each system, we successfully detected bugs in the implementation, and show that the full formal verification can be completed in reasonable time and effort.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2023-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49585589","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qingsong Peng, Jingchang Bian, Zhengfeng Huang, Senling Wang, Aibin Yan
True random number generators (TRNGs), an important component of security systems, have received considerable research attention. Previous research has provided a large number of TRNG solutions; however, they still fail to reach a good trade-off across the various performance metrics. This paper presents a shift-register metastability-based TRNG, implemented with compact reference units and comparison units. By forcing the D flip-flops in the shift registers into the metastable state, it alleviates the excessive hardware consumption of conventional metastability-based entropy sources. A new method of metastable randomness extraction is used to reduce the bias of the metastable output. The proposed TRNG is implemented on Xilinx Spartan-6 and Virtex-6 FPGAs, where it generates random sequences that pass the NIST SP800-22 and NIST SP800-90B tests and shows excellent robustness to voltage and temperature variations. The TRNG consumes only 3 FPGA slices yet achieves a high throughput of 25 Mbit/s. In comparison with state-of-the-art FPGA-compatible TRNGs, the proposed TRNG achieves the highest figure of merit (FoM), meaning that it significantly outperforms previous work in the trade-off among hardware resources, throughput, and operating frequency.
{"title":"A Compact TRNG design for FPGA based on the Metastability of RO-Driven Shift Registers","authors":"Qingsong Peng, Jingchang Bian, Zhengfeng Huang, Senling Wang, Aibin Yan","doi":"10.1145/3610295","DOIUrl":"https://doi.org/10.1145/3610295","url":null,"abstract":"True random number generators (TRNGs) as an important component of security systems have received a lot of attention for their related research. The previous researches have provided a large number of TRNG solutions, however, they still failed to reach an excellent trade-off in various performance metrics. This paper presents a shift-registers metastability-based TRNG, which is implemented by compact reference units and comparison units. By forcing the D flip-flops in the shift-registers into the metastable state, it optimizes the problem that the conventional metastability entropy sources consume excessive hardware resources. And new method of metastable randomness extraction is used to reduce the bias of metastable output. The proposed TRNG is implemented in Xilinx Spartan-6 and Virtex-6 FPGAs, which generate random sequences that pass the NIST SP800-22, NIST SP800-90B tests and show excellent robustness to voltage and temperature variations. This TRNG can consume only 3 slices of the FPGA, but it has a high throughput rate of 25Mbit/s. In comparison with state-of-the-art FPGA-compatible TRNGs, the proposed TRNG achieves the highest figure of merit FOM, which means that the proposed TRNG significantly outperforms previous researches in terms of hardware resources, throughput rate, and operating frequency trade-offs.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2023-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42235926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}