Title: Intelligent Prediction of Flash Lifetime via Online Domain Adaptation
Authors: Ruixiang Ma, Fei Wu, Changsheng Xie
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00081
Published in: 2021 IEEE 39th International Conference on Computer Design (ICCD)
To resolve the low generalization ability of flash lifetime models caused by small training samples, we propose a multiple-source ensemble online domain adaptation scheme, called MSE. MSE uses multiple offline source blocks to assist in establishing a lifetime prediction model for the online target block. MSE migrates information from these source blocks to the target block, effectively solving the pain point of insufficient samples for the target block. We simulate actual NAND flash usage scenarios on an FPGA-based test platform. Experimental results show that the prediction accuracy of MSE exceeds 0.91 using only a small number of samples from the target block. Therefore, MSE can be used to improve the space utilization of flash memory with low overhead.
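The abstract does not detail how the offline source models are combined, but the multi-source ensemble idea can be sketched as follows: each source block contributes a simple pre-trained model, and the ensemble weights are re-fit online from the few available target-block samples. This is a hypothetical illustration; the class name, the linear per-source models, and the exponential weighting are all assumptions, not the paper's method.

```python
import numpy as np

class MultiSourceEnsemble:
    """Toy multi-source ensemble: offline linear models, online weight adaptation."""

    def __init__(self, source_models):
        self.source_models = source_models                    # list of (w, b) per source block
        self.weights = np.ones(len(source_models)) / len(source_models)

    def _source_preds(self, x):
        return np.array([w * x + b for (w, b) in self.source_models])

    def update(self, x, y, lr=0.5):
        """Shift ensemble weight toward sources that predict the target sample well."""
        errs = np.abs(self._source_preds(x) - y)
        scores = np.exp(-errs)                                # smaller error -> larger score
        self.weights = (1 - lr) * self.weights + lr * scores / scores.sum()

    def predict(self, x):
        return float(self.weights @ self._source_preds(x))
```

A few target samples are enough to move the weights toward the sources that transfer best, which is the intuition behind needing only a small number of target-block samples.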
Title: Special Session: Approximate TinyML Systems: Full System Approximations for Extreme Energy-Efficiency in Intelligent Edge Devices
Authors: Arnab Raha, Soumendu Kumar Ghosh, Debabrata Mohapatra, D. Mathaikutty, Raymond Sung, C. Brick, V. Raghunathan
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00015
Approximate computing (AxC) has advanced from an emerging design paradigm to one of the most popular and effective methods of energy optimization for applications in computer vision, image/video processing, data mining, analytics, and search. The simultaneous rise of artificial intelligence (AI) has provided an additional thrust to the adoption of various AxC techniques in intelligent edge platforms, where energy efficiency is not only desirable but necessary. Despite the surge of interest in AxC, the adoption of approximate hardware has mostly been limited to a single component of the system (usually the processing subsystem), which often contributes only a fraction of the overall system-level power. A full-system approach to AxC enables us to extend approximations to other subsystems, such as the memory, sensor, and communication subsystems. This paper presents the foundational concepts of an approximate TinyML system that applies approximations synergistically to multiple subsystems in an edge inference device. These approximations are applied intelligently to significantly reduce energy while incurring a negligible loss in application-level quality. We demonstrate multiple versions of an approximate smart camera system that can execute state-of-the-art deep neural networks (DNNs) while consuming only a fraction of the total energy of a typical system.
Title: Fast and Low-Cost Mitigation of ReRAM Variability for Deep Learning Applications
Authors: Sugil Lee, M. Fouda, Jongeun Lee, A. Eltawil, Fadi J. Kurdahi
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00051
To overcome the programming variability (PV) of ReRAM crossbar arrays (RCAs), the most common method is program-verify, which, however, has high energy and latency overhead. In this paper, we propose a very fast and low-cost method to mitigate the effect of PV and other variability for RCA-based deep neural network (DNN) accelerators. Leveraging the statistical properties of DNN output, our method, called Online Batch-Norm Correction (OBNC), can compensate for the effect of programming and other variability on RCA output without using on-chip training or an iterative procedure, and is thus very fast. Moreover, our method requires neither a nonideality model nor a training dataset, making it very easy to apply. Our experimental results using ternary neural networks with binary and 4-bit activations demonstrate that OBNC can recover the baseline performance in many variability settings and that it outperforms a previously known method (VCAM) by large margins when the input distribution is asymmetric or the activation is multi-bit.
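The core intuition behind a batch-norm correction can be sketched numerically: variability shifts and scales the analog crossbar outputs, so batch-norm statistics fixed at training time no longer match, but re-deriving the mean and variance from a batch of actual outputs restores the intended normalized distribution without any retraining. The function below is an illustrative stand-in, not the paper's OBNC implementation.

```python
import numpy as np

def batchnorm_corrected(y_actual, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize crossbar outputs with statistics measured on the current batch."""
    mu = y_actual.mean(axis=0)                      # per-column mean of observed outputs
    var = y_actual.var(axis=0)                      # per-column variance of observed outputs
    return gamma * (y_actual - mu) / np.sqrt(var + eps) + beta
```

Because an affine disturbance (gain and offset drift) is absorbed by the recomputed statistics, the normalized output of a drifted array nearly matches that of an ideal one, which is why no nonideality model is needed.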
Title: DynPaC: Coarse-Grained, Dynamic, and Partially Reconfigurable Array for Streaming Applications
Authors: Cheng Tan, Tong Geng, Chenhao Xie, Nicolas Bohm Agostini, Jiajia Li, Ang Li, K. Barker, Antonino Tumeo
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00018
Coarse-grained reconfigurable arrays (CGRAs) provide higher flexibility than application-specific integrated circuits (ASICs) and higher efficiency than fine-grained reconfigurable devices such as field-programmable gate arrays (FPGAs). However, CGRAs are generally designed to support offloading of a single kernel. While their design, based on communicating functional units, appears to naturally suit streaming applications composed of multiple cooperating kernels, current approaches only statically partition the resources across kernels. Streaming applications, however, are often data-dependent: kernel execution times vary with the input data, which hurts the throughput of the entire pipeline when resources are statically allocated. Therefore, in this paper, we discuss the design of DynPaC, a coarse-grained, dynamically, and partially reconfigurable array for data-dependent streaming applications. We discuss the software and hardware components required to manage partial dynamic reconfiguration. We demonstrate that by supporting partial dynamic reconfiguration, we can obtain an average speedup of 1.44× for a representative set of applications with respect to static partitioning, with a limited area overhead (6.4% of the entire chip).
Title: Run-time Configurable Approximate Multiplier using Significance-Driven Logic Compression
Authors: Ibrahim Haddadi, Issa Qiqieh, R. Shafik, F. Xia, M. A. N. Al-hayanni, Alexandre Yakovlev
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00029
Designing energy-efficient hardware continues to be challenging due to arithmetic complexities. The problem is further exacerbated in systems powered by energy harvesters, as variable power levels can limit their computation capabilities. In this work, we propose a run-time configurable, adaptive approximation method for multiplication that can manage the energy/performance trade-offs, making it ideally suited to such systems. Central to our approach is a Significance-Driven Logic Compression (SDLC) multiplier architecture that can dynamically adjust the level of approximation depending on run-time power/accuracy constraints. The architecture can be configured to operate in exact mode (no approximation) or in progressively more aggressive approximation modes (i.e., 2- to 4-bit SDLC). Our method is implemented in both ASIC and FPGA flows. The implementation results indicate that our design incurs only a 2.3% silicon overhead on top of a traditional exact multiplier. We evaluate the efficiency of the proposed design through a number of case studies. We show that our method achieves image fidelity similar to existing approximate methods, without a delay penalty. Further, the dynamic approximation techniques yield up to 62.6% energy savings when processing an image with the 4-bit SDLC multiplier and 35% savings with the 2-bit SDLC multiplier. In addition, case-study results show that the proposed approach incurs negligible loss in output quality, with a worst-case PSNR of 30 dB when using the 4-bit SDLC multiplier.
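A rough software model can illustrate the configurable-approximation idea, though it is emphatically not the gate-level SDLC design: here, k-bit "compression" is modeled by OR-ing each group of k adjacent low-order bits of one operand into a single bit at the group's top significance, shrinking the number of partial products to sum, while k = 0 selects the exact mode. Function name, bit-placement policy, and width are illustrative assumptions.

```python
def sdlc_multiply(a: int, b: int, k: int = 0, width: int = 8) -> int:
    """Multiply with configurable approximation; k=0 (or 1) selects the exact mode."""
    if k <= 1:
        return a * b                          # exact mode, no compression
    compressed = 0
    for group in range(0, width, k):
        bits = (b >> group) & ((1 << k) - 1)  # k adjacent bits of operand b
        if bits:                              # OR-compress the group into one bit
            compressed |= 1 << (group + k - 1)
    return a * compressed
```

With 8-bit operands and k = 2, each 2-bit group contributes at most ±1 at its group weight, so the compressed operand deviates from the true one by at most 1 + 4 + 16 + 64 = 85, giving a bounded, significance-weighted error while the mode can be switched per multiplication at run time.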
Title: Stochastic Iterative Approximation: Software/hardware techniques for adjusting aggressiveness of approximation
Authors: Tomoki Nakamura, Kazutaka Tomida, Shouta Kouno, H. Irie, S. Sakai
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00023
Approximate computing (AC) reduces power consumption and increases execution speed in exchange for computational accuracy. By adjusting the accuracy of approximation at runtime to match the optimal quality for the application, which changes constantly depending on the user's cognitive ability and attention, AC achieves even higher efficiency. In this paper, we propose stochastic iterative approximation (SIA), which achieves dynamic and rapid control of the aggressiveness of approximation. SIA executes a single binary code with multiple levels of approximation aggressiveness that are dynamically adjusted. We propose a software implementation of SIA and hardware techniques to further improve its performance. We implement a compiler and a processor simulator for SIA as dynamic approximation modules for RISC-V and evaluate their performance. Simulation results on six benchmarks show an adjustable trade-off between output quality and execution efficiency depending on the aggressiveness of the approximation in a single binary run.
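One way to picture "a single binary with a run-time aggressiveness knob" is a loop whose body executes only with probability (1 - aggressiveness), reusing the previous result otherwise; the knob can change between calls without recompilation. This is a hypothetical sketch of the concept, not the paper's compiler/hardware mechanism.

```python
import random

def sia_map(data, f, aggressiveness=0.0, rng=None):
    """Apply f to each element, stochastically skipping iterations.

    aggressiveness=0.0 is exact; higher values skip more iterations,
    each skipped iteration reusing the last computed (stale) value.
    """
    rng = rng or random.Random(0)
    out, last = [], 0.0
    for x in data:
        if rng.random() >= aggressiveness:
            last = f(x)            # execute this iteration exactly
        out.append(last)           # otherwise keep the stale value
    return out
```

The same code path serves every quality level, so quality can track, say, the user's current attention without switching binaries.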
Title: Erasure-Coded Multi-Block Updates Based on Hybrid Writes and Common XORs First
Authors: Yujun Liu, Bing Wei, W. Jigang, Limin Xiao
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00079
Erasure codes are widely used in storage systems since they offer higher reliability at lower redundancy than data replication. However, erasure-coding-based storage systems have to perform multi-block updates for partial writes of an erasure coding group, which leads to a large number of XOR operations. This paper presents an efficient approach, named ECMU, for erasure-coded multi-block updates under stringent latency by scheduling update sequences. ECMU takes a hybrid of reconstructed-write and read-modify-write for the parity blocks of an erasure coding group: it dynamically selects the write scheme with fewer XORs for each parity block to be updated, in order to reduce the total number of XORs. ECMU iteratively retrieves the unmodified parity blocks to calculate the minimum XORs for each write scheme. Once the write schemes for all parity blocks to be updated are determined, ECMU performs the common XORs first and then reuses the computed results to further reduce the number of XORs. ECMU also caches a certain number of scheduling schemes to reduce how often scheduling schemes must be constructed. Experimental results on real-world trace replaying show that the number of XORs and the update time can be reduced significantly compared with the state of the art.
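The hybrid write-scheme choice at the heart of the approach can be sketched with a simplified cost model: read-modify-write costs roughly two XORs per updated data block (form the delta, fold it into the parity), while a reconstructed write re-XORs all k data blocks of the stripe (k - 1 XORs), and the cheaper scheme is picked per parity block. The constants below are a textbook simplification, not ECMU's exact cost model.

```python
def choose_write_scheme(k: int, updated: int):
    """Return (scheme, xor_count) for one parity block of a k-data-block stripe."""
    rmw_xors = 2 * updated      # delta per updated block + fold each delta into parity
    recon_xors = k - 1          # XOR all k data blocks together again
    if rmw_xors <= recon_xors:
        return "read-modify-write", rmw_xors
    return "reconstructed-write", recon_xors
```

For example, updating 1 of 10 data blocks favors read-modify-write, while updating 8 of 10 favors a reconstructed write; sharing the XOR subexpressions common to several parity blocks (as ECMU does) then cuts the total further.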
Title: An Efficient Non-Profiled Side-Channel Attack on the CRYSTALS-Dilithium Post-Quantum Signature
Authors: Zhaohui Chen, Emre Karabulut, Aydin Aysu, Yuan Ma, Jiwu Jing
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00094
Post-quantum digital signatures are a critical primitive of computer security in the era of quantum hegemony. As a finalist of the post-quantum cryptography standardization process, the CRYSTALS-Dilithium (Dilithium) signature scheme has had its theoretical security quantified to withstand classical and quantum cryptanalysis. However, its implementation instances inherently leak power side-channel information due to the physical characteristics of hardware. This work proposes an efficient non-profiled Correlation Power Analysis (CPA) strategy on Dilithium to recover the secret key by targeting the underlying polynomial multiplication arithmetic. We first develop a conservative scheme with a reduced key-guess space, which can extract a secret-key coefficient with 99.99% confidence using 157 power traces of the reference Dilithium implementation. However, this scheme suffers from the computational overhead caused by the large modulus in the Dilithium signature. To further accelerate the CPA run-time, we propose a fast two-stage scheme that selects a smaller search space and then resolves false positives. We finally construct a hybrid scheme that combines the advantages of both schemes. Real-world experiments on the power measurement data show that our hybrid scheme improves the attack's execution time by 7.77×.
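The CPA primitive the attack builds on is generic: for every key guess, correlate a hypothetical leakage model of an intermediate value against the measured traces and keep the guess with the highest Pearson correlation. The sketch below uses a synthetic Hamming-weight leakage and simulated traces as stand-ins; it illustrates CPA itself, not Dilithium's polynomial arithmetic or the paper's search-space reduction.

```python
import numpy as np

def cpa_recover(traces, known_inputs, guesses, leakage_model):
    """Return the key guess whose modeled leakage best correlates with the traces."""
    best, best_corr = None, -1.0
    for g in guesses:
        hypo = np.array([leakage_model(x, g) for x in known_inputs], dtype=float)
        corr = abs(np.corrcoef(hypo, traces)[0, 1])   # Pearson correlation
        if corr > best_corr:
            best, best_corr = g, corr
    return best
```

Because no profiling phase is needed (only known inputs and raw traces), this style of attack is "non-profiled"; the paper's contribution lies in making the guess space and modulus arithmetic tractable for Dilithium.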
Title: Security Analysis of State-of-the-art Scan Obfuscation Technique
Authors: Yogendra Sao, Subidh Ali
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00096
Scan-based Design for Testability (DfT) is the de facto standard for detecting manufacturing-related faults in the chip manufacturing industry. The observability and accessibility provided by DfT can be misused to launch an attack revealing the secret key embedded inside a crypto chip. Several countermeasures have been proposed to protect chips against scan-based attacks. Dynamic obfuscation of scan data prevents scan-based attacks by corrupting the scan data in the case of unauthorized access. In this paper, we perform a security analysis of the above state-of-the-art obfuscation technique to showcase its vulnerabilities. Exploiting these vulnerabilities, we propose a scan-based signature attack on the state-of-the-art obfuscation technique that applies a maximum of 4096 plaintexts and uses only 220 signatures, with a 100% success rate.
Title: T-TSP: Transient-Temperature Based Safe Power Budgeting in Multi-/Many-Core Processors
Authors: Sobhan Niknam, A. Pathania, A. Pimentel
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00083
Power budgeting techniques allow thermally safe operation in multi-/many-core processors while still allowing efficient exploitation of the available thermal headroom. Core-level power budgeting techniques like Thermal Safe Power (TSP) allow more efficient operation than chip-level power budgeting techniques like Thermal Design Power (TDP), since the finer granularity permits operation closer to the threshold temperature without thermal violations. State-of-the-art TSP bases its power budget calculations on the long-term steady-state temperature of the cores while ignoring trends in their short-term transient temperature. In this paper, we propose a new power budgeting technique called T-TSP (Transient-Temperature-based Safe Power) that bases its calculation on the current temperature of a core, a detail ignored by TSP. T-TSP provides a dynamic power budget to a core, which inversely correlates with the core's thermal headroom. Dynamic power budgeting with T-TSP allows cores to reach the threshold temperature faster than with TSP and to operate safely close to it in perpetuity. It therefore provides the same thermal guarantees as TSP but enables even more efficient exploitation of the thermal headroom. We integrate T-TSP with a state-of-the-art thermal interval simulation toolchain. Our detailed evaluations show that benchmarks execute up to 17.94% faster, and 8.37% faster on average, with T-TSP power budgeting instead of the state-of-the-art TSP. Finally, we make T-TSP publicly available in both its integrated and stand-alone forms.
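The contrast between a steady-state budget and a transient one can be sketched with a toy model: the budget is a function of the core's current temperature, shrinking toward the steady-state value as the core approaches the threshold. The linear form and the constant k below are illustrative assumptions, not the paper's RC thermal model.

```python
def transient_power_budget(t_current: float, t_threshold: float,
                           p_steady: float, k: float = 0.5) -> float:
    """Toy transient budget: extra power proportional to remaining headroom.

    At or above the threshold the budget falls back to the steady-state
    value p_steady (the TSP-style budget), preserving thermal safety.
    """
    headroom = max(0.0, t_threshold - t_current)   # degrees of thermal headroom left
    return p_steady + k * headroom
```

A cool core thus briefly receives a larger budget and heats up toward the threshold faster, while a core sitting at the threshold is held to the steady-state-safe value, matching the behavior described in the abstract.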