2021 IEEE 39th International Conference on Computer Design (ICCD): Latest Papers

Intelligent Prediction of Flash Lifetime via Online Domain Adaptation
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00081
Ruixiang Ma, Fei Wu, Changsheng Xie
To resolve the low generalization ability of the flash lifetime model caused by a small training sample, we propose a multiple source ensemble online domain adaptation scheme, called MSE. MSE uses multiple offline source blocks to assist in establishing a lifetime prediction model for the online target block. MSE migrates information from these blocks to the target block, effectively addressing the shortage of samples for the target block. We simulate the actual use scenarios of NAND flash on an FPGA-based test platform. Experimental results show that the prediction accuracy of MSE exceeds 0.91 using only a small number of samples of the target block. Therefore, MSE can be used to improve the space utilization of the flash with low overhead.
Citations: 1
Special Session: Approximate TinyML Systems: Full System Approximations for Extreme Energy-Efficiency in Intelligent Edge Devices
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00015
Arnab Raha, Soumendu Kumar Ghosh, Debabrata Mohapatra, D. Mathaikutty, Raymond Sung, C. Brick, V. Raghunathan
Approximate computing (AxC) has advanced from being an emerging design paradigm to becoming one of the most popular and effective methods of energy optimization for applications in the domains of computer vision, image/video processing, data mining, analytics, and search. The simultaneous rise of artificial intelligence (AI) has provided an additional thrust to the adoption of various AxC techniques in intelligent edge platforms where energy-efficiency is not only desirable but necessary. In spite of the big rise in interest for AxC, the adoption of approximate hardware has mostly been limited to only one component of the system (usually the processing subsystem) which often contributes only a fraction of the overall system-level power. A full system approach to AxC enables us to extend approximations to other subsystems, such as the memory, sensor, and communications subsystems. This paper presents the foundational concepts of an approximate TinyML system that applies approximations synergistically to multiple subsystems in an edge inference device. These approximations are applied intelligently to significantly reduce energy while incurring a negligible loss in application-level quality. We demonstrate multiple versions of an approximate smart camera system that can execute state-of-the-art deep neural networks (DNNs) while consuming only a fraction of the total energy in a typical system.
Citations: 3
Fast and Low-Cost Mitigation of ReRAM Variability for Deep Learning Applications
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00051
Sugil Lee, M. Fouda, Jongeun Lee, A. Eltawil, Fadi J. Kurdahi
To overcome the programming variability (PV) of ReRAM crossbar arrays (RCAs), the most common method is program-verify, which, however, has high energy and latency overhead. In this paper we propose a very fast and low-cost method to mitigate the effect of PV and other variability for RCA-based DNN (Deep Neural Network) accelerators. Leveraging the statistical properties of DNN output, our method, called Online Batch-Norm Correction (OBNC), can compensate for the effect of programming and other variability on RCA output without using on-chip training or an iterative procedure, and is thus very fast. Our method also requires neither a nonideality model nor a training dataset, and is hence very easy to apply. Our experimental results using ternary neural networks with binary and 4-bit activations demonstrate that OBNC can recover the baseline performance in many variability settings and that it outperforms a previously known method (VCAM) by large margins when the input distribution is asymmetric or the activation is multi-bit.
Citations: 3
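The OBNC abstract above describes compensating for device variability through the statistics of the layer output, without retraining. A minimal sketch of that general idea follows. It is an assumed reading, not the authors' procedure: the per-device gain model, the helper names (`crossbar_dot`, `batch_stats`, `normalize`), and the rule "re-estimate the batch-norm mean and variance from a batch of hardware outputs" are all illustrative assumptions.

```python
import math
import random
import statistics


def crossbar_dot(weights, x, gains):
    """Dot product in which each weight is scaled by a fixed per-device gain.

    The multiplicative gain is a hypothetical stand-in for programming variability.
    """
    return sum(w * g * xi for w, g, xi in zip(weights, gains, x))


def batch_stats(values):
    """Return the (mean, population variance) of one batch of pre-activations."""
    return statistics.fmean(values), statistics.pvariance(values)


def normalize(y, mean, var, eps=1e-5):
    """Batch-norm style normalization without the learned scale and shift."""
    return (y - mean) / math.sqrt(var + eps)


rng = random.Random(0)
weights = [1, -1, 0, 1, 1, -1, 0, 1]                  # ternary weights
ideal_gains = [1.0] * len(weights)                     # variability-free device
device_gains = [rng.lognormvariate(0.0, 0.3) for _ in weights]
batch = [[rng.randint(0, 1) for _ in weights] for _ in range(256)]  # binary inputs

ideal = [crossbar_dot(weights, x, ideal_gains) for x in batch]
noisy = [crossbar_dot(weights, x, device_gains) for x in batch]

# Statistics fixed at training time no longer match the hardware output.
trained_mean, trained_var = batch_stats(ideal)
stale = [normalize(y, trained_mean, trained_var) for y in noisy]

# Assumed correction: re-estimate the statistics from the hardware output itself.
online_mean, online_var = batch_stats(noisy)
corrected = [normalize(y, online_mean, online_var) for y in noisy]

print("stale mean/std:    ", statistics.fmean(stale), statistics.pstdev(stale))
print("corrected mean/std:", statistics.fmean(corrected), statistics.pstdev(corrected))
```

The corrected outputs are zero-mean and unit-variance by construction, which is why this kind of correction needs no labels, nonideality model, or iterative tuning; whether it restores accuracy in a real network is what the paper evaluates.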
DynPaC: Coarse-Grained, Dynamic, and Partially Reconfigurable Array for Streaming Applications
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00018
Cheng Tan, Tong Geng, Chenhao Xie, Nicolas Bohm Agostini, Jiajia Li, Ang Li, K. Barker, Antonino Tumeo
Coarse-grained reconfigurable arrays (CGRAs) provide higher flexibility than application-specific integrated circuits (ASICs) and higher efficiency than fine-grained reconfigurable devices such as Field Programmable Gate Arrays (FPGAs). However, CGRAs are generally designed to support offloading of a single kernel. While their design, based on communicating functional units, appears to naturally suit streaming applications composed of multiple cooperating kernels, current approaches only statically partition the resources across kernels. However, streaming applications are often data-dependent, leading to variable kernel execution times depending on the input data and impacting the throughput of the entire pipeline if resources are statically allocated. Therefore, in this paper, we discuss the design of DynPaC, a coarse-grained, dynamically and partially reconfigurable array for data-dependent streaming applications. We discuss the software and hardware components required to manage partial dynamic reconfiguration. We demonstrate that by supporting partial dynamic reconfiguration, we can obtain an average speedup of 1.44× over static partitioning for a representative set of applications, with a limited area overhead (6.4% of the entire chip).
Citations: 4
Run-time Configurable Approximate Multiplier using Significance-Driven Logic Compression
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00029
Ibrahim Haddadi, Issa Qiqieh, R. Shafik, F. Xia, M. A. N. Al-hayanni, Alexandre Yakovlev
Designing energy-efficient hardware continues to be challenging due to arithmetic complexities. The problem is further exacerbated in systems powered by energy harvesters, as variable power levels can limit their computation capabilities. In this work, we propose a run-time configurable adaptive approximation method for multiplication that is capable of managing the energy and performance tradeoffs and is ideally suited to these systems. Central to our approach is a Significance-Driven Logic Compression (SDLC) multiplier architecture that can dynamically adjust the level of approximation depending on the run-time power/accuracy constraints. The architecture can be configured to operate in the exact mode (no approximation) or in progressively higher approximation modes (i.e., 2-bit to 4-bit SDLC). Our method is implemented in both ASIC and FPGA. The implementation results indicate that our design has only a 2.3% silicon overhead on top of what is required by a traditional exact multiplier. We evaluate the efficiency of the proposed design through a number of case studies. We show that our method achieves image fidelity similar to that of existing approximate methods, without a delay penalty. Further, the inclusion of the dynamic approximation techniques is justified by up to 62.6% energy savings when processing an image with a multiplier using 4-bit SDLC and 35% energy savings when using 2-bit SDLC. In addition, case study results show that the proposed approach incurs negligible loss in output quality, with a worst-case PSNR of 30 dB when using the 4-bit SDLC multiplier.
Citations: 1
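The abstract above describes a multiplier that switches between an exact mode and 2-bit to 4-bit SDLC modes. The sketch below is a deliberately simplified software model of one way such lossy compression could behave, assuming that each cluster of `depth` adjacent partial-product rows is merged with bitwise OR instead of addition. The function name `sdlc_multiply`, its parameters, and the OR-based clustering rule are illustrative assumptions and not the architecture reported in the paper.

```python
def sdlc_multiply(a: int, b: int, depth: int = 1, width: int = 8) -> int:
    """Approximate unsigned multiply of two `width`-bit operands.

    depth == 1 adds every partial-product row and is therefore exact.
    depth > 1 merges each cluster of `depth` adjacent rows with OR (lossy),
    then adds the compressed rows exactly.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # One shifted copy of `a` per set bit of `b`.
    rows = [(a << i) if (b >> i) & 1 else 0 for i in range(width)]
    total = 0
    for start in range(0, width, depth):
        compressed = 0
        for row in rows[start:start + depth]:
            compressed |= row  # OR drops the carries an adder would produce
        total += compressed
    return total


# Run-time configuration: the same call site selects the approximation level.
for depth in (1, 2, 4):
    print(depth, sdlc_multiply(13, 11, depth=depth))
# depth=1 prints 143 (exact); depth=2 prints 135
```

Because OR never exceeds the sum of non-negative operands, this model always underestimates or matches the exact product, and the error grows with the cluster depth, which mirrors the accuracy-for-energy tradeoff the abstract reports.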
Stochastic Iterative Approximation: Software/hardware techniques for adjusting aggressiveness of approximation
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00023
Tomoki Nakamura, Kazutaka Tomida, Shouta Kouno, H. Irie, S. Sakai
Approximate computing (AC) reduces power consumption and increases execution speed in exchange for computational accuracy. By adjusting the accuracy of approximation at runtime to reflect the optimal quality of the application, which changes constantly depending on the user’s cognitive ability and attention, AC achieves even higher efficiency. In this paper, we propose stochastic iterative approximation (SIA), which achieves dynamic and rapid control of the aggressiveness of the approximation. SIA executes a single binary with multiple levels of approximation aggressiveness that are dynamically adjusted. We propose a software implementation of SIA and hardware techniques to further improve the performance of SIA. We implement a compiler and a processor simulator for SIA as the dynamic approximation modules of RISC-V and evaluate their performance. Simulation results on six benchmarks show an adjustable trade-off between output quality and execution efficiency depending on the aggressiveness of the approximation in a single binary run.
Citations: 1
Erasure-Coded Multi-Block Updates Based on Hybrid Writes and Common XORs First
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00079
Yujun Liu, Bing Wei, W. Jigang, Limin Xiao
Erasure coding is widely used in storage systems since it can offer higher reliability at lower redundancy than data replication. However, erasure-coding-based storage systems have to perform multi-block updates for partial writes of an erasure coding group, which leads to a large number of XOR operations. This paper presents an efficient approach, named ECMU, for erasure-coded multi-block updates under a stringent latency constraint by scheduling update sequences. ECMU uses a hybrid of reconstructed-write and read-modify-write for the parity blocks of an erasure coding group: it dynamically selects the write scheme with fewer XORs for each parity block to be updated, in order to reduce the number of XORs. ECMU iteratively retrieves the unmodified parity blocks to calculate the minimum XORs for each write scheme. For all parity blocks to be updated, after the write schemes are determined, ECMU performs the common XORs first and then reuses the computational results to further reduce the number of XORs. ECMU caches a certain number of scheduling schemes to reduce the construction count of the scheduling schemes. Experimental results on real-world trace replaying show that the number of XORs and the update time can be reduced significantly compared with the state of the art.
Citations: 2
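The per-parity choice between read-modify-write and reconstructed-write described in the abstract above can be sketched as follows. This is a toy model under stated assumptions, not ECMU itself: each parity block is taken to be the plain XOR of a set of data blocks, the cost model (`2 * touched` XORs for read-modify-write versus `members - 1` for reconstructed-write) is a simplification, and computing every delta once before folding it into the parities is only one possible reading of "common XORs first". All names (`xor_blocks`, `choose_scheme`, `update_stripe`) are hypothetical.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def choose_scheme(members: set, updated: set) -> str:
    """Pick the write scheme with fewer XORs for one parity block."""
    touched = members & updated
    rmw_cost = 2 * len(touched)   # one XOR for each delta, one to fold it in
    rcw_cost = len(members) - 1   # XOR all member data blocks together
    return "rmw" if rmw_cost <= rcw_cost else "rcw"


def update_stripe(data, parities, layout, new_values):
    """Apply a multi-block update and return (data, parities, chosen schemes).

    data: block index -> bytes; parities: parity name -> bytes;
    layout: parity name -> set of member block indices;
    new_values: block index -> new bytes.
    """
    updated = set(new_values)
    # Shared work first: each delta is computed once and reused by every
    # parity block that is updated with read-modify-write.
    deltas = {i: xor_blocks(data[i], new_values[i]) for i in updated}
    new_data = {**data, **new_values}
    new_parities = dict(parities)
    schemes = {}
    for name, members in layout.items():
        touched = members & updated
        if not touched:
            continue  # this parity block does not cover any updated block
        scheme = choose_scheme(members, updated)
        schemes[name] = scheme
        if scheme == "rmw":
            parity = parities[name]
            for i in sorted(touched):
                parity = xor_blocks(parity, deltas[i])
        else:
            parity = reduce(xor_blocks, (new_data[i] for i in sorted(members)))
        new_parities[name] = parity
    return new_data, new_parities, schemes


data = {i: bytes([i + 1] * 4) for i in range(6)}
layout = {"p0": {0, 1, 2, 3}, "p1": {2, 3, 4, 5}}
parities = {n: reduce(xor_blocks, (data[i] for i in sorted(m))) for n, m in layout.items()}
update = {0: b"\xaa" * 4, 1: b"\xbb" * 4, 2: b"\xcc" * 4}
_, _, schemes = update_stripe(data, parities, layout, update)
print(schemes)
```

In this example three of the four members of `p0` change, so recomputing it from the data blocks is cheaper, while `p1` covers only one changed block and is patched in place with the shared delta.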
An Efficient Non-Profiled Side-Channel Attack on the CRYSTALS-Dilithium Post-Quantum Signature
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00094
Zhaohui Chen, Emre Karabulut, Aydin Aysu, Yuan Ma, Jiwu Jing
Post-quantum digital signature is a critical primitive of computer security in the era of quantum hegemony. As a finalist of the post-quantum cryptography standardization process, the theoretical security of the CRYSTALS-Dilithium (Dilithium) signature scheme has been quantified to withstand classical and quantum cryptanalysis. However, there is an inherent power side-channel information leakage in its implementation instance due to the physical characteristics of hardware. This work proposes an efficient non-profiled Correlation Power Analysis (CPA) strategy on Dilithium to recover the secret key by targeting the underlying polynomial multiplication arithmetic. We first develop a conservative scheme with a reduced key guess space, which can extract a secret key coefficient with 99.99% confidence using 157 power traces of the reference Dilithium implementation. However, this scheme suffers from the computational overhead caused by the large modulus in the Dilithium signature. To further accelerate the CPA run-time, we propose a fast two-stage scheme that selects a smaller search space and then resolves false positives. We finally construct a hybrid scheme that combines the advantages of both schemes. A real-world experiment on the power measurement data shows that our hybrid scheme improves the attack’s execution time by 7.77×.
Citations: 9
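The core loop of a non-profiled correlation power analysis, as named in the abstract above, can be sketched on simulated data. The sketch assumes a Hamming-weight leakage model, Gaussian measurement noise, a toy 8-bit secret, and a single modular multiplication as the targeted intermediate. None of these choices come from the paper, and the sketch does not reproduce its conservative, two-stage, or hybrid schemes. Only the modulus `Q = 8380417` is the actual Dilithium parameter.

```python
import random

Q = 8380417  # Dilithium modulus q = 2**23 - 2**13 + 1


def hamming_weight(x: int) -> int:
    """Number of set bits, used here as the assumed power leakage model."""
    return bin(x).count("1")


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def simulate_traces(secret, inputs, noise_sigma, rng):
    """One simulated power sample per known input (hypothetical device)."""
    return [hamming_weight((secret * c) % Q) + rng.gauss(0.0, noise_sigma)
            for c in inputs]


def cpa_recover(inputs, traces, candidates):
    """Return the guess whose predicted leakage best correlates with the traces."""
    scores = {}
    for guess in candidates:
        model = [hamming_weight((guess * c) % Q) for c in inputs]
        scores[guess] = pearson(model, traces)
    return max(scores, key=scores.get), scores


rng = random.Random(1)
inputs = [rng.randrange(1, Q) for _ in range(300)]   # known public operands
traces = simulate_traces(173, inputs, 1.0, rng)       # 173 is the toy secret
best, scores = cpa_recover(inputs, traces, range(1, 256))
print("recovered guess:", best)
```

The cost of this loop is proportional to the number of candidate guesses times the number of traces, which is why the size of the guess space, and hence the large modulus, dominates the run-time that the paper's faster schemes aim to reduce.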
Security Analysis of State-of-the-art Scan Obfuscation Technique
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00096
Yogendra Sao, Subidh Ali
Scan-based Design for Testability (DfT) is the de facto standard for detecting manufacturing-related faults in the chip manufacturing industry. The observability and accessibility provided by DfT can be misused to launch an attack that reveals the secret key embedded inside a crypto chip. Several countermeasures have been proposed to protect the chip against scan-based attacks. Dynamic obfuscation of scan data prevents scan-based attacks by corrupting scan data in the case of unauthorized access. In this paper, we perform a security analysis of this state-of-the-art obfuscation technique to showcase its vulnerabilities. Exploiting these vulnerabilities, we propose a scan-based signature attack on the technique that applies a maximum of 4096 plaintexts and uses only 220 signatures, with a 100% success rate.
Citations: 6
T-TSP: Transient-Temperature Based Safe Power Budgeting in Multi-/Many-Core Processors
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00083
Sobhan Niknam, A. Pathania, A. Pimentel
Power budgeting techniques allow thermally safe operation in multi-/many-core processors while still allowing for efficient exploitation of available thermal headroom. Core-level power budgeting techniques like Thermal Safe Power (TSP) have allowed for more efficient operations than chip-level power budgeting techniques like Thermal Design Power (TDP), since the finer granularity permits operations closer to the threshold temperature without thermal violations. State-of-the-art TSP bases its power budgeting calculations on the long-term steady-state temperature of cores while ignoring trends in their short-term transient temperature. In this paper, we propose a new power budgeting technique called T-TSP (Transient-Temperature-based Safe Power) that bases its calculation on the current temperature of the core, a detail ignored by TSP. T-TSP provides a dynamic power budget to a core, which inversely correlates with the core’s thermal headroom. Dynamic power budgeting with T-TSP allows cores to reach the threshold temperature faster than TSP and operate safely close to it in perpetuity. Therefore, it provides the same thermal guarantees as TSP but enables even more efficient exploitation of thermal headroom. We integrate T-TSP with a state-of-the-art thermal interval simulation toolchain. Our detailed evaluations show that benchmarks execute faster by up to 17.94%, and by 8.37% on average, when we do power budgeting with T-TSP instead of the state-of-the-art TSP. Finally, we make T-TSP publicly available in both its integrated and stand-alone forms.
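The difference between a steady-state budget and a budget computed from the current temperature can be illustrated with a toy model. The sketch below is not the authors' T-TSP algorithm: it assumes a single core modeled as one lumped thermal resistance and capacitance (a first-order RC node), and every function name and parameter value is hypothetical and chosen only for illustration.

```python
import math


def steady_state_budget(t_threshold, t_ambient, r_thermal):
    """Power (W) whose steady-state temperature equals the threshold.

    Assumption: one lumped RC node, so T_steady = T_ambient + P * R.
    """
    return (t_threshold - t_ambient) / r_thermal


def step_temperature(t_current, power, t_ambient, r_thermal, c_thermal, dt):
    """Temperature after dt seconds at constant power (exact first-order solution)."""
    decay = math.exp(-dt / (r_thermal * c_thermal))
    t_steady = t_ambient + power * r_thermal
    return t_steady + (t_current - t_steady) * decay


def transient_budget(t_current, t_threshold, t_ambient, r_thermal, c_thermal, dt):
    """Largest constant power (W) for the next dt seconds that ends exactly at
    the threshold temperature, given the current temperature.

    Obtained by solving step_temperature(...) == t_threshold for the power.
    """
    decay = math.exp(-dt / (r_thermal * c_thermal))
    numerator = t_threshold - t_ambient - decay * (t_current - t_ambient)
    return numerator / (r_thermal * (1.0 - decay))


# Hypothetical parameters: 45 C ambient, 80 C threshold, 2 K/W, 0.05 J/K, 10 ms epochs.
T_AMB, T_TH, R, C, DT = 45.0, 80.0, 2.0, 0.05, 0.01

temperature = 60.0
for epoch in range(3):
    budget = transient_budget(temperature, T_TH, T_AMB, R, C, DT)
    temperature = step_temperature(temperature, budget, T_AMB, R, C, DT)
    print(f"epoch {epoch}: budget = {budget:.2f} W, temperature = {temperature:.2f} C")
```

In this toy model the budget computed at the threshold temperature reduces to the steady-state value, so once the node reaches the threshold it stays there without exceeding it.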
Citations: 6