
Latest Publications: 2021 IEEE 39th International Conference on Computer Design (ICCD)

PRL: Standardizing Performance Monitoring Library for High-Integrity Real-Time Systems
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00061
J. Giesen, E. Mezzetti, J. Abella, F. Cazorla
The use of complex processors is becoming ubiquitous in High-Integrity Systems (HIS). To deal with processors' increased complexity, Performance Monitoring Counters (PMCs) are increasingly used to reason about software behavior and to provide the evidence needed to support software certification. However, the use of PMCs in HIS is relatively recent and hence far from standardized. As a result, software engineers are forced to resort to highly customized, low-level programming of platform-specific PMC control registers, which is both error-prone and time-consuming. To close this gap, we propose building on the PAPI library, a standardized performance monitoring solution in the mainstream domain, and develop a PMC Reading Library (PRL) for configuring and collecting traceable events while capturing HIS-specific requirements and peculiarities. We instantiate PRL in a reference automotive configuration to show that PRL meets key HIS requirements: negligible footprint, limited and predictable overhead, and accurate collection of hardware events by filtering out the impact of interrupts and context switches.
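The sanitization step the abstract describes (subtracting events accrued during interrupts and context switches from a region's raw counter delta) can be illustrated with a toy, purely simulated model. The class and method names below are hypothetical and do not reflect PRL's actual API; the hardware counter is simulated by an integer.

```python
class CounterRegion:
    """Toy model of one PMC event counter over a measured code region.

    Illustrative sketch only: a real library would read a hardware PMC
    register instead of incrementing self._raw.
    """

    def __init__(self):
        self._raw = 0        # free-running event counter (simulated)
        self._start = 0
        self._stolen = 0     # events accrued inside interrupt handlers

    def simulate_events(self, n):
        # Stands in for the hardware incrementing the PMC.
        self._raw += n

    def begin(self):
        self._start = self._raw
        self._stolen = 0

    def on_interrupt(self, handler_events):
        # Hook invoked around an interrupt: its events pollute the raw
        # delta, so remember them for later subtraction.
        self.simulate_events(handler_events)
        self._stolen += handler_events

    def end(self):
        # Sanitized count = raw delta minus interrupt contributions.
        return (self._raw - self._start) - self._stolen


region = CounterRegion()
region.begin()
region.simulate_events(1000)   # events from the measured task itself
region.on_interrupt(150)       # a timer interrupt fires mid-region
region.simulate_events(500)
print(region.end())            # 1500: the 150 interrupt events are filtered out
```

The same bookkeeping generalizes to context switches: snapshot the counter when the task is switched out, and subtract whatever other tasks accrued before it is switched back in.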
Citations: 0
Copyright
Pub Date : 2021-10-01 DOI: 10.1109/iccd53106.2021.00003
Citations: 0
HosNa: A DPC++ Benchmark Suite for Heterogeneous Architectures
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00084
Najmeh Nazari Bavarsad, Hosein Mohammadi Makrani, H. Sayadi, Lawrence Landis, S. Rafatirad, H. Homayoun
Most data centers equip their general-purpose processors with hardware accelerators to reduce power consumption and improve utilization. Hardware accelerators offer highly energy-efficient computation for a wide range of applications; however, programming them is not as straightforward as programming processors. To bridge this gap, Intel developed a cloud-based infrastructure called DevCloud that connects Intel® Xeon® Scalable Processors to GPUs and FPGAs to deliver high compute performance for emerging workloads. DevCloud assists developers with their compute-intensive tasks and provides access to precompiled software optimized for Intel® architecture. To reduce programming complexity and lower the barriers to adopting innovative new hardware technology, Intel also provides a unified, cross-architecture programming model called oneAPI, based on the Data-Parallel C++ (DPC++) language. In this paper, we introduce HosNa, the first DPC++ benchmark suite that can be used to evaluate Intel FPGAs and DPC++ productivity. Moreover, we characterize the proposed benchmarks and evaluate the implemented hardware accelerators in terms of speedup and latency.
Citations: 2
Efficient Table-Based Polynomial on FPGA
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00066
Marco Barbone, B. W. Kwaadgras, U. Oelfke, W. Luk, G. Gaydadjiev
Field Programmable Gate Arrays (FPGAs) are gaining popularity in scientific computing due to recent advances in High-Level Synthesis (HLS) toolchains for customised hardware implementations, combined with the increasing computing capabilities of modern FPGAs. As a result, developers are able to implement more complex scientific workloads, which often require the evaluation of univariate numerical functions. In this study, we propose a methodology for table-based polynomial interpolation aimed at producing area-efficient FPGA implementations of such functions that achieve the same accuracy and similar performance as direct implementations. We also provide a rigorous error analysis to guarantee the correctness of the results. Our methodology forecasts the resource utilisation of the polynomial interpolator and, based on the characteristics of the function, guides the developer to the most area-efficient FPGA implementation. Our experiments show that, for the radiation spectrum of a black-body application based on evaluating Planck's Law, resource utilisation can be reduced by up to 90% compared to direct implementations that do not use table-based methods. Moreover, when only the kernels are considered, our method uses up to two orders of magnitude fewer resources with no performance penalty. Building on previous, more theoretical work, our study investigates practical applications of table-based methods in high-performance and scientific computing, where they are used to implement common but more complex functions than the elementary functions widely studied in the related literature.
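The core idea of table-based evaluation can be sketched in a few lines: split the domain into segments, store one low-degree polynomial per segment, and evaluate with a table lookup plus a multiply-add. The sketch below uses degree-1 segments for exp on [0, 1) with an illustrative table size (the paper's FPGA designs use higher degrees, but the indexing and error-bound reasoning are analogous).

```python
import math

# Minimal software sketch of table-based piecewise-polynomial interpolation.
# The table stores one polynomial (here: slope, intercept) per segment.

N = 64                        # number of table entries (illustrative choice)
LO, HI = 0.0, 1.0
STEP = (HI - LO) / N

# Precompute coefficients: each segment interpolates f at its endpoints.
table = []
for i in range(N):
    x0, x1 = LO + i * STEP, LO + (i + 1) * STEP
    slope = (math.exp(x1) - math.exp(x0)) / STEP
    table.append((slope, math.exp(x0) - slope * x0))

def exp_approx(x):
    """Evaluate exp(x) on [LO, HI) with one table lookup and one multiply-add."""
    i = min(int((x - LO) / STEP), N - 1)
    slope, intercept = table[i]
    return slope * x + intercept

# Classic error bound for linear interpolation: |f(x) - p(x)| is at most
# max|f''| * STEP**2 / 8, which for exp on [0, 1] is e * STEP**2 / 8 (~8.3e-5).
worst = max(abs(exp_approx(k / 4096) - math.exp(k / 4096)) for k in range(4096))
print(worst < math.e * STEP**2 / 8)   # True
```

Doubling N quarters the error bound, which is exactly the area/accuracy trade-off a resource-utilisation forecast has to navigate.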
Citations: 2
AdaptBit-HD: Adaptive Model Bitwidth for Hyperdimensional Computing
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00026
Justin Morris, Si Thu Kaung Set, Gadi Rosen, M. Imani, Baris Aksanli, T. Simunic
Brain-inspired Hyperdimensional (HD) computing is a novel computing paradigm emulating neuronal activity in high-dimensional space. The first step in HD computing is to map each data point into high-dimensional space (e.g., 10,000 dimensions). This poses several problems. For instance, the size of the data can explode, and all subsequent operations need to be performed in parallel across D = 10,000 dimensions. Prior work alleviated this issue with model quantization: the hypervectors (HVs) can then be stored in less space than the original data, and lower-bitwidth operations can be used to save energy. However, prior work quantized all samples to the same bitwidth. We propose AdaptBit-HD, an Adaptive Model Bitwidth Architecture for accelerating HD computing. AdaptBit-HD operates on the bits of the quantized model one bit at a time, saving energy whenever fewer bits suffice to find the correct class. With AdaptBit-HD, we achieve both high accuracy, by utilizing all the bits when necessary, and high energy efficiency, by terminating execution at lower bits when the design is confident in the output. We additionally design an end-to-end FPGA accelerator for AdaptBit-HD. Compared to 16-bit models, AdaptBit-HD is 14× more energy efficient, and compared to binary models, it is 1.1% more accurate, which is comparable in accuracy to 16-bit models. This demonstrates that AdaptBit-HD achieves the accuracy of full-precision models with the energy efficiency of binary models.
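The bit-at-a-time evaluation with early termination can be illustrated with a toy classifier (names, bitwidth, and data below are illustrative, not the paper's architecture): class scores are accumulated one bit-plane of the quantized prototypes at a time, most significant bit first, and evaluation stops once the remaining low-order bits cannot overturn the current winner.

```python
# Toy sketch of MSB-first score accumulation with confidence-based early exit.

BITS = 4  # prototype elements quantized to 4-bit unsigned values (assumed)

def classify(query, prototypes):
    """Return (winning class, number of bit-planes actually evaluated)."""
    partial = {c: 0 for c in prototypes}
    q_sum = sum(query)
    for b in range(BITS - 1, -1, -1):              # MSB first
        for c, proto in prototypes.items():
            # Contribution of bit-plane b: dot(query, b-th bits) << b.
            plane = sum(q for q, w in zip(query, proto) if (w >> b) & 1)
            partial[c] += plane << b
        ranked = sorted(partial, key=partial.get, reverse=True)
        # Most any class can still gain from the unread low-order planes:
        remaining_max = ((1 << b) - 1) * q_sum
        if partial[ranked[0]] >= partial[ranked[1]] + remaining_max:
            return ranked[0], BITS - b             # confident: stop early
    return ranked[0], BITS

protos = {"cat": [15, 14, 1, 0], "dog": [0, 1, 15, 14]}
label, bits_used = classify([1, 1, 0, 0], protos)
print(label, bits_used)   # cat 1: decided after a single bit-plane
```

When the two top scores stay close, all BITS planes are consumed, recovering full-precision accuracy; the energy saving comes from the easy cases that terminate early.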
Citations: 2
The Accuracy and Efficiency of Posit Arithmetic
Pub Date : 2021-09-16 DOI: 10.1109/ICCD53106.2021.00024
Ștefan-Dan Ciocîrlan, Dumitrel Loghin, Lavanya Ramapantulu, N. Tapus, Y. M. Teo
Motivated by the increasing interest in the posit numeric format, in this paper we evaluate the accuracy and efficiency of posit arithmetic against the traditional IEEE 754 32-bit floating-point (FP32) arithmetic. We first design and implement a Posit Arithmetic Unit (PAU), called POSAR, with flexible bit-sized arithmetic suitable for applications that can trade accuracy for savings in chip area. Next, we analyze the accuracy and efficiency of POSAR with a series of benchmarks including mathematical computations, ML kernels, the NAS Parallel Benchmarks (NPB), and a Cifar-10 CNN. This analysis is done on our implementation of POSAR integrated into a RISC-V Rocket Chip core, in comparison with Rocket Chip's IEEE 754-based Floating-Point Unit (FPU). Our analysis shows that POSAR can outperform the FPU, but the results are not spectacular. For NPB, 32-bit posit achieves better accuracy than FP32 and improves execution time by up to 2%. However, POSAR with 32-bit posit needs 30% more FPGA resources than the FPU. For classic ML algorithms, we find that 8-bit posits are not suitable replacements for FP32 because their low accuracy leads to wrong results. Instead, 16-bit posit offers the best trade-off between accuracy and efficiency. For example, 16-bit posit achieves the same Top-1 accuracy as FP32 on a Cifar-10 CNN, with a speedup of 18%.
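A posit packs a sign bit, a variable-length regime field, up to `es` exponent bits, and a fraction into a fixed width, representing (-1)^sign * (2^(2^es))^k * 2^e * (1 + fraction). The pure-Python decoder below makes that encoding concrete; it is an illustrative software sketch, not part of POSAR (a hardware unit), and the defaults n=8, es=1 are an arbitrary choice for demonstration.

```python
def decode_posit(bits, n=8, es=1):
    """Decode an n-bit posit with es exponent bits into a float.

    value = (-1)^sign * (2^(2^es))^k * 2^e * (1 + fraction)
    """
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")                    # NaR ("Not a Real")
    sign = -1.0 if bits >> (n - 1) else 1.0
    if sign < 0:
        bits = (1 << n) - bits                 # two's-complement magnitude
    body = bits & ((1 << (n - 1)) - 1)         # everything after the sign bit
    # Regime: a run of m identical bits, terminated by its complement
    # (or by the end of the word).
    first = (body >> (n - 2)) & 1
    m = 0
    for i in range(n - 2, -1, -1):
        if (body >> i) & 1 != first:
            break
        m += 1
    k = m - 1 if first else -m
    rest = n - 2 - m                           # bits left after regime + terminator
    tail = body & ((1 << rest) - 1) if rest > 0 else 0
    e_bits = min(es, max(rest, 0))
    e = (tail >> (rest - e_bits)) << (es - e_bits) if rest > 0 else 0
    f_bits = max(rest - es, 0)
    frac = (tail & ((1 << f_bits) - 1)) / (1 << f_bits) if f_bits > 0 else 0.0
    return sign * 2.0 ** (k * (1 << es) + e) * (1.0 + frac)


print(decode_posit(0b01000000))   # 1.0
print(decode_posit(0b01001000))   # 1.5
print(decode_posit(0b01010000))   # 2.0
print(decode_posit(0b11000000))   # -1.0
```

The variable-length regime is what gives posits their tapered accuracy: values near 1.0 get the most fraction bits, while the largest representable value here (0b01111111) spends all its bits on the regime and decodes to 4096.0.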
Citations: 5
QFlow: Quantitative Information Flow for Security-Aware Hardware Design in Verilog
Pub Date : 2021-09-06 DOI: 10.1109/ICCD53106.2021.00097
Lennart M. Reimann, Luca Hanel, Dominik Sisejkovic, Farhad Merchant, R. Leupers
The enormous amount of code required to design modern hardware implementations often leads to critical vulnerabilities being overlooked. Vulnerabilities that compromise the confidentiality of sensitive data, such as cryptographic keys, have an especially severe impact on the trustworthiness of an entire system. Information flow analysis can determine whether information from sensitive signals flows towards outputs or untrusted components of the system. However, most of these analysis strategies rely on the non-interference property, which states that untrusted targets must not be influenced by the source's data, and which has been shown to be too inflexible for many applications. To address this issue, approaches exist that quantify the information flow between components so that insignificant leakage can be neglected. Due to the high computational complexity of this quantification, approximations are needed, which introduce mispredictions. To overcome these limitations, we reformulate the approximations. Furthermore, we propose QFlow, a tool with a higher detection rate than previous tools. It can be used by inexperienced users to identify data leakages in hardware designs, thus facilitating a security-aware design process.
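A toy example of what "quantifying" an information flow means (illustrative only, not QFlow's algorithm): for a deterministic function of a uniformly distributed secret, the Shannon leakage equals the entropy of the observable output, so strict non-interference corresponds to exactly zero bits, while small non-zero values can be deemed insignificant.

```python
import math
from collections import Counter

def leakage_bits(f, secrets):
    """Shannon leakage of deterministic f under a uniform secret.

    For deterministic f this mutual information I(secret; output)
    reduces to the entropy of the output distribution.
    """
    n = len(list(secrets))
    out_counts = Counter(f(s) for s in secrets)
    return sum((c / n) * math.log2(n / c) for c in out_counts.values())

secrets = range(8)                               # a 3-bit secret
print(leakage_bits(lambda s: s, secrets))        # 3.0 bits: full leak
print(leakage_bits(lambda s: s & 1, secrets))    # 1.0 bit: only parity leaks
print(leakage_bits(lambda s: 0, secrets))        # 0.0 bits: non-interference
```

Note how the parity example violates strict non-interference (the output does depend on the secret) yet leaks only one of the three secret bits, which is precisely the flexibility a quantitative analysis buys.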
Citations: 12
Consistent RDMA-Friendly Hashing on Remote Persistent Memory
Pub Date : 2021-07-14 DOI: 10.1109/ICCD53106.2021.00037
Xinxin Liu, Yu Hua, Rong Bai
Coalescing RDMA and Persistent Memory (PM) delivers high end-to-end performance for networked storage systems, which requires rethinking the design of efficient hash structures. In general, existing hashing schemes optimize RDMA and PM separately, thus only partially addressing the problems of RDMA access amplification and high-overhead PM consistency. To address both problems, we propose continuity hashing, a "one stone, two birds" design that optimizes RDMA and PM together. Continuity hashing leverages a fine-grained contiguous shared region, called SBuckets, to provide standby positions for the two neighbouring buckets in case of hash collisions. In continuity hashing, a remote read needs only a single RDMA read to directly fetch the home bucket and the neighbouring SBuckets, which together contain all the positions that may hold a key-value item, thus alleviating RDMA access amplification. Continuity hashing further leverages indicators that can be atomically modified to support log-free PM consistency for all write operations. Evaluation results demonstrate that, compared with state-of-the-art techniques, continuity hashing achieves high throughput, low latency, and the smallest number of PM writes, with acceptable load factors.
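The layout idea can be sketched in plain Python (an illustrative simplification, not the paper's exact scheme): each pair of neighbouring buckets shares a small contiguous SBuckets region of standby slots, so every position a key may occupy lies in one contiguous range, and a lookup touches only the home bucket plus that shared region, i.e., what would be a single RDMA read.

```python
# Toy continuity-hashing layout. Sizes are arbitrary illustrative choices.
BUCKET_SLOTS = 2
SBUCKET_SLOTS = 2
NUM_BUCKETS = 4   # bucket i shares its SBuckets region with bucket i ^ 1

buckets = [[None] * BUCKET_SLOTS for _ in range(NUM_BUCKETS)]
sbuckets = [[None] * SBUCKET_SLOTS for _ in range(NUM_BUCKETS // 2)]

def insert(key, value):
    home = hash(key) % NUM_BUCKETS
    for slot in range(BUCKET_SLOTS):          # try the home bucket first
        if buckets[home][slot] is None:
            buckets[home][slot] = (key, value)
            return True
    shared = sbuckets[home // 2]              # standby slots shared with neighbour
    for slot in range(SBUCKET_SLOTS):
        if shared[slot] is None:
            shared[slot] = (key, value)
            return True
    return False                              # full: a real design would resize

def lookup(key):
    # Home bucket + its SBuckets hold every position the key may occupy,
    # so one contiguous (here: emulated) read covers them all.
    home = hash(key) % NUM_BUCKETS
    for kv in buckets[home] + sbuckets[home // 2]:
        if kv is not None and kv[0] == key:
            return kv[1]
    return None

for i in range(3):
    insert(f"key{i}", i)
print(lookup("key2"))   # 2
```

The contrast is with cuckoo-style schemes, where a key's alternative positions live in unrelated buckets and thus cost extra round trips; here the standby positions are physically adjacent by construction.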
Citations: 2
SME: ReRAM-based Sparse-Multiplication-Engine to Squeeze-Out Bit Sparsity of Neural Network
Pub Date : 2021-03-02 DOI: 10.1109/ICCD53106.2021.00072
Fangxin Liu, Wenbo Zhao, Yilong Zhao, Zongwu Wang, Tao Yang, Zhezhi He, Naifeng Jing, Xiaoyao Liang, Li Jiang
The Resistive Random-Access Memory (ReRAM) crossbar is a promising technique for deep neural network (DNN) accelerators, thanks to its in-memory and in-situ analog computing abilities for Vector-Matrix Multiplication-and-Accumulations (VMMs). However, it is challenging for the crossbar architecture to exploit the sparsity in DNNs: the tightly coupled crossbar structure inevitably requires complex and costly control to exploit fine-grained sparsity. As a countermeasure, we develop a novel ReRAM-based DNN accelerator named the Sparse-Multiplication-Engine (SME), based on a hardware/software co-design framework. First, we orchestrate the bit-sparse pattern to increase the density of bit sparsity on top of existing quantization methods. Second, we propose a novel weight-mapping mechanism that slices the bits of a weight across the crossbars and splices the activation results in peripheral circuits. This mechanism decouples the tightly coupled crossbar structure and accumulates the sparsity in the crossbar. Finally, a superior squeeze-out scheme empties the crossbars mapped with highly sparse non-zeros from the previous two steps. We design the SME architecture and discuss its use with other quantization methods and different ReRAM cell technologies. Compared with prior state-of-the-art designs, SME shrinks the use of crossbars by up to 8.7× and 2.1× for ResNet-50 and MobileNet-v2, respectively, with ≤ 0.3% accuracy drop on ImageNet.
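The payoff of bit sparsity is easy to demonstrate in software: a bit-serial multiplier accumulates one shifted partial product per *set* weight bit, so zero bits cost nothing. The helpers below are an illustrative sketch of that principle, not SME's crossbar mapping; the 8-bit weights are an assumed quantization.

```python
def bitserial_mul(x, w, bits=8):
    """Multiply x by an unsigned `bits`-wide weight, one bit-plane at a time.

    Returns (product, number of shift-add operations actually performed).
    """
    acc, ops = 0, 0
    for b in range(bits):
        if (w >> b) & 1:            # zero bit-planes are skipped entirely
            acc += x << b
            ops += 1
    return acc, ops

def bit_sparsity(weights, bits=8):
    """Fraction of zero bits across all weights: the work that can be skipped."""
    total = len(weights) * bits
    ones = sum(bin(w).count("1") for w in weights)
    return 1 - ones / total

x, w = 5, 0b10000001              # only 2 of the 8 weight bits are set
product, adds = bitserial_mul(x, w)
print(product == x * w, adds)     # True 2: two shift-adds instead of eight

print(bit_sparsity([0b10000001, 0b00000000]))   # 0.875
```

This is why orchestrating weights toward bit-sparse patterns matters: the higher `bit_sparsity` is, the fewer partial products a bit-serial datapath (or, in SME's case, a sliced crossbar) has to evaluate.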
电阻随机存取存储器(ReRAM)交叉棒是一种很有前途的深度神经网络(DNN)加速器技术,由于其在内存和原位模拟计算向量矩阵乘法和累积(vmm)的能力。然而,crossbar架构很难利用深度神经网络的稀疏性。由于紧耦合的横杆结构的限制,利用细粒度稀疏性的控制不可避免地会造成复杂和昂贵的控制。作为对策,我们基于硬件和软件协同设计框架,开发了一种新的基于rerram的深度神经网络加速器,命名为稀疏乘法引擎(SME)。首先,我们在现有量化方法的基础上编排了位稀疏模式以增加位稀疏密度。其次,我们提出了一种新的权值映射机制,将权值的比特分割到横条上,并将激活结果拼接到外围电路中。该机制可以解耦紧耦合的横杆结构,并在横杆中积累稀疏性。最后,一种优越的挤出方案清空前两步中由高度稀疏的非零映射的交叉条。我们设计了SME架构,并讨论了它在其他量化方法和不同的ReRAM单元技术中的应用。与之前最先进的设计相比,SME在使用ResNet-50和MobileNet-v2时将横梁的使用分别减少了8.7倍和2.1倍,在使用ImageNet时精度下降≤0.3%。
{"title":"SME: ReRAM-based Sparse-Multiplication-Engine to Squeeze-Out Bit Sparsity of Neural Network","authors":"Fangxin Liu, Wenbo Zhao, Yilong Zhao, Zongwu Wang, Tao Yang, Zhezhi He, Naifeng Jing, Xiaoyao Liang, Li Jiang","doi":"10.1109/ICCD53106.2021.00072","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00072","url":null,"abstract":"Resistive Random-Access-Memory (ReRAM) cross-bar is a promising technique for deep neural network (DNN) accelerators, thanks to its in-memory and in-situ analog computing abilities for Vector-Matrix Multiplication-and-Accumulations (VMMs). However, it is challenging for crossbar architecture to exploit the sparsity in DNNs. It inevitably causes complex and costly control to exploit fine-grained sparsity due to the limitation of tightly-coupled crossbar structure. As the countermeasure, we develop a novel ReRAM-based DNN accelerator, named Sparse-Multiplication-Engine (SME), based on a hardware and software co-design framework. First, we orchestrate the bit-sparse pattern to increase the density of bit-sparsity based on existing quantization methods. Second, we propose a novel weight mapping mechanism to slice the bits of a weight across the crossbars and splice the activation results in peripheral circuits. This mechanism can decouple the tightly-coupled crossbar structure and cumulate the sparsity in the crossbar. Finally, a superior squeeze-out scheme empties the crossbars mapped with highly-sparse non-zeros from the previous two steps. We design the SME architecture and discuss its use for other quantization methods and different ReRAM cell technologies. Compared with prior state-of-the-art designs, the SME shrinks the use of crossbars up to 8.7× and 2.1× using ResNet-50 and MobileNet-v2, respectively, with ≤ 0.3% accuracy drop on ImageNet.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130538875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Journal
2021 IEEE 39th International Conference on Computer Design (ICCD)