Latest Publications: 2021 IEEE 39th International Conference on Computer Design (ICCD)

Universal Neural Network Acceleration via Real-Time Loop Blocking
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00053
Jiaqi Zhang, Xiangru Chen, S. Ray
DNN workloads and accelerators are becoming increasingly heterogeneous and dynamic. Existing DNN acceleration solutions fail to address these challenges because they rely on either complicated ad hoc mapping or clumsy exhaustive search. To this end, this paper first proposes a formalization model that can comprehensively describe the accelerator design space. Instead of enforcing certain customized dataflows, the proposed model explicitly captures the intrinsic hardware functions of a given accelerator. We connect these functions with the data-reuse opportunities of DNN computation and build a correspondence between DNN loop blocking and accelerator constraints. Based on this, we implement an algorithm that efficiently and effectively performs universal loop blocking for various DNNs and accelerators without manual specification. The evaluation shows that our results achieve 2.1x speedup and 1.5x better energy efficiency over a dataflow-defined algorithm, as well as significant improvement in blocking latency compared with search-based methods.
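The loop blocking the paper automates is the classic tiling transformation. A minimal sketch, with a hypothetical tile size standing in for accelerator buffer-capacity constraints:

```python
# Sketch of loop blocking (tiling) on a matrix multiply. The tile size T is a
# hypothetical stand-in for an accelerator's on-chip buffer constraint; this is
# the generic transformation, not the paper's blocking algorithm.

def matmul_naive(A, B, n):
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_blocked(A, B, n, T):
    # Each T x T tile of A and B is reused while it stays resident, which is
    # the data-reuse opportunity that blocking exposes to the hardware.
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, T):
        for jj in range(0, n, T):
            for kk in range(0, n, T):
                for i in range(ii, min(ii + T, n)):
                    for j in range(jj, min(jj + T, n)):
                        acc = C[i][j]
                        for k in range(kk, min(kk + T, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C

n = 8
A = [[(i * n + j) % 5 for j in range(n)] for i in range(n)]
B = [[(i + 2 * j) % 7 for j in range(n)] for i in range(n)]
assert matmul_blocked(A, B, n, T=4) == matmul_naive(A, B, n)
```

Picking T per layer and per accelerator, subject to buffer sizes and dataflow constraints, is the search the paper replaces with its formalized model.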
Citations: 0
Exploiting Intra-SM Parallelism in GPUs via Persistent and Elastic Blocks
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00054
Han Zhao, Weihao Cui, Quan Chen, Jieru Zhao, Jingwen Leng, M. Guo
Emerging GPUs have multiple Streaming Multiprocessors (SMs), each comprising CUDA Cores and Tensor Cores. While CUDA Cores perform general computation, Tensor Cores are designed to speed up matrix multiplication for deep learning applications. However, a GPU kernel typically uses either CUDA Cores or Tensor Cores, leaving the other processing units idle. Although many prior works co-locate kernels to improve GPU utilization, they cannot leverage intra-SM CUDA Core/Tensor Core parallelism. We therefore propose Plasticine, which exploits intra-SM parallelism to maximize GPU throughput, combining compilation and runtime scheduling to achieve this. Experimental results on an Nvidia 2080Ti GPU show that Plasticine improves system-wide throughput by 15.3% compared with prior co-location work.
Citations: 9
ReSpar: Reordering Algorithm for ReRAM-based Sparse Matrix-Vector Multiplication Accelerator
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00050
Yi-Jou Hsiao, Chin-Fu Nien, Hsiang-Yun Cheng
Sparse matrix-vector multiplication (SpMV) is a crucial operation in several key application domains, such as graph analytics and scientific computing, in the era of big data. The performance of SpMV is bounded by data transmissions across memory channels in conventional von Neumann systems. Emerging metal-oxide resistive random access memory (ReRAM) has shown its potential to address this memory-wall challenge by performing SpMV directly within its crossbar arrays. However, due to the tightly coupled crossbar structure, such ReRAM-based processing-in-memory architectures are unlikely to skip all redundant data loading and computations on zero-valued entries of the sparse matrix. These unnecessary ReRAM writes and computations hurt energy efficiency. As only crossbar-sized sub-matrices with full-zero entries can be skipped, prior studies have proposed matrix reordering methods that aggregate non-zero entries into few crossbar arrays, so that more full-zero crossbar arrays can be skipped. Nevertheless, the effectiveness of prior reordering methods is constrained by the original ordering of matrix rows. In this paper, we show that the number of full-zero sub-matrices derived by these prior studies is, in some cases, smaller than a theoretical lower bound, indicating that there is still room for improvement. Hence, we propose a novel reordering algorithm, ReSpar, which aggregates matrix rows with similar non-zero column entries and concentrates the non-zero columns to increase zero-skipping opportunities. Results show that ReSpar achieves 1.68× and 1.37× more energy savings while reducing the required number of crossbar loads by 40.4% and 27.2% on average.
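The zero-skipping idea can be illustrated with a toy reordering: grouping rows by their non-zero column pattern (a crude stand-in for ReSpar's similarity-based aggregation, whose details the abstract does not spell out) increases the number of full-zero crossbar-sized tiles:

```python
# Toy illustration of crossbar zero-skipping. The "reordering" here is a naive
# sort by row sparsity pattern, not ReSpar's algorithm; it only shows why
# clustering similar rows creates more skippable full-zero tiles.

def zero_tiles(rows, tile):
    """Count tile x tile sub-matrices containing only zeros."""
    n, m = len(rows), len(rows[0])
    count = 0
    for r0 in range(0, n, tile):
        for c0 in range(0, m, tile):
            if all(rows[r][c] == 0
                   for r in range(r0, min(r0 + tile, n))
                   for c in range(c0, min(c0 + tile, m))):
                count += 1
    return count

# 0/1 sparsity pattern: rows 0 and 2 share one column pattern, rows 1 and 3
# share another, but they are interleaved in the original order.
pattern = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

before = zero_tiles(pattern, tile=2)
reordered = sorted(pattern, key=tuple, reverse=True)  # group identical rows
after = zero_tiles(reordered, tile=2)

assert before == 0 and after == 2  # two 2x2 crossbar loads can now be skipped
```

In the original order no 2x2 tile is entirely zero, so every crossbar must be loaded; after grouping, half the tiles are full-zero and skippable.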
Citations: 0
Comprehensive Failure Analysis against Soft Errors from Hardware and Software Perspectives
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00041
Yohan Ko, Hwisoo So, Jinhyo Jung, Kyoungwoo Lee, Aviral Shrivastava
With technology scaling, reliability against soft errors is becoming an important design concern for modern embedded systems. To avoid the high cost and performance overheads of full protection techniques, several studies have turned their focus to selective protection techniques, which increases the need to accurately identify the most vulnerable components or instructions in a system. In this paper, we analyze the vulnerability of a system from both the hardware and software perspectives through intensive fault-injection trials. From the hardware perspective, we find the most vulnerable hardware components by calculating component-wise failure rates. From the software perspective, we identify the most vulnerable instructions using a novel root-cause instruction analysis. Our results show that it is possible to reduce the failure rate of a system to only 12.40% with minimal protection.
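Component-wise failure rates of the kind the hardware-side analysis computes are simple ratios over fault-injection trials. A minimal sketch, with made-up component names and counts:

```python
# Sketch of ranking hardware components by fault-injection failure rate.
# Component names and trial counts are invented for illustration; the paper's
# methodology, targets, and numbers are not reproduced here.

trials = {
    # component: (injections performed, failures observed)
    "register_file":    (10_000, 950),
    "reorder_buffer":   (10_000, 310),
    "load_store_queue": (10_000, 120),
}

def failure_rate(injections, failures):
    return failures / injections

rates = {comp: failure_rate(*t) for comp, t in trials.items()}
most_vulnerable = max(rates, key=rates.get)  # candidate for selective protection

assert most_vulnerable == "register_file"
assert abs(rates["register_file"] - 0.095) < 1e-12
```

Selective protection then spends its hardening budget on the components (or, on the software side, the root-cause instructions) at the top of this ranking.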
Citations: 1
Empirical Guide to Use of Persistent Memory for Large-Scale In-Memory Graph Analysis
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00057
Hanyeoreum Bae, Miryeong Kwon, Donghyun Gouk, Sanghyun Han, Sungjoon Koh, Changrim Lee, Dongchul Park, Myoungsoo Jung
We investigate runtime environment characteristics and explore the challenges of conventional in-memory graph processing. This system-level analysis includes empirical results and observations that run counter to the existing expectations of graph application users. Specifically, since raw graph data are not the same as in-memory graph data, processing a billion-scale graph exhausts all system resources and makes the target system unavailable due to out-of-memory errors at runtime. To address this lack of memory space for large-scale graph analysis, we configure real persistent memory devices (PMEMs) with different operation modes and system software frameworks. In this work, we introduce PMEM to a representative in-memory graph system, Ligra, and perform an in-depth analysis uncovering the performance behaviors of different PMEM-applied in-memory graph systems. Based on our observations, we modify Ligra to improve graph processing performance with a solid level of data persistence. Our evaluation results reveal that Ligra, with our simple modification, exhibits 4.41× and 3.01× better performance than the original Ligra running on a virtual memory expansion and conventional persistent memory, respectively.
Citations: 2
An Efficient Hybrid Parallel Compression Approximate Multiplier
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00028
Shangshang Yao, L. Zhang, Qiong Wang, Libin Shen
Approximate computing has been widely used in many fault-tolerant applications. With multiplication being a key kernel in such applications, improving the efficiency of approximate multipliers is significant for achieving high computational performance. This paper proposes a novel approximate multiplier design based on using different compressors for different regions of the partial products. We designed two Preprocessing Units (PUs) to explore the best efficiency by increasing the number of sparse partial products. Multiple 8-bit multipliers are designed in Verilog and synthesized under a 45-nm CMOS technology. Compared with the conventional Wallace tree multiplier, experimental results indicate that one of our proposed multipliers reduces the Power-Delay Product (PDP) by up to 58.5% with a 0.42% normalized mean error distance. Moreover, a case study on image processing applications is also investigated: our proposed multipliers achieve a high peak signal-to-noise ratio of 51.87 dB. Compared to the state of the art, the proposed multiplier has better overall performance in accuracy, area, and power consumption.
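The normalized mean error distance (NMED) metric cited above can be computed exhaustively for an 8-bit multiplier. The sketch below uses a generic truncation-based approximation as a stand-in, not the paper's hybrid compressor design:

```python
# Sketch of the NMED accuracy metric for approximate multipliers. The
# approximation here (dropping the low partial products of one operand) is a
# generic example, not the paper's region-wise compressor scheme.

def approx_mul(a, b, drop_bits=4):
    # Zero the low `drop_bits` bits of `a`, i.e. discard the smallest
    # partial-product rows before multiplying.
    return ((a >> drop_bits) * b) << drop_bits

def nmed(bits=8, drop_bits=4):
    """Mean |exact - approx| over all input pairs, normalized by max output."""
    max_out = (2**bits - 1) ** 2
    total = count = 0
    for a in range(2**bits):
        for b in range(2**bits):
            total += abs(a * b - approx_mul(a, b, drop_bits))
            count += 1
    return total / count / max_out

assert approx_mul(255, 255, drop_bits=0) == 255 * 255  # exact when nothing dropped
assert 0 < nmed() < 0.02  # small relative error despite dropped partial products
```

Exhaustive enumeration is feasible for 8-bit operands (65,536 pairs), which is why NMED is a standard figure of merit for small approximate multipliers.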
Citations: 3
MetaTableLite: An Efficient Metadata Management Scheme for Tagged-Pointer-Based Spatial Safety
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00042
Dongwei Chen, Dong Tong, Chun Yang, Xu Cheng
A tagged-pointer-based memory spatial safety protection system utilizes the unused bits in a pointer to store the boundary information of an object. This paper proposes a hybrid metadata management scheme, MetaTableLite, for tagged-pointer-based protections. We observed that large objects account for only a small fraction of all objects in a program, yet recording their boundary metadata with traditional pointer tags incurs large memory overheads. Based on this observation, we introduce a small supplementary table to maintain metadata for these few large objects. For small objects, MetaTableLite represents boundaries with a 14-bit pointer tag, making good use of the 16 unused bits in a conventional 64-bit pointer. MetaTableLite achieves a 6% average memory overhead without altering the conventional pointer representation.
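The 14-bit tag relies on the upper 16 bits of a 64-bit pointer being unused on common platforms with 48-bit virtual addresses. A sketch with illustrative bit positions, not MetaTableLite's exact encoding:

```python
# Sketch of packing a metadata tag into the unused upper bits of a 64-bit
# pointer. Bit layout is illustrative (48-bit address, 14-bit tag in the top
# 16 bits); MetaTableLite's actual boundary encoding is not reproduced here.

TAG_BITS = 14
TAG_SHIFT = 48                       # top 16 bits are unused with 48-bit VAs
ADDR_MASK = (1 << TAG_SHIFT) - 1
TAG_MASK = (1 << TAG_BITS) - 1

def pack(addr, tag):
    assert addr <= ADDR_MASK and tag <= TAG_MASK
    return (tag << TAG_SHIFT) | addr

def unpack(ptr):
    # Recover the raw address and the boundary tag from a tagged pointer.
    return ptr & ADDR_MASK, (ptr >> TAG_SHIFT) & TAG_MASK

ptr = pack(0x7FFE_DEAD_BEEF, tag=0x2A5)
addr, tag = unpack(ptr)
assert addr == 0x7FFE_DEAD_BEEF and tag == 0x2A5
```

On a real system the tag must be masked off (or ignored by the hardware) before dereferencing; the supplementary table then handles the minority of objects whose bounds do not fit in 14 bits.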
Citations: 1
UH-JLS: A Parallel Ultra-High Throughput JPEG-LS Encoding Architecture for Lossless Image Compression
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00060
Xuan Wang, Lei Gong, Chao Wang, Xi Li, Xuehai Zhou
Lossless image compression is of great value in distortion-sensitive applications. JPEG-LS, a mature lossless compression standard, is widely adopted for its excellent compression ratio. Many hardware JPEG-LS compressors have been proposed on FPGAs and ASICs to achieve high energy efficiency and low cost. However, JPEG-LS has a contextual read-after-write (RAW) dependency, so previous hardware designs either insufficiently explore its parallelism potential or introduce other defects while parallelizing, such as compression-ratio drops and compatibility problems. In this paper, we propose a hardware/software co-design method for high-performance JPEG-LS compressor design. At the software level, we propose a pixel grouping scheduling scheme and the Pseudo-LS method to extract parallelism despite the RAW issue. At the hardware level, we discuss high-performance design methods for these software-level schemes and propose a design space exploration method to constrain the resource usage introduced by parallelization. To our knowledge, our architecture, UH-JLS, is the first pixel-level-parallel streaming image compressor based on the standard JPEG-LS. The experiments show that, in the lossless manner and the Pseudo-LS manner, UH-JLS respectively achieves 5.6x and 7.1x speedup over the previous state-of-the-art FPGA-based JPEG-LS compressor.
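The contextual RAW dependency stems from JPEG-LS context modeling: each pixel's prediction reads neighbors that must already have been encoded. The standard median edge detector (MED) predictor from JPEG-LS makes the dependency concrete:

```python
# The JPEG-LS median edge detector (MED) predictor. For the current pixel it
# reads a (left), b (above), and c (above-left) -- all pixels that must already
# be reconstructed, which is the read-after-write dependency that limits naive
# parallelization of the encoder.

def med_predict(a, b, c):
    if c >= max(a, b):
        return min(a, b)   # likely a horizontal/vertical edge: clamp low
    if c <= min(a, b):
        return max(a, b)   # edge in the other direction: clamp high
    return a + b - c       # smooth region: planar prediction

# Flat region: prediction matches the neighborhood.
assert med_predict(10, 10, 10) == 10
# c == b > a suggests a vertical edge, so the predictor follows a.
assert med_predict(5, 20, 20) == 5
# Smooth gradient: planar extrapolation a + b - c.
assert med_predict(4, 6, 5) == 5
```

Because the context state (and the MED inputs) of one pixel depends on its just-encoded neighbors, independent pixels must be found or the dependency restructured, which is what the pixel-grouping schedule and Pseudo-LS target.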
Citations: 1
WAS-Deletion: Workload-Aware Secure Deletion Scheme for Solid-State Drives
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00047
Bingzhe Li, D. Du
Due to the intrinsic properties of Solid-State Drives (SSDs), invalid data remain in an SSD until erased by a garbage-collection process, which increases the risk of attack by adversaries. Previous studies use erase- and cryptography-based schemes to purposely delete target data but incur extremely large overheads. In this paper, we propose a Workload-Aware Secure Deletion scheme, called WAS-Deletion, which reduces the overhead of secure deletion through three major components. First, WAS-Deletion efficiently splits invalid and valid data into different blocks based on workload characteristics. Second, WAS-Deletion uses a new encryption allocation scheme, making encryption follow the same direction as writes across multiple blocks and vertically encrypting pages with the same key within one block. Finally, a new adaptive scheduling scheme dynamically changes the configurations of different regions to further reduce secure-deletion overhead based on the current workload. The experimental results indicate that the newly proposed WAS-Deletion scheme reduces the secure-deletion cost by about 1.2x to 12.9x compared to previous studies.
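The benefit of the first component, separating soon-to-be-invalid (hot) pages from stable (cold) pages, can be sketched with a toy placement model; the hot/cold classifier and block layout here are illustrative only, not WAS-Deletion's actual policy:

```python
# Toy model of hot/cold page separation for secure deletion. Clustering the
# pages that will need secure erasure into the same flash blocks means fewer
# blocks must be erased. Page counts and the classifier are invented.

BLOCK_PAGES = 4

def place(pages, separate):
    """pages: list of (page_id, is_hot). Returns blocks as lists of page ids."""
    if separate:
        hot = [pid for pid, h in pages if h]
        cold = [pid for pid, h in pages if not h]
        ordered = hot + cold            # hot pages fill their own blocks
    else:
        ordered = [pid for pid, _ in pages]  # pages written in arrival order
    return [ordered[i:i + BLOCK_PAGES]
            for i in range(0, len(ordered), BLOCK_PAGES)]

pages = [(i, i % 2 == 0) for i in range(8)]   # even-numbered pages are "hot"
hot_ids = {pid for pid, h in pages if h}

def blocks_to_erase(blocks):
    # Securely deleting all hot pages requires erasing every block holding one.
    return sum(1 for blk in blocks if any(p in hot_ids for p in blk))

assert blocks_to_erase(place(pages, separate=False)) == 2
assert blocks_to_erase(place(pages, separate=True)) == 1
```

With interleaved placement every block contains stale data and must be erased (forcing valid-data migration); with separation, the erasures concentrate on half as many blocks.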
由于固态硬盘(ssd)的固有属性,在被垃圾收集进程擦除之前,无效数据会保留在ssd中,这增加了被攻击者攻击的风险。以往的研究使用基于擦除和加密的方案来有目的地删除目标数据,但面临着极大的开销。在本文中,我们提出了一种工作负载感知的安全删除方案,称为was - delete,以减少安全删除的开销,主要包括三个部分。首先,基于工作负载特征,高效地将无效数据和有效数据分割成不同的数据块。其次,was - delete方案使用新的加密分配方案,使加密遵循与多个块上的写入相同的方向,并垂直加密一个块中具有相同密钥的页面。最后,提出了一种新的自适应调度方案,可以根据当前工作负载动态改变不同区域的配置,进一步降低安全删除开销。实验结果表明,新提出的WAS-Deletion方案比现有方案的安全删除成本降低约1.2 ~ 12.9倍。
Citations: 1
Stochastic-HD: Leveraging Stochastic Computing on Hyper-Dimensional Computing
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00058
Yilun Hao, Saransh Gupta, Justin Morris, Behnam Khaleghi, Baris Aksanli, T. Simunic
Brain-inspired Hyperdimensional (HD) computing is a novel and efficient computing paradigm that is more hardware-friendly than traditional machine learning algorithms; however, the latest encoding and similarity-checking schemes still require thousands of operations. To further reduce the hardware cost of HD computing, we present Stochastic-HD, which combines the simplicity of operations in Stochastic Computing (SC) with the complex task-solving capabilities of the latest HD computing algorithms. Stochastic-HD leverages deterministic SC, which uses structured input binary bitstreams instead of the traditional randomly generated bitstreams and thus avoids expensive SC components such as stochastic number generators. We also propose an in-memory hardware design for Stochastic-HD that exploits its high degree of parallelism and robustness to approximation. Our hardware uses in-memory bitwise operations along with associative-memory-like operations to enable a fast and energy-efficient implementation. With Stochastic-HD, we reach accuracy comparable to the HD baseline. Compared to the best PIM design for HD [1], Stochastic-HD is also 4.4% more accurate and 43.1× more energy-efficient.
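Deterministic SC with structured bitstreams, which the abstract credits with avoiding stochastic number generators, can be illustrated with a clock-division multiplier: each bit of one unary stream is held while the other stream repeats, and an AND gate yields the exact product. This is only a minimal sketch of the general deterministic-SC technique under assumed names (`unary_stream`, `sc_multiply`), not the paper's design.

```python
def unary_stream(ones, length):
    # Structured (unary) bitstream encoding the value ones/length:
    # all 1s first, then 0s -- no random number generator needed.
    return [1] * ones + [0] * (length - ones)

def sc_multiply(a_bits, b_bits):
    # Clock-division deterministic SC multiply: hold each bit of A for
    # len(B) cycles while B repeats, then AND the two streams. The output
    # stream's value is exactly (a/Na) * (b/Nb) over Na*Nb cycles.
    na, nb = len(a_bits), len(b_bits)
    return [a_bits[i // nb] & b_bits[i % nb] for i in range(na * nb)]

def stream_value(bits):
    # A bitstream's value is its fraction of 1s.
    return sum(bits) / len(bits)

a = unary_stream(3, 4)            # encodes 0.75
b = unary_stream(1, 2)            # encodes 0.5
prod = sc_multiply(a, b)
assert stream_value(prod) == 0.375  # exact product, deterministically
```

Because the streams are structured rather than random, the result is exact rather than a statistical estimate, which is the property that lets deterministic SC drop the stochastic number generators.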
Citations: 4