Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00053
Jiaqi Zhang, Xiangru Chen, S. Ray
DNN workloads and accelerators are becoming increasingly heterogeneous and dynamic. Existing DNN acceleration solutions fail to address these challenges because they rely either on complicated ad hoc mapping or on clumsy exhaustive search. To this end, this paper first proposes a formalization model that can comprehensively describe the accelerator design space. Instead of enforcing certain customized dataflows, the proposed model explicitly captures the intrinsic hardware functions of a given accelerator. We connect these functions with the data-reuse opportunities of the DNN computation and build a correspondence between DNN loop blocking and accelerator constraints. Based on this, we implement an algorithm that efficiently and effectively performs universal loop blocking for various DNNs and accelerators without manual specification. The evaluation shows 2.1x speedup and 1.5x energy efficiency over a dataflow-defined algorithm, as well as significantly lower blocking latency than search-based methods.
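Loop blocking under a buffer constraint can be illustrated with a deliberately naive exhaustive tiling search, of the kind the paper's real-time algorithm improves upon. All names below are hypothetical; this is not the paper's formalization:

```python
# Naive exhaustive tiling search for C[M,N] += A[M,K] * B[K,N]:
# pick block sizes so one tile of A, B, and C fits the on-chip buffer.

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def tile_footprint(tm, tn, tk):
    # on-chip words for one tile of A (tm x tk), B (tk x tn), C (tm x tn)
    return tm * tk + tk * tn + tm * tn

def best_blocking(M, N, K, buffer_words):
    """Largest legal tile footprint maximizes data reuse per buffer fill."""
    best = None
    for tm in divisors(M):
        for tn in divisors(N):
            for tk in divisors(K):
                fp = tile_footprint(tm, tn, tk)
                if fp <= buffer_words and (best is None or fp > best[0]):
                    best = (fp, (tm, tn, tk))
    return best
```

The triple-nested divisor enumeration is exactly the kind of search whose latency the paper's constraint-driven blocking avoids.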
"Universal Neural Network Acceleration via Real-Time Loop Blocking," 2021 IEEE 39th International Conference on Computer Design (ICCD).
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00054
Han Zhao, Weihao Cui, Quan Chen, Jieru Zhao, Jingwen Leng, M. Guo
Emerging GPUs have multiple Streaming Multiprocessors (SMs), each comprising both CUDA Cores and Tensor Cores. While CUDA Cores handle general computation, Tensor Cores are designed to speed up matrix multiplication for deep learning applications. However, a GPU kernel typically uses either CUDA Cores or Tensor Cores, leaving the other processing units idle. Although much prior work co-locates kernels to improve GPU utilization, it cannot leverage intra-SM CUDA Core-Tensor Core parallelism. We therefore propose Plasticine, which exploits this intra-SM parallelism to maximize GPU throughput, combining compilation and runtime scheduling to achieve this goal. Experimental results on an Nvidia 2080Ti GPU show that Plasticine improves system-wide throughput by 15.3% compared with prior co-location work.
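The benefit of overlapping the two unit types can be seen with a toy latency model (illustrative assumptions, not Plasticine's scheduler): serial execution pays the sum of the two kernels' times while one unit type idles, whereas co-location pays roughly the max plus a small assumed overhead:

```python
# Toy latency model: kernel A is CUDA-Core-bound, kernel B is
# Tensor-Core-bound. Times and the 5% overhead are assumed numbers.

def serial_time(cuda_ms, tensor_ms):
    return cuda_ms + tensor_ms              # one kernel at a time

def colocated_time(cuda_ms, tensor_ms, overhead=0.05):
    # both unit types busy at once, plus an assumed scheduling overhead
    return max(cuda_ms, tensor_ms) * (1 + overhead)
```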
"Exploiting Intra-SM Parallelism in GPUs via Persistent and Elastic Blocks," 2021 IEEE 39th International Conference on Computer Design (ICCD).
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00050
Yi-Jou Hsiao, Chin-Fu Nien, Hsiang-Yun Cheng
Sparse matrix-vector multiplication (SpMV) is a crucial operation in several key application domains of the big-data era, such as graph analytics and scientific computing. In conventional von Neumann systems, SpMV performance is bounded by data transmission across memory channels. Emerging metal-oxide resistive random access memory (ReRAM) can address this memory-wall challenge by performing SpMV directly within its crossbar arrays. However, due to the tightly coupled crossbar structure, such ReRAM-based processing-in-memory architectures cannot skip all of the redundant data loading and computation associated with zero-valued entries of the sparse matrix, and these unnecessary ReRAM writes and computations hurt energy efficiency. Since only crossbar-sized sub-matrices consisting entirely of zeros can be skipped, prior studies have proposed matrix-reordering methods that aggregate non-zero entries into a few crossbar arrays so that more all-zero crossbar arrays can be skipped. Nevertheless, the effectiveness of prior reordering methods is constrained by the original ordering of matrix rows. In this paper, we show that the number of all-zero sub-matrices derived by these prior studies is less than a theoretical lower bound in some cases, indicating that there is still room for improvement. Hence, we propose a novel reordering algorithm, ReSpar, which aggregates matrix rows with similar non-zero column entries and concentrates the non-zero columns to increase zero-skipping opportunities. Results show that ReSpar achieves 1.68× and 1.37× more energy savings, while reducing the required number of crossbar loads by 40.4% and 27.2% on average.
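The zero-skipping idea can be sketched with a toy greedy reordering (illustrative only, not ReSpar's actual algorithm): placing rows with similar non-zero column sets adjacently makes more crossbar-sized tiles all-zero, so they can be skipped:

```python
# Rows are represented as sets of non-zero column indices.

def count_zero_tiles(rows, n_cols, tile):
    """Count crossbar-sized (tile x tile) sub-matrices that are all-zero."""
    zero = 0
    for r0 in range(0, len(rows), tile):
        band = rows[r0:r0 + tile]
        for c0 in range(0, n_cols, tile):
            if all(not any(c0 <= c < c0 + tile for c in row) for row in band):
                zero += 1
    return zero

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def greedy_reorder(rows):
    """Start from row 0; repeatedly append the most similar remaining row."""
    remaining = list(range(len(rows)))
    order = [remaining.pop(0)]
    while remaining:
        last = rows[order[-1]]
        nxt = max(remaining, key=lambda i: jaccard(rows[i], last))
        remaining.remove(nxt)
        order.append(nxt)
    return [rows[i] for i in order]
```

On a small example with rows {0,1}, {4,5}, {0,2}, {4,6} and 2x2 tiles, grouping the similar rows doubles the number of skippable all-zero tiles.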
"ReSpar: Reordering Algorithm for ReRAM-based Sparse Matrix-Vector Multiplication Accelerator," 2021 IEEE 39th International Conference on Computer Design (ICCD).
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00041
Yohan Ko, Hwisoo So, Jinhyo Jung, Kyoungwoo Lee, Aviral Shrivastava
With technology scaling, reliability against soft errors is becoming an important design concern for modern embedded systems. To avoid the high cost and performance overheads of full protection techniques, several studies have turned to selective protection, which increases the need to accurately identify the most vulnerable components or instructions in a system. In this paper, we analyze the vulnerability of a system from both the hardware and software perspectives through intensive fault-injection trials. From the hardware perspective, we find the most vulnerable hardware components by calculating component-wise failure rates. From the software perspective, we identify the most vulnerable instructions using a novel root-cause instruction analysis. Our results show that the failure rate of a system can be reduced to only 12.40% with minimal protection.
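Component-wise failure rates from a fault-injection campaign reduce to simple bookkeeping; a minimal sketch with made-up component names and counts (not the paper's data):

```python
def failure_rates(trials):
    """trials: {component: (failures, injections)} -> list sorted by rate,
    most vulnerable first -- the natural targets for selective protection."""
    return sorted(((comp, f / n) for comp, (f, n) in trials.items()),
                  key=lambda kv: kv[1], reverse=True)

# Hypothetical campaign results: 1000 injections per component.
campaign = {"register_file": (120, 1000),
            "rob": (30, 1000),
            "alu": (5, 1000)}
```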
"Comprehensive Failure Analysis against Soft Errors from Hardware and Software Perspectives," 2021 IEEE 39th International Conference on Computer Design (ICCD).
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00057
Hanyeoreum Bae, Miryeong Kwon, Donghyun Gouk, Sanghyun Han, Sungjoon Koh, Changrim Lee, Dongchul Park, Myoungsoo Jung
We investigate runtime-environment characteristics and explore the challenges of conventional in-memory graph processing. This system-level analysis includes empirical results and observations that run counter to the existing expectations of graph-application users. Specifically, since raw graph data differ from the in-memory graph representation, processing a billion-scale graph exhausts all system resources and makes the target system unavailable due to out-of-memory conditions at runtime. To address this lack of memory space for large-scale graph analysis, we configure real persistent memory devices (PMEMs) with different operation modes and system-software frameworks. In this work, we introduce PMEM to a representative in-memory graph system, Ligra, and perform an in-depth analysis uncovering the performance behaviors of different PMEM-applied in-memory graph systems. Based on our observations, we modify Ligra to improve graph-processing performance with a solid level of data persistence. Our evaluation reveals that Ligra, with our simple modification, exhibits 4.41× and 3.01× better performance than the original Ligra running on a virtual memory expansion and conventional persistent memory, respectively.
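The gap between a raw graph file and its in-memory footprint can be approximated with back-of-envelope arithmetic. The multipliers below are assumed for illustration, not the paper's measurements; the point is that a framework like Ligra keeps CSR for out-edges, its transpose for in-edges, and per-vertex runtime arrays, so a file that "fits" can still exhaust DRAM:

```python
def raw_edge_list_bytes(n_edges, bytes_per_id=8):
    return 2 * n_edges * bytes_per_id            # (src, dst) pairs on disk

def in_memory_bytes(n_vertices, n_edges, bytes_per_id=8):
    csr = (n_vertices + 1 + n_edges) * bytes_per_id   # out-edge CSR
    csc = csr                                         # in-edge transpose
    per_vertex = 2 * n_vertices * bytes_per_id        # frontiers, properties
    return csr + csc + per_vertex
```

For a billion-edge graph with 50M vertices under these assumptions, the in-memory footprint already exceeds the raw 16 GB edge list.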
"Empirical Guide to Use of Persistent Memory for Large-Scale In-Memory Graph Analysis," 2021 IEEE 39th International Conference on Computer Design (ICCD).
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00028
Shangshang Yao, L. Zhang, Qiong Wang, Libin Shen
Approximate computing has been widely used in many fault-tolerant applications. As multiplication is a key kernel in such applications, improving the efficiency of approximate multipliers is essential for high computational performance. This paper proposes a novel approximate multiplier design that uses different compressors for different regions of the partial products. We design two Preprocessing Units (PUs) that explore the best efficiency by increasing the number of sparse partial products. Multiple 8-bit multipliers are designed in Verilog and synthesized in a 45-nm CMOS technology. Compared with the conventional Wallace-tree multiplier, experimental results indicate that one of our proposed multipliers reduces the Power-Delay Product (PDP) by up to 58.5% with a 0.42% normalized mean error distance. Moreover, a case study of image-processing applications is also investigated: our proposed multipliers achieve a high peak signal-to-noise ratio of 51.87 dB. Compared to the state of the art, the proposed multiplier offers better overall performance in accuracy, area, and power consumption.
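The trade-off an approximate compressor makes can be illustrated with a toy approximate 4:2 compressor (an assumed design for illustration, not the paper's compressors or PUs), measuring its mean error distance exhaustively over all 16 inputs:

```python
from itertools import product

def exact_4to2(bits):
    return sum(bits)                      # value 0..4 a 4:2 compressor encodes

def approx_4to2(x1, x2, x3, x4):
    # Toy approximation: OR together the pairwise half-adder outputs,
    # which under-counts the 1+1 and 2+2 cases. Assumed design.
    s = (x1 ^ x2) | (x3 ^ x4)
    c = (x1 & x2) | (x3 & x4)
    return 2 * c + s

errs = [abs(exact_4to2(b) - approx_4to2(*b)) for b in product((0, 1), repeat=4)]
mean_ed = sum(errs) / len(errs)           # mean error distance over all inputs
```

Exhaustive enumeration like this is how error metrics such as normalized mean error distance are typically computed for small compressors.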
"An Efficient Hybrid Parallel Compression Approximate Multiplier," 2021 IEEE 39th International Conference on Computer Design (ICCD).
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00042
Dongwei Chen, Dong Tong, Chun Yang, Xu Cheng
A tagged-pointer-based memory spatial-safety protection system utilizes the unused bits in a pointer to store the boundary information of an object. This paper proposes a hybrid metadata-management scheme, MetaTableLite, for tagged-pointer-based protections. We observe that large objects account for only a small fraction of all objects in a program, yet recording their boundary metadata with traditional pointer tags incurs large memory overheads. Based on this observation, we introduce a small supplementary table to maintain metadata for these few large objects. For small objects, MetaTableLite represents their boundaries with a 14-bit pointer tag, making good use of the 16 unused bits in a conventional 64-bit pointer. MetaTableLite achieves a 6% average memory overhead without altering the conventional pointer representation.
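The split between in-pointer tags for small objects and a side table for the few large ones can be sketched as follows. The bit layout and sentinel value are assumptions for illustration, not MetaTableLite's exact encoding:

```python
TAG_SHIFT = 48                        # tag lives in the pointer's high bits
TAG_MASK = (1 << 14) - 1              # 14-bit tag
ADDR_MASK = (1 << TAG_SHIFT) - 1
LARGE = TAG_MASK                      # sentinel tag: consult the side table

side_table = {}                       # address -> size, for large objects only

def tag_pointer(addr, size):
    if size < LARGE:
        tag = size                    # small object: size fits in the tag
    else:
        tag = LARGE                   # large object: spill to side table
        side_table[addr] = size
    return (tag << TAG_SHIFT) | addr

def object_size(ptr):
    addr, tag = ptr & ADDR_MASK, (ptr >> TAG_SHIFT) & TAG_MASK
    return side_table[addr] if tag == LARGE else tag
```

Because large objects are rare, the side table stays small while every bounds check on a small object needs only bit operations on the pointer itself.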
"MetaTableLite: An Efficient Metadata Management Scheme for Tagged-Pointer-Based Spatial Safety," 2021 IEEE 39th International Conference on Computer Design (ICCD).
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00060
Xuan Wang, Lei Gong, Chao Wang, Xi Li, Xuehai Zhou
Lossless image compression is of great value in distortion-sensitive applications. JPEG-LS, a mature lossless compression standard, is widely adopted for its excellent compression ratio, and many hardware JPEG-LS compressors have been proposed on FPGAs and ASICs to achieve high energy efficiency at low cost. However, JPEG-LS has a contextual Read-After-Write (RAW) hazard, which forces previous hardware either to under-exploit its parallelism potential or to introduce other defects while parallelizing, such as compression-ratio degradation and compatibility problems. In this paper, we propose a hardware/software co-design method for high-performance JPEG-LS compressors. At the software level, we propose a pixel-grouping scheduling scheme and the Pseudo-LS method to extract parallelism despite the RAW hazard. At the hardware level, we discuss high-performance designs for these software-level schemes and propose a design-space exploration method to constrain the resource usage introduced by parallelization. To our knowledge, our architecture, UH-JLS, is the first pixel-level-parallel streaming image compressor based on the standard JPEG-LS. Experiments show that, in the lossless and Pseudo-LS modes respectively, UH-JLS achieves 5.6x and 7.1x speedup over the previous state-of-the-art FPGA-based JPEG-LS compressor.
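One way to expose pixel-level parallelism despite causal neighbor dependencies is anti-diagonal wavefront grouping. The sketch below is an assumed illustrative scheme, not necessarily UH-JLS's pixel-grouping method: since a JPEG-LS pixel at (r, c) depends on neighbors (r, c-1), (r-1, c-1), (r-1, c), and (r-1, c+1), grouping by the index 2*r + c places every dependency in a strictly earlier group:

```python
def wavefront_groups(height, width):
    """Partition pixels into groups processable in parallel, in group order."""
    groups = {}
    for r in range(height):
        for c in range(width):
            groups.setdefault(2 * r + c, []).append((r, c))
    return [groups[k] for k in sorted(groups)]
```

All pixels within one group can then be encoded concurrently, since each one's causal context was finalized by earlier groups.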
"UH-JLS: A Parallel Ultra-High Throughput JPEG-LS Encoding Architecture for Lossless Image Compression," 2021 IEEE 39th International Conference on Computer Design (ICCD).
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00047
Bingzhe Li, D. Du
Due to the intrinsic properties of Solid-State Drives (SSDs), invalid data remain in an SSD until erased by the garbage-collection process, which increases the risk of attack by adversaries. Previous studies use erase-based and cryptography-based schemes to purposely delete target data but face extremely large overheads. In this paper, we propose a Workload-Aware Secure Deletion scheme, called WAS-Deletion, that reduces the overhead of secure deletion through three major components. First, WAS-Deletion efficiently separates invalid and valid data into different blocks based on workload characteristics. Second, WAS-Deletion uses a new encryption-allocation scheme that makes encryption follow the same direction as writes across multiple blocks and vertically encrypts the pages of one block with the same key. Finally, a new adaptive scheduling scheme dynamically changes the configurations of different regions to further reduce secure-deletion overhead under the current workload. Experimental results indicate that WAS-Deletion reduces the secure-deletion cost by about 1.2x to 12.9x compared to previous studies.
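The cryptography-based side of such schemes rests on crypto-erase: give each flash block its own key, and securely deleting the block's data only requires destroying that key, not rewriting flash. A minimal bookkeeping sketch (toy XOR stream cipher standing in for real encryption; not WAS-Deletion's full design):

```python
import hashlib
import os

class BlockKeys:
    """Per-block keys: secure deletion = key destruction (crypto-erase)."""

    def __init__(self):
        self.keys = {}

    def _keystream(self, block, length):
        key = self.keys.setdefault(block, os.urandom(32))
        stream, counter = b"", 0
        while len(stream) < length:
            stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return stream[:length]

    def encrypt_page(self, block, data):
        # XOR stream cipher: the same call decrypts. A stand-in for real AES.
        ks = self._keystream(block, len(data))
        return bytes(a ^ b for a, b in zip(data, ks))

    def secure_delete(self, block):
        self.keys.pop(block, None)  # block's ciphertext is now unrecoverable
```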
"WAS-Deletion: Workload-Aware Secure Deletion Scheme for Solid-State Drives," 2021 IEEE 39th International Conference on Computer Design (ICCD).
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00058
Yilun Hao, Saransh Gupta, Justin Morris, Behnam Khaleghi, Baris Aksanli, T. Simunic
Brain-inspired Hyperdimensional (HD) computing is a novel, efficient computing paradigm that is more hardware-friendly than traditional machine-learning algorithms; however, the latest encoding and similarity-checking schemes still require thousands of operations. To further reduce the hardware cost of HD computing, we present Stochastic-HD, which combines the simplicity of operations in Stochastic Computing (SC) with the complex task-solving capabilities of the latest HD computing algorithms. Stochastic-HD leverages deterministic SC, which uses structured input binary bitstreams instead of the traditional randomly generated bitstreams and thus avoids expensive SC components such as stochastic number generators. We also propose an in-memory hardware design for Stochastic-HD that exploits its high level of parallelism and robustness to approximation. Our hardware uses in-memory bitwise operations along with associative-memory-like operations to enable a fast and energy-efficient implementation. Stochastic-HD reaches accuracy comparable to Baseline-HD; compared to the best PIM design for HD [1], it is also 4.4% more accurate and 43.1× more energy-efficient.
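Deterministic SC with structured unary bitstreams multiplies values exactly with a single AND gate per bit, and needs no stochastic number generator. A minimal sketch (illustrative encoding, not Stochastic-HD's hardware):

```python
def unary(value, n):
    """Structured (deterministic) unary bitstream: first round(value*n) bits
    are 1 -- no random number generator involved."""
    ones = round(value * n)
    return [1] * ones + [0] * (n - ones)

def sc_multiply(a_bits, b_bits):
    """Deterministic SC product: repeat stream A while holding each bit of B,
    AND the streams, and read the product off as the density of ones."""
    n, m = len(a_bits), len(b_bits)
    out = [a_bits[i % n] & b_bits[i // n] for i in range(n * m)]
    return sum(out) / (n * m)
```

Because the ones counts are exact, the result is exact whenever value*n is an integer, unlike random-bitstream SC, whose accuracy depends on stream length and correlation.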
"Stochastic-HD: Leveraging Stochastic Computing on Hyper-Dimensional Computing," 2021 IEEE 39th International Conference on Computer Design (ICCD).