Block-LSM: An Ether-aware Block-ordered LSM-tree based Key-Value Storage Engine
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00017
Zehao Chen, Bingzhe Li, Xiaojun Cai, Zhiping Jia, Zhaoyan Shen, Yi Wang, Z. Shao
Ethereum, as one of the largest blockchain systems, plays an important role in distributed ledgers, database systems, and related applications. As more and more blocks are mined, the storage burden of Ethereum increases significantly. The current Ethereum system uniformly transforms all of its data into key-value (KV) items and stores them in the underlying Log-Structured Merge-tree (LSM-tree) storage engine, ignoring the software semantics. Consequently, it not only exacerbates the write amplification of the storage engine but also hurts the performance of Ethereum. In this paper, we propose a new Ethereum-aware storage model called Block-LSM, which significantly improves the data synchronization of the Ethereum system. Specifically, we first design a shared-prefix scheme that transforms Ethereum data into ordered KV pairs to alleviate the key-range overlaps between different levels of the underlying LSM-tree based storage engine. Moreover, we propose to maintain several semantics-oriented memory buffers to isolate different kinds of Ethereum data. To save space overhead, Block-LSM further aggregates multiple blocks into a group and assigns the same prefix to all KV items from the same block group. Finally, we implement Block-LSM in a real Ethereum environment and conduct a series of experiments. The evaluation results show that Block-LSM reduces storage write amplification by up to 3.7× and increases throughput by 3× compared with the original Ethereum design.
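The shared-prefix idea lends itself to a short sketch. Below is a minimal illustration (the grouping factor and key names are hypothetical, not the paper's code) of how a fixed-width block-group prefix keeps KV items in block order under the byte-wise key comparison used by LSM-tree engines such as LevelDB:

```python
# Illustrative sketch (not the paper's implementation): prepend a
# monotonically increasing block-group prefix so KV items from nearby
# blocks sort together, keeping key ranges of adjacent LSM-tree levels
# largely disjoint.

BLOCKS_PER_GROUP = 64  # hypothetical grouping factor

def make_key(block_number: int, raw_key: bytes) -> bytes:
    """Prefix a raw Ethereum key with its block-group ID."""
    group_id = block_number // BLOCKS_PER_GROUP
    # A fixed-width big-endian prefix preserves numeric ordering
    # under byte-wise key comparison.
    return group_id.to_bytes(8, "big") + raw_key

# Example: keys from blocks 0-63 share one prefix, 64-127 the next.
k1 = make_key(10, b"state:0xabc")
k2 = make_key(70, b"state:0xdef")
assert k1 < k2  # group ordering dominates the comparison
```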
{"title":"Block-LSM: An Ether-aware Block-ordered LSM-tree based Key-Value Storage Engine","authors":"Zehao Chen, Bingzhe Li, Xiaojun Cai, Zhiping Jia, Zhaoyan Shen, Yi Wang, Z. Shao","doi":"10.1109/ICCD53106.2021.00017","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00017","url":null,"abstract":"Ethereum as one of the largest blockchain systems plays an important role in the distributed ledger, database systems, etc. As more and more blocks are mined, the storage burden of Ethereum is significantly increased. The current Ethereum system uniformly transforms all its data into key-value (KV) items and stores them to the underlying Log-Structure Merged tree (LSM-tree) storage engine ignoring the software semantics. Consequently, it not only exacerbates the write amplification effect of the storage engine but also hurts the performance of Ethereum. In this paper, we proposed a new Ethereum-aware storage model called Block-LSM, which significantly improves the data synchronization of the Ethereum system. Specifically, we first design a shared prefix scheme to transform Ethereum data into ordered KV pairs to alleviate the key range overlaps of different levels in the underlying LSM-tree based storage engine. Moreover, we propose to maintain several semantic-orientated memory buffers to isolate different kinds of Ethereum data. To save space overhead, Block-LSM further aggregates multiple blocks into a group and assigns the same prefix to all KV items from the same block group. Finally, we implement Block-LSM in the real Ethereum environment and conduct a series of experiments. The evaluation results show that Block-LSM significantly reduces up to 3.7× storage write amplification and increases throughput by 3× compared with the original Ethereum design.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114309622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Model Synthesis for Communication Traces of System Designs
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00082
Hao Zheng, Md Rubel Ahmed, P. Mukherjee, M. Ketkar, Jin Yang
Concise and abstract models of system-level behaviors are invaluable in design analysis, testing, and validation. In this paper, we consider the problem of inferring models from communication traces of system-on-chip (SoC) designs. The traces capture communications among different blocks of a system design in terms of the messages exchanged. The extracted models characterize the system-level communication protocols governing how blocks exchange messages and coordinate with each other to realize various system functions. We formulate the above problem as a constraint satisfaction problem, which is then fed to a satisfiability modulo theories (SMT) solver. The solutions returned by the SMT solver are used to extract models that accept the input traces. In the experiments, we demonstrate the proposed approach on traces collected from a transaction-level simulation model of a multicore SoC design and on a trace of a more detailed multicore SoC modeled in gem5.
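The encoding admits a compact illustration. The toy formulation below (using the Z3 SMT solver, with a hypothetical message alphabet and a fixed model size; not the paper's exact constraints) searches for a small transition function under which every observed trace is accepted:

```python
# Minimal sketch of trace-accepting model synthesis with Z3: find a
# 3-state transition function such that each trace starts and ends in
# state 0 and follows consistent transitions in between.
from z3 import Int, Solver, And, Or, sat

N_STATES = 3                       # assumed model size
ALPHABET = ["req", "gnt", "rel"]   # hypothetical message types
traces = [["req", "gnt", "rel"],
          ["req", "gnt", "req", "gnt", "rel"]]

# delta[(s, a)] is the successor of state s on message a.
delta = {(s, a): Int(f"d_{s}_{a}") for s in range(N_STATES) for a in ALPHABET}
solver = Solver()
for v in delta.values():
    solver.add(And(v >= 0, v < N_STATES))

for t, trace in enumerate(traces):
    # State variables along this trace; accept = return to state 0.
    st = [Int(f"s_{t}_{i}") for i in range(len(trace) + 1)]
    solver.add(st[0] == 0, st[-1] == 0)
    for i, msg in enumerate(trace):
        # st[i+1] equals delta(st[i], msg) for whichever state st[i] is.
        solver.add(Or(*[And(st[i] == s, st[i + 1] == delta[(s, msg)])
                        for s in range(N_STATES)]))

if solver.check() == sat:
    m = solver.model()
    for (s, a), v in sorted(delta.items()):
        print(f"delta({s}, {a}) = {m[v]}")
```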
{"title":"Model Synthesis for Communication Traces of System Designs","authors":"Hao Zheng, Md Rubel Ahmed, P. Mukherjee, M. Ketkar, Jin Yang","doi":"10.1109/ICCD53106.2021.00082","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00082","url":null,"abstract":"Concise and abstract models of system-level behaviors are invaluable in design analysis, testing, and validation. In this paper, we consider the problem of inferring models from communication traces of system-on-chip (SoC) designs. The traces capture communications among different blocks of a system design in terms of messages exchanged. The extracted models characterize the system-level communication protocols governing how blocks exchange messages, and coordinate with each other to realize various system functions. In this paper, the above problem is formulated as a constraint satisfaction problem, which is then fed to a satisfiability modulo theories (SMT) solver. The solutions returned by the SMT solver are used to extract the models that accept the input traces. In the experiments, we demonstrate the proposed approach with traces collected from a transaction-level simulation model of a multicore SoC design and a trace of a more detailed multicore SoC modeled in GEM5.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130297321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HammerFilter: Robust Protection and Low Hardware Overhead Method for RowHammer
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00043
Kwangrae Kim, Jeonghyun Woo, Junsu Kim, Ki-Seok Chung
The continuous scaling-down of the dynamic random access memory (DRAM) manufacturing process has made it possible to improve DRAM density. However, it makes small DRAM cells susceptible to electromagnetic interference from nearby cells. Unless DRAM cells are adequately isolated from each other, frequent switching accesses to some cells may lead to unintended bit flips in adjacent cells. This phenomenon is commonly referred to as RowHammer. It is often considered a security issue because unusually frequent accesses to a small set of rows generated by malicious attacks can cause bit flips; such bit flips may also be caused by general applications. Although several solutions have been proposed, most approaches either incur excessive area overhead or exhibit limited prevention capabilities against maliciously crafted attack patterns. Therefore, the goals of this study are (1) to mitigate RowHammer even when the number of aggressor rows increases and attack patterns become complicated, and (2) to implement the method with a low area overhead. We propose HammerFilter, a robust hardware-based protection method against RowHammer attacks with a low hardware cost, which employs a modified version of the counting Bloom filter. It tracks all attacking rows efficiently by leveraging the fact that the counting Bloom filter is a space-efficient data structure, and we add an operation, HALF-DELETE, to mitigate the energy overhead. According to our experimental results, the proposed method completely prevents bit flips when facing artificially crafted attack patterns (five patterns in our experiments), whereas state-of-the-art probabilistic solutions mitigate less than 56% of bit flips on average. Furthermore, the proposed method has a much lower area cost than existing counter-based solutions (40.6× better than TWiCe and 2.3× better than Graphene).
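The core data structure is easy to prototype. A minimal functional sketch follows (hash functions, sizes, and the refresh threshold are illustrative, not the paper's parameters); the HALF-DELETE-style decay halves counters rather than clearing them, so hot aggressor rows remain above the detection threshold:

```python
# Software model of a counting Bloom filter tracking row activations.
import hashlib

class CountingBloomFilter:
    def __init__(self, n_counters=1024, n_hashes=3):
        self.counters = [0] * n_counters
        self.n_hashes = n_hashes

    def _indexes(self, row: int):
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{row}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % len(self.counters)

    def insert(self, row: int):
        for idx in self._indexes(row):
            self.counters[idx] += 1

    def estimate(self, row: int) -> int:
        # Upper bound on the activation count of `row` (no false negatives).
        return min(self.counters[idx] for idx in self._indexes(row))

    def half_delete(self):
        # HALF-DELETE-style decay: halve every counter instead of
        # resetting, so sustained aggressors stay detectable.
        self.counters = [c // 2 for c in self.counters]

cbf = CountingBloomFilter()
for _ in range(50000):
    cbf.insert(0x1A2B)                 # heavily hammered row
if cbf.estimate(0x1A2B) > 32768:       # hypothetical refresh threshold
    print("issue neighbor-row refresh")
```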
{"title":"HammerFilter: Robust Protection and Low Hardware Overhead Method for RowHammer","authors":"Kwangrae Kim, Jeonghyun Woo, Junsu Kim, Ki-Seok Chung","doi":"10.1109/ICCD53106.2021.00043","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00043","url":null,"abstract":"The continuous scaling-down of the dynamic random access memory (DRAM) manufacturing process has made it possible to improve DRAM density. However, it makes small DRAM cells susceptible to electromagnetic interference between nearby cells. Unless DRAM cells are adequately isolated from each other, the frequent switching access of some cells may lead to unintended bit flips in adjacent cells. This phenomenon is commonly referred to as RowHammer. It is often considered a security issue because unusually frequent accesses to a small set of rows generated by malicious attacks can cause bit flips. Such bit flips may also be caused by general applications. Although several solutions have been proposed, most approaches either incur excessive area overhead or exhibit limited prevention capabilities against maliciously crafted attack patterns. Therefore, the goals of this study are (1) to mitigate RowHammer, even when the number of aggressor rows increases and attack patterns become complicated, and (2) to implement the method with a low area overhead.We propose a robust hardware-based protection method for RowHammer attacks with a low hardware cost called HammerFilter, which employs a modified version of the counting bloom filter. It tracks all attacking rows efficiently by leveraging the fact that the counting bloom filter is a space-efficient data structure, and we add an operation, HALF-DELETE, to mitigate the energy overhead. According to our experimental results, the proposed method can completely prevent bit flips when facing artificially crafted attack patterns (five patterns in our experiments), whereas state-of-the-art probabilistic solutions can only mitigate less than 56% of bit flips on average. Furthermore, the proposed method has a much lower area cost compared to existing counter-based solutions (40.6× better than TWiCe and 2.3× better than Graphene).","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133106368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Smart-DNN: Efficiently Reducing the Memory Requirements of Running Deep Neural Networks on Resource-constrained Platforms
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00087
Zhenbo Hu, Xiangyu Zou, Wen Xia, Yuhong Zhao, Weizhe Zhang, Donglei Wu
Deep neural networks (DNNs) have gained considerable attention in various real-world applications due to their strong performance in representation learning. However, running a DNN requires tremendous memory resources, which significantly restricts DNNs from being applicable on resource-constrained platforms (e.g., IoT and mobile devices). Lightweight DNNs accommodate the characteristics of mobile devices, but the hardware resources of mobile and IoT devices are extremely limited, and the resource consumption of lightweight models needs to be reduced further. However, current neural network compression approaches (e.g., pruning, quantization, and knowledge distillation) work poorly on lightweight DNNs, which are already simplified. In this paper, we present a novel framework called Smart-DNN, which can efficiently reduce the memory requirements of running DNNs on resource-constrained platforms. Specifically, we slice a neural network into several segments and use SZ error-bounded lossy compression to compress each segment separately while keeping the network structure unchanged. When running the network, we first store the compressed network in memory and then partially decompress the corresponding parts layer by layer. According to experimental results on four popular lightweight DNNs (commonly used on resource-constrained platforms), Smart-DNN reduces memory usage to 1/10∼1/5 of the original, while only slightly sacrificing inference accuracy, leaving the neural network structure unchanged, and incurring acceptable extra runtime overhead.
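The segment-wise scheme can be mimicked in a few lines. The sketch below substitutes a uniform error-bounded quantizer for the SZ compressor (an assumption for brevity; SZ's actual predictor-based pipeline is more involved) to show the slice, compress, and decompress-on-demand flow:

```python
# Sketch of segment-wise, error-bounded compression; a uniform quantizer
# stands in for SZ, and the layer layout is a toy example.
import numpy as np

def compress_segment(w: np.ndarray, error_bound: float):
    """Quantize so that |w - decompressed| <= error_bound elementwise."""
    step = 2.0 * error_bound
    q = np.round(w / step).astype(np.int16)   # small ints compress well
    return q, step

def decompress_segment(q: np.ndarray, step: float) -> np.ndarray:
    return q.astype(np.float32) * step

weights = np.random.randn(4, 256, 256).astype(np.float32)  # toy "layers"
segments = [compress_segment(layer, 1e-3) for layer in weights]

# Decompress one segment on demand, layer by layer, at inference time.
layer0 = decompress_segment(*segments[0])
assert np.abs(layer0 - weights[0]).max() <= 1e-3 + 1e-5
```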
{"title":"Smart-DNN: Efficiently Reducing the Memory Requirements of Running Deep Neural Networks on Resource-constrained Platforms","authors":"Zhenbo Hu, Xiangyu Zou, Wen Xia, Yuhong Zhao, Weizhe Zhang, Donglei Wu","doi":"10.1109/ICCD53106.2021.00087","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00087","url":null,"abstract":"Deep neural networks (DNNs) have gained considerable attention in various real-world applications due to their strong performance in representation learning. However, running a DNN needs tremendous memory resources, which significantly restricts DNN from being applicable on resource-constrained platforms (e.g., IoT, mobile devices, etc.). Lightweight DNNs can accommodate the characteristics of mobile devices, but the hardware resources of mobile or IoT devices are extremely limited, and the resource consumption of lightweight models needs to be further reduced. However, the current neural network compression approaches (i.e., pruning, quantization, knowledge distillation, etc.) works poorly on the lightweight DNNs, which are already simplified. In this paper, we present a novel framework called Smart-DNN, which can efficiently reduce the memory requirements of running DNNs on resource-constrained platforms. Specifically, we slice a neural network into several segments and use SZ error-bounded lossy compression to compress each segment separately while keeping the network structure unchanged. When running a network, we first store the compressed network into memory and then partially decompress the corresponding part layer by layer. According to experimental results on four popular lightweight DNNs (usually used in resource-constrained platforms), Smart-DNN achieves memory saving of 1/10∼1/5, while slightly sacrificing inference accuracy and unchanging the neural network structure with accepted extra runtime overhead.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"22 15","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113976420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Flexible Instruction Set Architecture for Programmable Look-up Table based Processing-in-Memory
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00022
Mark Connolly, Purab Ranjan Sutradhar, Mark A. Indovina, A. Ganguly
Processing in Memory (PIM) is an emerging computing paradigm that is still in its nascent stage of development. Consequently, there is a notable lack of standardized and modular Instruction Set Architectures (ISAs) for PIM devices. In this work, we present the design of an ISA that primarily targets a recent programmable Look-up Table (LUT) based PIM architecture. Our ISA performs three major tasks: (i) controlling the flow of data between the memory and the PIM units, (ii) reprogramming the LUTs to perform the various operations required by a particular application, and (iii) executing sequential steps of operation within the PIM device. A microcoded architecture for the Controller/Sequencer unit ensures minimal circuit overhead and offers the programmability to support any custom operation. We provide a case study of CNN inference, large matrix multiplications, and bitwise computations on the PIM architecture equipped with our ISA and present performance evaluations based on this setup. We also compare the performance with several other PIM architectures.
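The three tasks map naturally onto a small microcoded interpreter. The toy model below (hypothetical opcodes and fields, not the paper's encoding) shows a controller loop that moves data, reprograms a LUT, and executes a sequential PIM step:

```python
# Illustrative functional model of a microcoded controller for a
# LUT-based PIM ISA.
from enum import Enum, auto

class Op(Enum):
    MOVE = auto()      # task (i): move data between memory and PIM registers
    PROG_LUT = auto()  # task (ii): reprogram a LUT with a new truth table
    EXEC = auto()      # task (iii): run one sequential PIM step
    HALT = auto()

def run(program, luts, regs, mem):
    pc = 0
    while True:
        op, *args = program[pc]
        if op is Op.HALT:
            return regs
        elif op is Op.MOVE:
            dst, src = args
            regs[dst] = mem[src]
        elif op is Op.PROG_LUT:
            lut_id, table = args
            luts[lut_id] = table          # e.g., a 16-entry truth table
        elif op is Op.EXEC:
            lut_id, dst, src = args
            regs[dst] = luts[lut_id][regs[src]]
        pc += 1

# Program a LUT to square a 4-bit value, then apply it to mem[0].
prog = [(Op.PROG_LUT, 0, [x * x for x in range(16)]),
        (Op.MOVE, "r0", 0),
        (Op.EXEC, 0, "r1", "r0"),
        (Op.HALT,)]
print(run(prog, luts={}, regs={}, mem={0: 7})["r1"])  # -> 49
```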
{"title":"Flexible Instruction Set Architecture for Programmable Look-up Table based Processing-in-Memory","authors":"Mark Connolly, Purab Ranjan Sutradhar, Mark A. Indovina, A. Ganguly","doi":"10.1109/ICCD53106.2021.00022","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00022","url":null,"abstract":"Processing in Memory (PIM) is a recent novel computing paradigm that is still in its nascent stage of development. Therefore, there has been an observable lack of standardized and modular Instruction Set Architectures (ISA) for the PIM devices. In this work, we present the design of an ISA which is primarily aimed at a recent programmable Look-up Table (LUT) based PIM architecture. Our ISA performs the three major tasks of i) controlling the flow of data between the memory and the PIM units, ii) reprogramming the LUTs to perform various operations required for a particular application, and iii) executing sequential steps of operation within the PIM device. A microcoded architecture of the Controller/Sequencer unit ensures minimum circuit overhead as well as offers programmability to support any custom operation. We provide a case study of CNN inferences, large matrix multiplications, and bitwise computations on the PIM architecture equipped with our ISA and present performance evaluations based on this setup. We also compare the performances with several other PIM architectures.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124857380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NIST-Lite: Randomness Testing of RNGs on an Energy-Constrained Platform
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00019
Cheng-Yen Lee, K. Bharathi, Joellen S. Lansford, S. Khatri
Random Number Generators (RNGs) are an essential part of many embedded applications and are used for security, encryption, and built-in test. The output of an RNG can be tested for randomness using the well-known NIST statistical test suite. Embedded applications using True Random Number Generators (TRNGs) need to test the randomness of their TRNGs periodically, because their randomness properties can drift over time. Using the full NIST test suite is impractical for this purpose, because it is computationally intensive, and embedded systems (especially real-time systems) often have stringent constraints on the energy and runtime of the programs executed on them. In this paper, we propose novel algorithms to select the most effective subset of the NIST test suite that works within specified runtime and energy budgets. To achieve this, we rank the NIST tests based on multiple metrics, including p-value/Time, p-value/Energy, p-value/Time², and p-value/Energy². Based on the total runtime or energy constraint specified by the user, our algorithms then choose a subset of the NIST tests using this rank order. We call this subset NIST-Lite. Our algorithms also account for the runtime and energy required to generate (on the same platform) the random sequences that the NIST-Lite tests consume. We evaluate the effectiveness of our method against the full NIST test suite (referred to as NIST-Full) and against a greedily chosen subset of the NIST test suite (referred to as NIST-Greedy). We explore different variants of NIST-Lite. On average, using the same input sequences, the p-value obtained for the 4 best variants of NIST-Lite is 2× and 7× better than the p-values of NIST-Full and NIST-Greedy, respectively. NIST-Lite also achieves a 158× (204×) runtime (energy) reduction compared to NIST-Full. Further, we study the performance of NIST-Lite and NIST-Full on deterministic (non-random) input sequences. For such sequences, the pass rate of the NIST-Lite tests is within 16% of the pass rate of NIST-Full on the same sequences, indicating that NIST-Lite has a diagnostic ability similar to that of NIST-Full.
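The selection step reduces to a budgeted ranking problem. A minimal sketch follows (the test profile and budget are illustrative numbers, not measurements from the paper), ranking by p-value per unit time and picking greedily until the budget is spent:

```python
# Sketch of budgeted test selection for one metric (p-value/Time).

tests = [  # (name, mean_p_value, runtime_seconds) -- hypothetical profile
    ("frequency",         0.48, 0.2),
    ("runs",              0.45, 0.4),
    ("fft",               0.47, 3.1),
    ("linear_complexity", 0.46, 9.8),
]

def select_nist_lite(tests, time_budget):
    # Rank by benefit/cost, then take tests while the budget allows.
    ranked = sorted(tests, key=lambda t: t[1] / t[2], reverse=True)
    chosen, used = [], 0.0
    for name, p, cost in ranked:
        if used + cost <= time_budget:
            chosen.append(name)
            used += cost
    return chosen

print(select_nist_lite(tests, time_budget=1.0))  # -> ['frequency', 'runs']
```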
{"title":"NIST-Lite: Randomness Testing of RNGs on an Energy-Constrained Platform","authors":"Cheng-Yen Lee, K. Bharathi, Joellen S. Lansford, S. Khatri","doi":"10.1109/ICCD53106.2021.00019","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00019","url":null,"abstract":"Random Number Generators (RNGs) are an essential part of many embedded applications and are used for security, encryption, and built-in test applications. The output of RNGs can be tested for randomness using the well-known NIST statistical test suite. Embedded applications using True Random Number Generators (TRNGs) need to test the randomness of their TRNGs periodically, because their randomness properties can drift over time. Using the full NIST test suite is unpracticed for this purpose, because the full NIST test suite is computationally intensive, and embedded systems (especially real-time systems) often have stringent constraints on the energy and runtime of the programs that are executed on them. In this paper, we propose novel algorithms to select the most effective subset of the NIST test suite, which works within specified runtime and energy budgets. To achieve this, we rank the NIST tests based on multiple metrics, including p-value/Time, p-value/Energy, p-value/Time2 and p-value/Energy2. Based on the total runtime or energy constraint specified by the user, our algorithms proceed to choose a subset of the NIST tests using this rank order. We call this subset of NIST tests as NIST-Lite. Our algorithms also take into account the runtime and energy required to generate the random sequences required (on the same platform) by the NIST-Lite tests. We evaluate the effectiveness of our method against the full NIST test suite (referred to as NIST-Full) and also against a greedily chosen subset of the NIST test suite (referred to as NIST-Greedy). We explore different variants of NIST-Lite. On average, using the same input sequences, the p-value obtained for the 4 best variants of NIST-Lite is 2× and 7× better than the p-value of NIST-Full and NIST-Greedy respectively. NIST-Lite also achieves 158× (204×) runtime (energy) reduction compared to the NIST-Full. Further, we study the performance of NIST-Lite and NIST-Full for deterministic (non-random) input sequences. For such sequences, the pass rate of the NIST-Lite tests is within 16% of the pass rate of NIST-Full on the same sequences, indicating that our NIST-Lite tests have a similar diagnostic ability as NIST-Full.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127804822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Comprehensive Exploration of the Parallel Prefix Adder Tree Space
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00030
Teodor-Dumitru Ene, J. Stine
Parallel prefix tree adders allow for high-performance computation due to their logarithmic delay. The modern literature focuses on a well-known group of adder tree networks, with existing adder taxonomies unable to adequately describe intermediary structures. Efforts to explore novel structures focus mainly on hybridizing these widely studied networks. This paper presents a method of generating any valid adder tree network using a set of three simple, point-targeted transforms. This method enables possibilities such as the generation and classification of any hybrid or novel architecture, and the incremental refinement of pre-existing structures to better meet performance targets. Synthesis implementation results are presented on the SkyWater 90nm technology.
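For background, every network in this design space computes the same associative prefix scan over (generate, propagate) pairs; the networks differ only in how the scan is structured. The sketch below (illustrative background, not the paper's transform framework) shows the prefix operator and one classic point in the space, a Kogge-Stone scan:

```python
# Functional model of the carry-prefix computation that all adder tree
# networks realize in hardware.

def o(left, right):
    """Associative prefix operator on (generate, propagate) pairs."""
    g_l, p_l = left
    g_r, p_r = right
    return (g_r | (p_r & g_l), p_r & p_l)

def kogge_stone_carries(gp):
    """One classic network: a Kogge-Stone (Hillis-Steele) scan."""
    n = len(gp)
    gp = list(gp)
    dist = 1
    while dist < n:
        gp = [gp[i] if i < dist else o(gp[i - dist], gp[i])
              for i in range(n)]
        dist *= 2
    return [g for g, _ in gp]  # carry-out of each bit position

# 4-bit example: carries of a + b with no carry-in.
a, b = 0b1011, 0b0110
gp = [((a >> i & 1) & (b >> i & 1), (a >> i & 1) ^ (b >> i & 1))
      for i in range(4)]
print(kogge_stone_carries(gp))  # -> [0, 1, 1, 1]
```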
{"title":"A Comprehensive Exploration of the Parallel Prefix Adder Tree Space","authors":"Teodor-Dumitru Ene, J. Stine","doi":"10.1109/ICCD53106.2021.00030","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00030","url":null,"abstract":"Parallel prefix tree adders allow for high- performance computation due to their logarithmic delay. Modern literature focuses on a well-known group of adder tree networks, with adder taxonomies unable to adequately describe intermediary structures. Efforts to explore novel structures focus mainly on the hybridization of these widely-studied networks. This paper presents a method of generating any valid adder tree network by using a set of three, simple, point-targeted transforms. This method allows for possibilities such as the generation and classification of any hybrid or novel architecture, or the incremental refinement of pre-existing structures to better meet performance targets. Synthesis implementation results are presented on the SkyWater 90nm technology.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115846947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploiting Online Locality and Reduction Parallelism for Sampled Dense Matrix Multiplication on GPUs
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00092
Zhongming Yu, Guohao Dai, Guyue Huang, Yu Wang, Huazhong Yang
Sampled Dense-Dense Matrix Multiplication (SDDMM) is a core component of many machine learning systems. SDDMM exposes a substantial amount of parallelism that favors throughput-oriented architectures like the GPU. However, accelerating it on GPUs is challenging in two respects: the poor memory access locality caused by the sparse sampling matrix, and the poor parallelism caused by the dot-product reduction of vectors from the two dense matrices. To address both challenges, we present PRedS, which boosts SDDMM efficiency with a suite of Parallel Reduction Scheduling optimizations. PRedS uses Vectorized Coarsen 1-Dimensional Tiling (VCT) to improve the online locality of loading the dense matrix. PRedS uses Integrated Interleaving Reduction (IIR) to increase thread occupancy in the parallel reduction. PRedS also leverages Warp-Merged Tiling (WMT) to preserve occupancy and parallelism when reducing very long arrays. Enhanced with GPU-intrinsic vectorized memory loading, PRedS achieves a geometric-mean speedup of 29.20× over the vendor library, and up to an 8.31× speedup over state-of-the-art implementations on the SuiteSparse benchmark.
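The kernel itself is compact. A NumPy/SciPy reference sketch follows (a functional model of SDDMM, not PRedS's GPU implementation); the per-nonzero dot product on the commented line is the reduction that VCT/IIR/WMT schedule on the GPU:

```python
# Reference model: for each nonzero S[i, j], scale it by the dot
# product of row A[i, :] and column B[:, j].
import numpy as np
from scipy.sparse import random as sparse_random

def sddmm(S, A, B):
    """out[i, j] = S[i, j] * dot(A[i, :], B[:, j]) on the nonzeros of S."""
    S = S.tocoo()
    vals = np.empty_like(S.data)
    for k, (i, j) in enumerate(zip(S.row, S.col)):
        vals[k] = S.data[k] * (A[i, :] @ B[:, j])  # the scheduled reduction
    return type(S)((vals, (S.row, S.col)), shape=S.shape)

m, n, d = 128, 96, 32
S = sparse_random(m, n, density=0.05, format="coo")
A = np.random.rand(m, d)
B = np.random.rand(d, n)
out = sddmm(S, A, B)
# Dense check against a masked matrix product.
assert np.allclose(out.toarray(), S.toarray() * (A @ B))
```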
{"title":"Exploiting Online Locality and Reduction Parallelism for Sampled Dense Matrix Multiplication on GPUs","authors":"Zhongming Yu, Guohao Dai, Guyue Huang, Yu Wang, Huazhong Yang","doi":"10.1109/ICCD53106.2021.00092","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00092","url":null,"abstract":"Sampled Dense-Dense Matrix Multiplication (SDDMM) is a core component of many machine learning systems. SDDMM exposes a substantial amount of parallelism that favors throughput-oriented architectures like the GPU. However, accelerating it on GPUs is challenging in two aspects: the poor memory access locality caused by the sparse sampling matrix with the poor parallelism caused by the dot-product reduction of vectors in two dense matrices. To address both challenges, we present PRedS to boost SDDMM efficiency with a suite of Parallel Reduction Scheduling optimizations. PRedS uses Vectorized Coarsen 1-Dimensional Tiling (VCT) to benefit the online locality of loading the dense matrix. PRedS uses Integrated Interleaving Reduction (IIR) to increase thread occupancy in the parallel reduction. PRedS also leverages Warp-Merged Tiling (WMT) to preserve occupancy and parallelism when reducing very long arrays. Enhanced with GPU-intrinsic vectorized memory loading, PRedS achieves a geometric speedup of 29.20× compared to the vendor library. PRedS achieves up to 8.31× speedup over state-of-the-art implementations on the SuiteSparse benchmark.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116570257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EFM: Elastic Flash Management to Enhance Performance of Hybrid Flash Memory
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00035
Bingzhe Li, Bo Yuan, D. Du
NAND-based flash memory has become a prevalent storage medium due to its low access latency and high performance. By setting up different incremental step pulse programming (ISPP) values and threshold voltages, the tradeoff between lifetime and access latency in NAND-based flash memory can be exploited. Existing studies that exploit this tradeoff using heuristic algorithms do not consider the access latency that changes dynamically due to wear-out, resulting in low access performance. In this paper, we propose a new Elastic Flash Management scheme, called EFM, to manage data in hybrid flash memory, which consists of multiple physical regions with different read/write latencies according to their ISPP values and threshold voltages. EFM includes a Long-Term Classifier (LT-Classifier) and a Short-Term Classifier (ST-Classifier) to accurately track dynamically changing workloads by considering the current quantitative differences in read/write latencies and workload access patterns. Moreover, reduced-effective-wearing management is proposed to prolong the lifetime of flash memory by scheduling write-intensive workloads to the region with a reduced threshold voltage and the lowest write cost. Experimental results indicate that EFM reduces the average read/write latencies by about 54%–296% and obtains a 17.7% lifetime improvement on average compared to existing studies.
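The placement decision can be phrased as a small cost model. The sketch below (illustrative region parameters, not measured ISPP settings) routes each classified stream to the region minimizing its expected access cost, which naturally sends write-intensive streams to the cheap-write region:

```python
# Toy scheduling policy over hybrid flash regions with different
# per-access latencies.

REGIONS = {  # name -> (read_cost_us, write_cost_us) under its ISPP setting
    "fast_write": (90, 200),   # reduced threshold voltage, cheap writes
    "fast_read":  (50, 600),
    "balanced":   (70, 350),
}

def place(stream_reads: int, stream_writes: int) -> str:
    """Pick the region minimizing this stream's expected access cost."""
    return min(REGIONS, key=lambda r: stream_reads * REGIONS[r][0]
                                      + stream_writes * REGIONS[r][1])

print(place(stream_reads=100, stream_writes=900))  # -> fast_write
print(place(stream_reads=950, stream_writes=50))   # -> fast_read
```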
{"title":"EFM: Elastic Flash Management to Enhance Performance of Hybrid Flash Memory","authors":"Bingzhe Li, Bo Yuan, D. Du","doi":"10.1109/ICCD53106.2021.00035","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00035","url":null,"abstract":"NAND-based flash memory has become a prevalent storage media due to its low access latency and high performance. By setting up different incremental step pulse programming (ISPP) values and threshold voltages, the tradeoffs between lifetime and access latency in NAND-based flash memory can be exploited. The existing studies that exploit the tradeoffs by using heuristic algorithms do not consider the dynamically changed access latency due to wearing-out, resulting in low access performance. In this paper, we proposed a new Elastic Flash Management scheme, called EFM, to manage data in hybrid flash memory, which consists of multiple physical regions with different read/write latencies according to their ISPP values and threshold voltages. EFM includes a Long-Term Classifier (LT-Classifier) and a Short-Term Classifier (ST-Classifier) to accurately track dynamically changed workloads by considering current quantitative differences of read/write latencies and workload access patterns. Moreover, a reduced effective wearing management is proposed to prolong the lifetime of flash memory by scheduling write-intensive workloads to the region with a reduced threshold voltage and the lowest write cost. Experimental results indicate that EFM reduces the average read/write latencies by about 54% - 296% and obtain 17.7% lifetime improvement on average compared to the existing studies.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128703823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CROP: FPGA Implementation of High-Performance Polynomial Multiplication in Saber KEM based on Novel Cyclic-Row Oriented Processing Strategy
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00031
Jiafeng Xie, Pengzhou He, Chiou-Yng Lee
The rapid advancement of quantum technology has initiated a new round of post-quantum cryptography (PQC) exploration. The key encapsulation mechanism (KEM) Saber is an important module-lattice-based PQC scheme, which has been selected as one of the PQC finalists in the ongoing National Institute of Standards and Technology (NIST) standardization process. However, efficient hardware implementation of KEM Saber has not been well covered in the literature. In this paper, therefore, we propose a novel cyclic-row oriented processing (CROP) strategy for efficient implementation of the key arithmetic operation of KEM Saber, i.e., polynomial multiplication. The proposed work consists of three layers of interdependent effort: (i) first, we formulate the main operation of KEM Saber into the mathematical forms needed to develop the CROP-based algorithms, i.e., a basic version and an advanced higher-speed version; (ii) then, we follow the proposed CROP strategy to transfer the two derived algorithms into the desired polynomial multiplication structures with the help of a series of algorithm-architecture co-implementation techniques; (iii) finally, detailed complexity analysis and implementation results show that the proposed polynomial multiplication structures have better area-time complexities than state-of-the-art solutions. Specifically, field-programmable gate array (FPGA) implementation results show that the proposed design, e.g., the basic version, has at least 11.2% lower area-delay product (ADP) than the best competing one (Cyclone V device). The proposed high-performance polynomial multipliers offer not only efficient delivery of output results but also the low-complexity features brought by the CROP strategy. The outcome of this work is expected to provide useful references for the further development and standardization of KEM Saber.
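The arithmetic being accelerated is multiplication in Saber's ring R_q = Z_q[x]/(x^n + 1) with n = 256 and q = 2^13. A schoolbook functional model follows (a correctness reference only, not the cyclic-row datapath itself); the wrap-around with a sign flip encodes the reduction x^n ≡ -1:

```python
# Functional model of polynomial multiplication in Z_q[x]/(x^N + 1),
# the ring used by Saber (N = 256, q = 2^13).
import random

N, Q = 256, 1 << 13

def poly_mul(a, b):
    """c = a * b mod (x^N + 1, Q); terms past x^(N-1) wrap with a sign flip."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k = i + j
            if k < N:
                c[k] = (c[k] + a[i] * b[j]) % Q
            else:
                c[k - N] = (c[k - N] - a[i] * b[j]) % Q  # negacyclic wrap
    return c

a = [random.randrange(Q) for _ in range(N)]
b = [random.randrange(Q) for _ in range(N)]
c = poly_mul(a, b)
assert len(c) == N and all(0 <= x < Q for x in c)
```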
{"title":"CROP: FPGA Implementation of High-Performance Polynomial Multiplication in Saber KEM based on Novel Cyclic-Row Oriented Processing Strategy","authors":"Jiafeng Xie, Pengzhou He, Chiou-Yng Lee","doi":"10.1109/ICCD53106.2021.00031","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00031","url":null,"abstract":"The rapid advancement in quantum technology has initiated a new round of post-quantum cryptography (PQC) related exploration. The key encapsulation mechanism (KEM) Saber is an important module lattice-based PQC, which has been selected as one of the PQC finalists in the ongoing National Institute of Standards and Technology (NIST) standardization process. On the other hand, however, efficient hardware implementation of KEM Saber has not been well covered in the literature. In this paper, therefore, we propose a novel cyclic-row oriented processing (CROP) strategy for efficient implementation of the key arithmetic operation of KEM Saber, i.e., the polynomial multiplication. The proposed work consists of three layers of interdependent efforts: (i) first of all, we have formulated the main operation of KEM Saber into desired mathematical forms to be further developed into CROP based algorithms, i.e., the basic version and the advanced higher-speed version; (ii) then, we have followed the proposed CROP strategy to innovatively transfer the derived two algorithms into desired polynomial multiplication structures with the help of a series of algorithm-architecture co-implementation techniques; (iii) finally, detailed complexity analysis and implementation results have shown that the proposed polynomial multiplication structures have better area-time complexities than the state-of-the-art solutions. Specifically, the field-programmable gate array (FPGA) implementation results show that the proposed design, e.g., the basic version has at least less 11.2% area-delay product (ADP) than the best competing one (Cyclone V device). The proposed high-performance polynomial multipliers offer not only efficient operation for output results delivery but also possess low-complexity feature brought by CROP strategy. The outcome of this work is expected to provide useful references for further development and standardization process of KEM Saber.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125940752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}