Block-LSM: An Ether-aware Block-ordered LSM-tree based Key-Value Storage Engine
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00017
Zehao Chen, Bingzhe Li, Xiaojun Cai, Zhiping Jia, Zhaoyan Shen, Yi Wang, Z. Shao
Ethereum, as one of the largest blockchain systems, plays an important role in distributed ledgers, database systems, and related applications. As more and more blocks are mined, the storage burden of Ethereum increases significantly. The current Ethereum system uniformly transforms all of its data into key-value (KV) items and stores them in the underlying Log-Structured Merge-tree (LSM-tree) storage engine, ignoring the software semantics. Consequently, it not only exacerbates the write amplification of the storage engine but also hurts the performance of Ethereum. In this paper, we propose a new Ethereum-aware storage model called Block-LSM, which significantly improves the data synchronization of the Ethereum system. Specifically, we first design a shared-prefix scheme that transforms Ethereum data into ordered KV pairs to alleviate the key-range overlaps between different levels of the underlying LSM-tree based storage engine. Moreover, we propose to maintain several semantics-oriented memory buffers to isolate different kinds of Ethereum data. To save space overhead, Block-LSM further aggregates multiple blocks into a group and assigns the same prefix to all KV items from the same block group. Finally, we implement Block-LSM in a real Ethereum environment and conduct a series of experiments. The evaluation results show that Block-LSM reduces storage write amplification by up to 3.7× and increases throughput by 3× compared with the original Ethereum design.
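The shared-prefix idea lends itself to a short sketch. Below is a minimal illustration (the grouping factor and key names are hypothetical, not the paper's code) of how a fixed-width block-group prefix keeps KV items in block order under the byte-wise key comparison used by LSM-tree engines such as LevelDB:

```python
# Illustrative sketch (not the paper's implementation): prepend a
# monotonically increasing block-group prefix so KV items from nearby
# blocks sort together, keeping key ranges of adjacent LSM-tree levels
# largely disjoint.

BLOCKS_PER_GROUP = 64  # hypothetical grouping factor

def make_key(block_number: int, raw_key: bytes) -> bytes:
    """Prefix a raw Ethereum key with its block-group ID."""
    group_id = block_number // BLOCKS_PER_GROUP
    # A fixed-width big-endian prefix preserves numeric ordering
    # under byte-wise key comparison.
    return group_id.to_bytes(8, "big") + raw_key

# Example: keys from blocks 0-63 share one prefix, 64-127 the next.
k1 = make_key(10, b"state:0xabc")
k2 = make_key(70, b"state:0xdef")
assert k1 < k2  # group ordering dominates the comparison
```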
{"title":"Block-LSM: An Ether-aware Block-ordered LSM-tree based Key-Value Storage Engine","authors":"Zehao Chen, Bingzhe Li, Xiaojun Cai, Zhiping Jia, Zhaoyan Shen, Yi Wang, Z. Shao","doi":"10.1109/ICCD53106.2021.00017","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00017","url":null,"abstract":"Ethereum as one of the largest blockchain systems plays an important role in the distributed ledger, database systems, etc. As more and more blocks are mined, the storage burden of Ethereum is significantly increased. The current Ethereum system uniformly transforms all its data into key-value (KV) items and stores them to the underlying Log-Structure Merged tree (LSM-tree) storage engine ignoring the software semantics. Consequently, it not only exacerbates the write amplification effect of the storage engine but also hurts the performance of Ethereum. In this paper, we proposed a new Ethereum-aware storage model called Block-LSM, which significantly improves the data synchronization of the Ethereum system. Specifically, we first design a shared prefix scheme to transform Ethereum data into ordered KV pairs to alleviate the key range overlaps of different levels in the underlying LSM-tree based storage engine. Moreover, we propose to maintain several semantic-orientated memory buffers to isolate different kinds of Ethereum data. To save space overhead, Block-LSM further aggregates multiple blocks into a group and assigns the same prefix to all KV items from the same block group. Finally, we implement Block-LSM in the real Ethereum environment and conduct a series of experiments. The evaluation results show that Block-LSM significantly reduces up to 3.7× storage write amplification and increases throughput by 3× compared with the original Ethereum design.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114309622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Model Synthesis for Communication Traces of System Designs
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00082
Hao Zheng, Md Rubel Ahmed, P. Mukherjee, M. Ketkar, Jin Yang
Concise and abstract models of system-level behaviors are invaluable in design analysis, testing, and validation. In this paper, we consider the problem of inferring models from communication traces of system-on-chip (SoC) designs. The traces capture communications among different blocks of a system design in terms of the messages exchanged. The extracted models characterize the system-level communication protocols governing how blocks exchange messages and coordinate with each other to realize various system functions. We formulate the above problem as a constraint satisfaction problem, which is then fed to a satisfiability modulo theories (SMT) solver. The solutions returned by the SMT solver are used to extract models that accept the input traces. In the experiments, we demonstrate the proposed approach on traces collected from a transaction-level simulation model of a multicore SoC design and on a trace of a more detailed multicore SoC modeled in gem5.
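The encoding admits a compact illustration. The toy formulation below (using the Z3 SMT solver, with a hypothetical message alphabet and a fixed model size; not the paper's exact constraints) searches for a small transition function under which every observed trace is accepted:

```python
# Minimal sketch of trace-accepting model synthesis with Z3: find a
# 3-state transition function such that each trace starts and ends in
# state 0 and follows consistent transitions in between.
from z3 import Int, Solver, And, Or, sat

N_STATES = 3                       # assumed model size
ALPHABET = ["req", "gnt", "rel"]   # hypothetical message types
traces = [["req", "gnt", "rel"],
          ["req", "gnt", "req", "gnt", "rel"]]

# delta[(s, a)] is the successor of state s on message a.
delta = {(s, a): Int(f"d_{s}_{a}") for s in range(N_STATES) for a in ALPHABET}
solver = Solver()
for v in delta.values():
    solver.add(And(v >= 0, v < N_STATES))

for t, trace in enumerate(traces):
    # State variables along this trace; accept = return to state 0.
    st = [Int(f"s_{t}_{i}") for i in range(len(trace) + 1)]
    solver.add(st[0] == 0, st[-1] == 0)
    for i, msg in enumerate(trace):
        # st[i+1] equals delta(st[i], msg) for whichever state st[i] is.
        solver.add(Or(*[And(st[i] == s, st[i + 1] == delta[(s, msg)])
                        for s in range(N_STATES)]))

if solver.check() == sat:
    m = solver.model()
    for (s, a), v in sorted(delta.items()):
        print(f"delta({s}, {a}) = {m[v]}")
```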
{"title":"Model Synthesis for Communication Traces of System Designs","authors":"Hao Zheng, Md Rubel Ahmed, P. Mukherjee, M. Ketkar, Jin Yang","doi":"10.1109/ICCD53106.2021.00082","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00082","url":null,"abstract":"Concise and abstract models of system-level behaviors are invaluable in design analysis, testing, and validation. In this paper, we consider the problem of inferring models from communication traces of system-on-chip (SoC) designs. The traces capture communications among different blocks of a system design in terms of messages exchanged. The extracted models characterize the system-level communication protocols governing how blocks exchange messages, and coordinate with each other to realize various system functions. In this paper, the above problem is formulated as a constraint satisfaction problem, which is then fed to a satisfiability modulo theories (SMT) solver. The solutions returned by the SMT solver are used to extract the models that accept the input traces. In the experiments, we demonstrate the proposed approach with traces collected from a transaction-level simulation model of a multicore SoC design and a trace of a more detailed multicore SoC modeled in GEM5.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130297321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HammerFilter: Robust Protection and Low Hardware Overhead Method for RowHammer
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00043
Kwangrae Kim, Jeonghyun Woo, Junsu Kim, Ki-Seok Chung
The continuous scaling-down of the dynamic random access memory (DRAM) manufacturing process has made it possible to improve DRAM density. However, it makes small DRAM cells susceptible to electromagnetic interference from nearby cells. Unless DRAM cells are adequately isolated from each other, frequent switching accesses to some cells may lead to unintended bit flips in adjacent cells. This phenomenon is commonly referred to as RowHammer. It is often considered a security issue because unusually frequent accesses to a small set of rows generated by malicious attacks can cause bit flips; such bit flips may also be caused by general applications. Although several solutions have been proposed, most approaches either incur excessive area overhead or exhibit limited prevention capabilities against maliciously crafted attack patterns. Therefore, the goals of this study are (1) to mitigate RowHammer even when the number of aggressor rows increases and attack patterns become complicated, and (2) to implement the method with a low area overhead. We propose HammerFilter, a robust hardware-based protection method against RowHammer attacks with a low hardware cost, which employs a modified version of the counting Bloom filter. It tracks all attacking rows efficiently by leveraging the fact that the counting Bloom filter is a space-efficient data structure, and we add an operation, HALF-DELETE, to mitigate the energy overhead. According to our experimental results, the proposed method completely prevents bit flips when facing artificially crafted attack patterns (five patterns in our experiments), whereas state-of-the-art probabilistic solutions mitigate less than 56% of bit flips on average. Furthermore, the proposed method has a much lower area cost than existing counter-based solutions (40.6× better than TWiCe and 2.3× better than Graphene).
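The core data structure is easy to prototype. A minimal functional sketch follows (hash functions, sizes, and the refresh threshold are illustrative, not the paper's parameters); the HALF-DELETE-style decay halves counters rather than clearing them, so hot aggressor rows remain above the detection threshold:

```python
# Software model of a counting Bloom filter tracking row activations.
import hashlib

class CountingBloomFilter:
    def __init__(self, n_counters=1024, n_hashes=3):
        self.counters = [0] * n_counters
        self.n_hashes = n_hashes

    def _indexes(self, row: int):
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{row}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % len(self.counters)

    def insert(self, row: int):
        for idx in self._indexes(row):
            self.counters[idx] += 1

    def estimate(self, row: int) -> int:
        # Upper bound on the activation count of `row` (no false negatives).
        return min(self.counters[idx] for idx in self._indexes(row))

    def half_delete(self):
        # HALF-DELETE-style decay: halve every counter instead of
        # resetting, so sustained aggressors stay detectable.
        self.counters = [c // 2 for c in self.counters]

cbf = CountingBloomFilter()
for _ in range(50000):
    cbf.insert(0x1A2B)                 # heavily hammered row
if cbf.estimate(0x1A2B) > 32768:       # hypothetical refresh threshold
    print("issue neighbor-row refresh")
```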
{"title":"HammerFilter: Robust Protection and Low Hardware Overhead Method for RowHammer","authors":"Kwangrae Kim, Jeonghyun Woo, Junsu Kim, Ki-Seok Chung","doi":"10.1109/ICCD53106.2021.00043","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00043","url":null,"abstract":"The continuous scaling-down of the dynamic random access memory (DRAM) manufacturing process has made it possible to improve DRAM density. However, it makes small DRAM cells susceptible to electromagnetic interference between nearby cells. Unless DRAM cells are adequately isolated from each other, the frequent switching access of some cells may lead to unintended bit flips in adjacent cells. This phenomenon is commonly referred to as RowHammer. It is often considered a security issue because unusually frequent accesses to a small set of rows generated by malicious attacks can cause bit flips. Such bit flips may also be caused by general applications. Although several solutions have been proposed, most approaches either incur excessive area overhead or exhibit limited prevention capabilities against maliciously crafted attack patterns. Therefore, the goals of this study are (1) to mitigate RowHammer, even when the number of aggressor rows increases and attack patterns become complicated, and (2) to implement the method with a low area overhead.We propose a robust hardware-based protection method for RowHammer attacks with a low hardware cost called HammerFilter, which employs a modified version of the counting bloom filter. It tracks all attacking rows efficiently by leveraging the fact that the counting bloom filter is a space-efficient data structure, and we add an operation, HALF-DELETE, to mitigate the energy overhead. According to our experimental results, the proposed method can completely prevent bit flips when facing artificially crafted attack patterns (five patterns in our experiments), whereas state-of-the-art probabilistic solutions can only mitigate less than 56% of bit flips on average. Furthermore, the proposed method has a much lower area cost compared to existing counter-based solutions (40.6× better than TWiCe and 2.3× better than Graphene).","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133106368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Smart-DNN: Efficiently Reducing the Memory Requirements of Running Deep Neural Networks on Resource-constrained Platforms
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00087
Zhenbo Hu, Xiangyu Zou, Wen Xia, Yuhong Zhao, Weizhe Zhang, Donglei Wu
Deep neural networks (DNNs) have gained considerable attention in various real-world applications due to their strong performance in representation learning. However, running a DNN requires tremendous memory resources, which significantly restricts DNNs from being applicable on resource-constrained platforms (e.g., IoT and mobile devices). Lightweight DNNs accommodate the characteristics of mobile devices, but the hardware resources of mobile and IoT devices are extremely limited, and the resource consumption of lightweight models needs to be reduced further. However, current neural network compression approaches (e.g., pruning, quantization, and knowledge distillation) work poorly on lightweight DNNs, which are already simplified. In this paper, we present a novel framework called Smart-DNN, which can efficiently reduce the memory requirements of running DNNs on resource-constrained platforms. Specifically, we slice a neural network into several segments and use SZ error-bounded lossy compression to compress each segment separately while keeping the network structure unchanged. When running the network, we first store the compressed network in memory and then partially decompress the corresponding parts layer by layer. According to experimental results on four popular lightweight DNNs (commonly used on resource-constrained platforms), Smart-DNN reduces memory usage to 1/10∼1/5 of the original, while only slightly sacrificing inference accuracy, leaving the neural network structure unchanged, and incurring acceptable extra runtime overhead.
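The segment-wise scheme can be mimicked in a few lines. The sketch below substitutes a uniform error-bounded quantizer for the SZ compressor (an assumption for brevity; SZ's actual predictor-based pipeline is more involved) to show the slice, compress, and decompress-on-demand flow:

```python
# Sketch of segment-wise, error-bounded compression; a uniform quantizer
# stands in for SZ, and the layer layout is a toy example.
import numpy as np

def compress_segment(w: np.ndarray, error_bound: float):
    """Quantize so that |w - decompressed| <= error_bound elementwise."""
    step = 2.0 * error_bound
    q = np.round(w / step).astype(np.int16)   # small ints compress well
    return q, step

def decompress_segment(q: np.ndarray, step: float) -> np.ndarray:
    return q.astype(np.float32) * step

weights = np.random.randn(4, 256, 256).astype(np.float32)  # toy "layers"
segments = [compress_segment(layer, 1e-3) for layer in weights]

# Decompress one segment on demand, layer by layer, at inference time.
layer0 = decompress_segment(*segments[0])
assert np.abs(layer0 - weights[0]).max() <= 1e-3 + 1e-5
```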
{"title":"Smart-DNN: Efficiently Reducing the Memory Requirements of Running Deep Neural Networks on Resource-constrained Platforms","authors":"Zhenbo Hu, Xiangyu Zou, Wen Xia, Yuhong Zhao, Weizhe Zhang, Donglei Wu","doi":"10.1109/ICCD53106.2021.00087","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00087","url":null,"abstract":"Deep neural networks (DNNs) have gained considerable attention in various real-world applications due to their strong performance in representation learning. However, running a DNN needs tremendous memory resources, which significantly restricts DNN from being applicable on resource-constrained platforms (e.g., IoT, mobile devices, etc.). Lightweight DNNs can accommodate the characteristics of mobile devices, but the hardware resources of mobile or IoT devices are extremely limited, and the resource consumption of lightweight models needs to be further reduced. However, the current neural network compression approaches (i.e., pruning, quantization, knowledge distillation, etc.) works poorly on the lightweight DNNs, which are already simplified. In this paper, we present a novel framework called Smart-DNN, which can efficiently reduce the memory requirements of running DNNs on resource-constrained platforms. Specifically, we slice a neural network into several segments and use SZ error-bounded lossy compression to compress each segment separately while keeping the network structure unchanged. When running a network, we first store the compressed network into memory and then partially decompress the corresponding part layer by layer. According to experimental results on four popular lightweight DNNs (usually used in resource-constrained platforms), Smart-DNN achieves memory saving of 1/10∼1/5, while slightly sacrificing inference accuracy and unchanging the neural network structure with accepted extra runtime overhead.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"22 15","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113976420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Flexible Instruction Set Architecture for Programmable Look-up Table based Processing-in-Memory
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00022
Mark Connolly, Purab Ranjan Sutradhar, Mark A. Indovina, A. Ganguly
Processing in Memory (PIM) is an emerging computing paradigm that is still in its nascent stage of development. Consequently, there is a notable lack of standardized and modular Instruction Set Architectures (ISAs) for PIM devices. In this work, we present the design of an ISA that primarily targets a recent programmable Look-up Table (LUT) based PIM architecture. Our ISA performs three major tasks: (i) controlling the flow of data between the memory and the PIM units, (ii) reprogramming the LUTs to perform the various operations required by a particular application, and (iii) executing sequential steps of operation within the PIM device. A microcoded architecture for the Controller/Sequencer unit ensures minimal circuit overhead and offers the programmability to support any custom operation. We provide a case study of CNN inference, large matrix multiplications, and bitwise computations on the PIM architecture equipped with our ISA and present performance evaluations based on this setup. We also compare the performance with several other PIM architectures.
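The three tasks map naturally onto a small microcoded interpreter. The toy model below (hypothetical opcodes and fields, not the paper's encoding) shows a controller loop that moves data, reprograms a LUT, and executes a sequential PIM step:

```python
# Illustrative functional model of a microcoded controller for a
# LUT-based PIM ISA.
from enum import Enum, auto

class Op(Enum):
    MOVE = auto()      # task (i): move data between memory and PIM registers
    PROG_LUT = auto()  # task (ii): reprogram a LUT with a new truth table
    EXEC = auto()      # task (iii): run one sequential PIM step
    HALT = auto()

def run(program, luts, regs, mem):
    pc = 0
    while True:
        op, *args = program[pc]
        if op is Op.HALT:
            return regs
        elif op is Op.MOVE:
            dst, src = args
            regs[dst] = mem[src]
        elif op is Op.PROG_LUT:
            lut_id, table = args
            luts[lut_id] = table          # e.g., a 16-entry truth table
        elif op is Op.EXEC:
            lut_id, dst, src = args
            regs[dst] = luts[lut_id][regs[src]]
        pc += 1

# Program a LUT to square a 4-bit value, then apply it to mem[0].
prog = [(Op.PROG_LUT, 0, [x * x for x in range(16)]),
        (Op.MOVE, "r0", 0),
        (Op.EXEC, 0, "r1", "r0"),
        (Op.HALT,)]
print(run(prog, luts={}, regs={}, mem={0: 7})["r1"])  # -> 49
```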
{"title":"Flexible Instruction Set Architecture for Programmable Look-up Table based Processing-in-Memory","authors":"Mark Connolly, Purab Ranjan Sutradhar, Mark A. Indovina, A. Ganguly","doi":"10.1109/ICCD53106.2021.00022","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00022","url":null,"abstract":"Processing in Memory (PIM) is a recent novel computing paradigm that is still in its nascent stage of development. Therefore, there has been an observable lack of standardized and modular Instruction Set Architectures (ISA) for the PIM devices. In this work, we present the design of an ISA which is primarily aimed at a recent programmable Look-up Table (LUT) based PIM architecture. Our ISA performs the three major tasks of i) controlling the flow of data between the memory and the PIM units, ii) reprogramming the LUTs to perform various operations required for a particular application, and iii) executing sequential steps of operation within the PIM device. A microcoded architecture of the Controller/Sequencer unit ensures minimum circuit overhead as well as offers programmability to support any custom operation. We provide a case study of CNN inferences, large matrix multiplications, and bitwise computations on the PIM architecture equipped with our ISA and present performance evaluations based on this setup. We also compare the performances with several other PIM architectures.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124857380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NIST-Lite: Randomness Testing of RNGs on an Energy-Constrained Platform
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00019
Cheng-Yen Lee, K. Bharathi, Joellen S. Lansford, S. Khatri
Random Number Generators (RNGs) are an essential part of many embedded applications and are used for security, encryption, and built-in test. The output of an RNG can be tested for randomness using the well-known NIST statistical test suite. Embedded applications using True Random Number Generators (TRNGs) need to test the randomness of their TRNGs periodically, because their randomness properties can drift over time. Using the full NIST test suite is impractical for this purpose, because it is computationally intensive, and embedded systems (especially real-time systems) often have stringent constraints on the energy and runtime of the programs executed on them. In this paper, we propose novel algorithms to select the most effective subset of the NIST test suite that works within specified runtime and energy budgets. To achieve this, we rank the NIST tests based on multiple metrics, including p-value/Time, p-value/Energy, p-value/Time², and p-value/Energy². Based on the total runtime or energy constraint specified by the user, our algorithms then choose a subset of the NIST tests using this rank order. We call this subset NIST-Lite. Our algorithms also account for the runtime and energy required to generate (on the same platform) the random sequences that the NIST-Lite tests consume. We evaluate the effectiveness of our method against the full NIST test suite (referred to as NIST-Full) and against a greedily chosen subset of the NIST test suite (referred to as NIST-Greedy). We explore different variants of NIST-Lite. On average, using the same input sequences, the p-value obtained for the 4 best variants of NIST-Lite is 2× and 7× better than the p-values of NIST-Full and NIST-Greedy, respectively. NIST-Lite also achieves a 158× (204×) runtime (energy) reduction compared to NIST-Full. Further, we study the performance of NIST-Lite and NIST-Full on deterministic (non-random) input sequences. For such sequences, the pass rate of the NIST-Lite tests is within 16% of the pass rate of NIST-Full on the same sequences, indicating that NIST-Lite has a diagnostic ability similar to that of NIST-Full.
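The selection step reduces to a budgeted ranking problem. A minimal sketch follows (the test profile and budget are illustrative numbers, not measurements from the paper), ranking by p-value per unit time and picking greedily until the budget is spent:

```python
# Sketch of budgeted test selection for one metric (p-value/Time).

tests = [  # (name, mean_p_value, runtime_seconds) -- hypothetical profile
    ("frequency",         0.48, 0.2),
    ("runs",              0.45, 0.4),
    ("fft",               0.47, 3.1),
    ("linear_complexity", 0.46, 9.8),
]

def select_nist_lite(tests, time_budget):
    # Rank by benefit/cost, then take tests while the budget allows.
    ranked = sorted(tests, key=lambda t: t[1] / t[2], reverse=True)
    chosen, used = [], 0.0
    for name, p, cost in ranked:
        if used + cost <= time_budget:
            chosen.append(name)
            used += cost
    return chosen

print(select_nist_lite(tests, time_budget=1.0))  # -> ['frequency', 'runs']
```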
{"title":"NIST-Lite: Randomness Testing of RNGs on an Energy-Constrained Platform","authors":"Cheng-Yen Lee, K. Bharathi, Joellen S. Lansford, S. Khatri","doi":"10.1109/ICCD53106.2021.00019","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00019","url":null,"abstract":"Random Number Generators (RNGs) are an essential part of many embedded applications and are used for security, encryption, and built-in test applications. The output of RNGs can be tested for randomness using the well-known NIST statistical test suite. Embedded applications using True Random Number Generators (TRNGs) need to test the randomness of their TRNGs periodically, because their randomness properties can drift over time. Using the full NIST test suite is unpracticed for this purpose, because the full NIST test suite is computationally intensive, and embedded systems (especially real-time systems) often have stringent constraints on the energy and runtime of the programs that are executed on them. In this paper, we propose novel algorithms to select the most effective subset of the NIST test suite, which works within specified runtime and energy budgets. To achieve this, we rank the NIST tests based on multiple metrics, including p-value/Time, p-value/Energy, p-value/Time2 and p-value/Energy2. Based on the total runtime or energy constraint specified by the user, our algorithms proceed to choose a subset of the NIST tests using this rank order. We call this subset of NIST tests as NIST-Lite. Our algorithms also take into account the runtime and energy required to generate the random sequences required (on the same platform) by the NIST-Lite tests. We evaluate the effectiveness of our method against the full NIST test suite (referred to as NIST-Full) and also against a greedily chosen subset of the NIST test suite (referred to as NIST-Greedy). We explore different variants of NIST-Lite. On average, using the same input sequences, the p-value obtained for the 4 best variants of NIST-Lite is 2× and 7× better than the p-value of NIST-Full and NIST-Greedy respectively. NIST-Lite also achieves 158× (204×) runtime (energy) reduction compared to the NIST-Full. Further, we study the performance of NIST-Lite and NIST-Full for deterministic (non-random) input sequences. For such sequences, the pass rate of the NIST-Lite tests is within 16% of the pass rate of NIST-Full on the same sequences, indicating that our NIST-Lite tests have a similar diagnostic ability as NIST-Full.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127804822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Comprehensive Exploration of the Parallel Prefix Adder Tree Space
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00030
Teodor-Dumitru Ene, J. Stine
Parallel prefix tree adders allow for high-performance computation due to their logarithmic delay. The modern literature focuses on a well-known group of adder tree networks, with existing adder taxonomies unable to adequately describe intermediary structures. Efforts to explore novel structures focus mainly on hybridizing these widely studied networks. This paper presents a method of generating any valid adder tree network using a set of three simple, point-targeted transforms. This method enables possibilities such as the generation and classification of any hybrid or novel architecture, and the incremental refinement of pre-existing structures to better meet performance targets. Synthesis implementation results are presented on the SkyWater 90nm technology.
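For background, every network in this design space computes the same associative prefix scan over (generate, propagate) pairs; the networks differ only in how the scan is structured. The sketch below (illustrative background, not the paper's transform framework) shows the prefix operator and one classic point in the space, a Kogge-Stone scan:

```python
# Functional model of the carry-prefix computation that all adder tree
# networks realize in hardware.

def o(left, right):
    """Associative prefix operator on (generate, propagate) pairs."""
    g_l, p_l = left
    g_r, p_r = right
    return (g_r | (p_r & g_l), p_r & p_l)

def kogge_stone_carries(gp):
    """One classic network: a Kogge-Stone (Hillis-Steele) scan."""
    n = len(gp)
    gp = list(gp)
    dist = 1
    while dist < n:
        gp = [gp[i] if i < dist else o(gp[i - dist], gp[i])
              for i in range(n)]
        dist *= 2
    return [g for g, _ in gp]  # carry-out of each bit position

# 4-bit example: carries of a + b with no carry-in.
a, b = 0b1011, 0b0110
gp = [((a >> i & 1) & (b >> i & 1), (a >> i & 1) ^ (b >> i & 1))
      for i in range(4)]
print(kogge_stone_carries(gp))  # -> [0, 1, 1, 1]
```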
{"title":"A Comprehensive Exploration of the Parallel Prefix Adder Tree Space","authors":"Teodor-Dumitru Ene, J. Stine","doi":"10.1109/ICCD53106.2021.00030","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00030","url":null,"abstract":"Parallel prefix tree adders allow for high- performance computation due to their logarithmic delay. Modern literature focuses on a well-known group of adder tree networks, with adder taxonomies unable to adequately describe intermediary structures. Efforts to explore novel structures focus mainly on the hybridization of these widely-studied networks. This paper presents a method of generating any valid adder tree network by using a set of three, simple, point-targeted transforms. This method allows for possibilities such as the generation and classification of any hybrid or novel architecture, or the incremental refinement of pre-existing structures to better meet performance targets. Synthesis implementation results are presented on the SkyWater 90nm technology.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115846947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploiting Online Locality and Reduction Parallelism for Sampled Dense Matrix Multiplication on GPUs
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00092
Zhongming Yu, Guohao Dai, Guyue Huang, Yu Wang, Huazhong Yang
Sampled Dense-Dense Matrix Multiplication (SDDMM) is a core component of many machine learning systems. SDDMM exposes a substantial amount of parallelism that favors throughput-oriented architectures like the GPU. However, accelerating it on GPUs is challenging in two respects: the poor memory access locality caused by the sparse sampling matrix, and the poor parallelism caused by the dot-product reduction of vectors from the two dense matrices. To address both challenges, we present PRedS, which boosts SDDMM efficiency with a suite of Parallel Reduction Scheduling optimizations. PRedS uses Vectorized Coarsen 1-Dimensional Tiling (VCT) to improve the online locality of loading the dense matrix. PRedS uses Integrated Interleaving Reduction (IIR) to increase thread occupancy in the parallel reduction. PRedS also leverages Warp-Merged Tiling (WMT) to preserve occupancy and parallelism when reducing very long arrays. Enhanced with GPU-intrinsic vectorized memory loading, PRedS achieves a geometric-mean speedup of 29.20× over the vendor library, and up to an 8.31× speedup over state-of-the-art implementations on the SuiteSparse benchmark.
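The kernel itself is compact. A NumPy/SciPy reference sketch follows (a functional model of SDDMM, not PRedS's GPU implementation); the per-nonzero dot product on the commented line is the reduction that VCT/IIR/WMT schedule on the GPU:

```python
# Reference model: for each nonzero S[i, j], scale it by the dot
# product of row A[i, :] and column B[:, j].
import numpy as np
from scipy.sparse import random as sparse_random

def sddmm(S, A, B):
    """out[i, j] = S[i, j] * dot(A[i, :], B[:, j]) on the nonzeros of S."""
    S = S.tocoo()
    vals = np.empty_like(S.data)
    for k, (i, j) in enumerate(zip(S.row, S.col)):
        vals[k] = S.data[k] * (A[i, :] @ B[:, j])  # the scheduled reduction
    return type(S)((vals, (S.row, S.col)), shape=S.shape)

m, n, d = 128, 96, 32
S = sparse_random(m, n, density=0.05, format="coo")
A = np.random.rand(m, d)
B = np.random.rand(d, n)
out = sddmm(S, A, B)
# Dense check against a masked matrix product.
assert np.allclose(out.toarray(), S.toarray() * (A @ B))
```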
{"title":"Exploiting Online Locality and Reduction Parallelism for Sampled Dense Matrix Multiplication on GPUs","authors":"Zhongming Yu, Guohao Dai, Guyue Huang, Yu Wang, Huazhong Yang","doi":"10.1109/ICCD53106.2021.00092","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00092","url":null,"abstract":"Sampled Dense-Dense Matrix Multiplication (SDDMM) is a core component of many machine learning systems. SDDMM exposes a substantial amount of parallelism that favors throughput-oriented architectures like the GPU. However, accelerating it on GPUs is challenging in two aspects: the poor memory access locality caused by the sparse sampling matrix with the poor parallelism caused by the dot-product reduction of vectors in two dense matrices. To address both challenges, we present PRedS to boost SDDMM efficiency with a suite of Parallel Reduction Scheduling optimizations. PRedS uses Vectorized Coarsen 1-Dimensional Tiling (VCT) to benefit the online locality of loading the dense matrix. PRedS uses Integrated Interleaving Reduction (IIR) to increase thread occupancy in the parallel reduction. PRedS also leverages Warp-Merged Tiling (WMT) to preserve occupancy and parallelism when reducing very long arrays. Enhanced with GPU-intrinsic vectorized memory loading, PRedS achieves a geometric speedup of 29.20× compared to the vendor library. PRedS achieves up to 8.31× speedup over state-of-the-art implementations on the SuiteSparse benchmark.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116570257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EFM: Elastic Flash Management to Enhance Performance of Hybrid Flash Memory
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00035
Bingzhe Li, Bo Yuan, D. Du
NAND-based flash memory has become a prevalent storage medium due to its low access latency and high performance. By setting up different incremental step pulse programming (ISPP) values and threshold voltages, the tradeoff between lifetime and access latency in NAND-based flash memory can be exploited. Existing studies that exploit this tradeoff using heuristic algorithms do not consider the access latency that changes dynamically due to wear-out, resulting in low access performance. In this paper, we propose a new Elastic Flash Management scheme, called EFM, to manage data in hybrid flash memory, which consists of multiple physical regions with different read/write latencies according to their ISPP values and threshold voltages. EFM includes a Long-Term Classifier (LT-Classifier) and a Short-Term Classifier (ST-Classifier) to accurately track dynamically changing workloads by considering the current quantitative differences in read/write latencies and workload access patterns. Moreover, reduced-effective-wearing management is proposed to prolong the lifetime of flash memory by scheduling write-intensive workloads to the region with a reduced threshold voltage and the lowest write cost. Experimental results indicate that EFM reduces the average read/write latencies by about 54%–296% and obtains a 17.7% lifetime improvement on average compared to existing studies.
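The placement decision can be phrased as a small cost model. The sketch below (illustrative region parameters, not measured ISPP settings) routes each classified stream to the region minimizing its expected access cost, which naturally sends write-intensive streams to the cheap-write region:

```python
# Toy scheduling policy over hybrid flash regions with different
# per-access latencies.

REGIONS = {  # name -> (read_cost_us, write_cost_us) under its ISPP setting
    "fast_write": (90, 200),   # reduced threshold voltage, cheap writes
    "fast_read":  (50, 600),
    "balanced":   (70, 350),
}

def place(stream_reads: int, stream_writes: int) -> str:
    """Pick the region minimizing this stream's expected access cost."""
    return min(REGIONS, key=lambda r: stream_reads * REGIONS[r][0]
                                      + stream_writes * REGIONS[r][1])

print(place(stream_reads=100, stream_writes=900))  # -> fast_write
print(place(stream_reads=950, stream_writes=50))   # -> fast_read
```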
{"title":"EFM: Elastic Flash Management to Enhance Performance of Hybrid Flash Memory","authors":"Bingzhe Li, Bo Yuan, D. Du","doi":"10.1109/ICCD53106.2021.00035","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00035","url":null,"abstract":"NAND-based flash memory has become a prevalent storage media due to its low access latency and high performance. By setting up different incremental step pulse programming (ISPP) values and threshold voltages, the tradeoffs between lifetime and access latency in NAND-based flash memory can be exploited. The existing studies that exploit the tradeoffs by using heuristic algorithms do not consider the dynamically changed access latency due to wearing-out, resulting in low access performance. In this paper, we proposed a new Elastic Flash Management scheme, called EFM, to manage data in hybrid flash memory, which consists of multiple physical regions with different read/write latencies according to their ISPP values and threshold voltages. EFM includes a Long-Term Classifier (LT-Classifier) and a Short-Term Classifier (ST-Classifier) to accurately track dynamically changed workloads by considering current quantitative differences of read/write latencies and workload access patterns. Moreover, a reduced effective wearing management is proposed to prolong the lifetime of flash memory by scheduling write-intensive workloads to the region with a reduced threshold voltage and the lowest write cost. Experimental results indicate that EFM reduces the average read/write latencies by about 54% - 296% and obtain 17.7% lifetime improvement on average compared to the existing studies.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128703823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CROP: FPGA Implementation of High-Performance Polynomial Multiplication in Saber KEM based on Novel Cyclic-Row Oriented Processing Strategy
Pub Date: 2021-10-01 | DOI: 10.1109/ICCD53106.2021.00031
Jiafeng Xie, Pengzhou He, Chiou-Yng Lee
The rapid advancement of quantum technology has initiated a new round of post-quantum cryptography (PQC) exploration. The key encapsulation mechanism (KEM) Saber is an important module-lattice-based PQC scheme, which has been selected as one of the PQC finalists in the ongoing National Institute of Standards and Technology (NIST) standardization process. However, efficient hardware implementation of KEM Saber has not been well covered in the literature. In this paper, therefore, we propose a novel cyclic-row oriented processing (CROP) strategy for efficient implementation of the key arithmetic operation of KEM Saber, i.e., polynomial multiplication. The proposed work consists of three layers of interdependent effort: (i) first, we formulate the main operation of KEM Saber into the mathematical forms needed to develop the CROP-based algorithms, i.e., a basic version and an advanced higher-speed version; (ii) then, we follow the proposed CROP strategy to transfer the two derived algorithms into the desired polynomial multiplication structures with the help of a series of algorithm-architecture co-implementation techniques; (iii) finally, detailed complexity analysis and implementation results show that the proposed polynomial multiplication structures have better area-time complexities than state-of-the-art solutions. Specifically, field-programmable gate array (FPGA) implementation results show that the proposed design, e.g., the basic version, has at least 11.2% lower area-delay product (ADP) than the best competing one (Cyclone V device). The proposed high-performance polynomial multipliers offer not only efficient delivery of output results but also the low-complexity features brought by the CROP strategy. The outcome of this work is expected to provide useful references for the further development and standardization of KEM Saber.
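The arithmetic being accelerated is multiplication in Saber's ring R_q = Z_q[x]/(x^n + 1) with n = 256 and q = 2^13. A schoolbook functional model follows (a correctness reference only, not the cyclic-row datapath itself); the wrap-around with a sign flip encodes the reduction x^n ≡ -1:

```python
# Functional model of polynomial multiplication in Z_q[x]/(x^N + 1),
# the ring used by Saber (N = 256, q = 2^13).
import random

N, Q = 256, 1 << 13

def poly_mul(a, b):
    """c = a * b mod (x^N + 1, Q); terms past x^(N-1) wrap with a sign flip."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k = i + j
            if k < N:
                c[k] = (c[k] + a[i] * b[j]) % Q
            else:
                c[k - N] = (c[k - N] - a[i] * b[j]) % Q  # negacyclic wrap
    return c

a = [random.randrange(Q) for _ in range(N)]
b = [random.randrange(Q) for _ in range(N)]
c = poly_mul(a, b)
assert len(c) == N and all(0 <= x < Q for x in c)
```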
{"title":"CROP: FPGA Implementation of High-Performance Polynomial Multiplication in Saber KEM based on Novel Cyclic-Row Oriented Processing Strategy","authors":"Jiafeng Xie, Pengzhou He, Chiou-Yng Lee","doi":"10.1109/ICCD53106.2021.00031","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00031","url":null,"abstract":"The rapid advancement in quantum technology has initiated a new round of post-quantum cryptography (PQC) related exploration. The key encapsulation mechanism (KEM) Saber is an important module lattice-based PQC, which has been selected as one of the PQC finalists in the ongoing National Institute of Standards and Technology (NIST) standardization process. On the other hand, however, efficient hardware implementation of KEM Saber has not been well covered in the literature. In this paper, therefore, we propose a novel cyclic-row oriented processing (CROP) strategy for efficient implementation of the key arithmetic operation of KEM Saber, i.e., the polynomial multiplication. The proposed work consists of three layers of interdependent efforts: (i) first of all, we have formulated the main operation of KEM Saber into desired mathematical forms to be further developed into CROP based algorithms, i.e., the basic version and the advanced higher-speed version; (ii) then, we have followed the proposed CROP strategy to innovatively transfer the derived two algorithms into desired polynomial multiplication structures with the help of a series of algorithm-architecture co-implementation techniques; (iii) finally, detailed complexity analysis and implementation results have shown that the proposed polynomial multiplication structures have better area-time complexities than the state-of-the-art solutions. Specifically, the field-programmable gate array (FPGA) implementation results show that the proposed design, e.g., the basic version has at least less 11.2% area-delay product (ADP) than the best competing one (Cyclone V device). The proposed high-performance polynomial multipliers offer not only efficient operation for output results delivery but also possess low-complexity feature brought by CROP strategy. The outcome of this work is expected to provide useful references for further development and standardization process of KEM Saber.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125940752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}