Verification of hardware design code is crucial to the quality assurance of hardware products. As an indispensable part of verification, localizing bugs in hardware design code is significant for hardware development but is widely regarded as a notoriously difficult and time-consuming task. Automated bug localization techniques that can assist manual debugging have therefore attracted much attention in the hardware community. However, existing approaches struggle to achieve both high localization accuracy and easy automation in a single method. Simulation-based methods are fully automated but have limited localization accuracy, slice-based techniques can only narrow bugs down to an approximate range, and spectrum-based techniques only yield a reference value for the likelihood that a statement is buggy. Formula-based bug localization techniques, in turn, suffer from combinatorial explosion when applied automatically to large-scale industrial hardware designs. In this work, we propose Kummel, a Knowledge-augmented mutation-based bug localization approach for hardware design code, to address these limitations. Kummel unites precise bug localization with full automation by augmenting localization knowledge through mutation analysis. To evaluate its effectiveness, we conduct large-scale experiments on 76 versions of 17 hardware projects against seven state-of-the-art bug localization techniques. The experimental results show that Kummel is statistically more effective than the baselines; for example, it improves the seven original methods by 64.48% on average under the RImp metric. It brings fresh insights into hardware bug localization to the community.
{"title":"Knowledge-Augmented Mutation-Based Bug Localization for Hardware Design Code","authors":"Jiang Wu, Zhuo Zhang, Deheng Yang, Jianjun Xu, Jiayu He, Xiaoguang Mao","doi":"10.1145/3660526","DOIUrl":"https://doi.org/10.1145/3660526","url":null,"abstract":"<p>Verification of hardware design code is crucial for the quality assurance of hardware products. Being an indispensable part of verification, localizing bugs in the hardware design code is significant for hardware development but is often regarded as a notoriously difficult and time-consuming task. Thus, automated bug localization techniques that could assist manual debugging have attracted much attention in the hardware community. However, existing approaches are hampered by the challenge of achieving both demanding bug localization accuracy and facile automation in a single method. Simulation-based methods are fully automated but have limited localization accuracy, slice-based techniques can only give an approximate range of the presence of bugs, and spectrum-based techniques can also only yield a reference value for the likelihood that a statement is buggy. Furthermore, formula-based bug localization techniques suffer from the complexity of combinatorial explosion for automated application in industrial large-scale hardware designs. In this work, we propose Kummel, a <underline>K</underline>nowledge-a<underline>u</underline>g<underline>m</underline>ented <underline>m</underline>utation-bas<underline>e</underline>d bug loca<underline>l</underline>ization for hardware design code to address these limitations. Kummel achieves the unity of precise bug localization and full automation by utilizing the knowledge augmentation through mutation analysis. To evaluate the effectiveness of Kummel, we conduct large-scale experiments on 76 versions of 17 hardware projects by seven state-of-the-art bug localization techniques. The experimental results clearly show that Kummel is statistically more effective than baselines, e.g., our approach can improve the seven original methods by 64.48% on average under the RImp metric. It brings fresh insights of hardware bug localization to the community.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"115 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140636851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RDMA (Remote Direct Memory Access) networks require efficient congestion control to maintain their high-throughput, low-latency characteristics. However, congestion control protocols deployed at the software layer suffer from slow response times due to the communication overhead between host hardware and software, which limits their ability to meet the demands of high-speed networks and applications. Harnessing the capabilities of rapidly advancing Network Interface Cards (NICs) can drive progress in congestion control, and some simple congestion control protocols have already been offloaded to RDMA NICs to enable faster detection and handling of congestion. However, offloading congestion control to the RDMA NIC faces a significant challenge: integrating the RDMA transport protocol with advanced congestion control protocols that involve complex mechanisms. We observe that reservation-based proactive congestion control protocols share strong similarities with RDMA transport protocols, allowing them to integrate seamlessly and combine the functionalities of the transport layer and the network layer. In this paper, we present COER, an RDMA NIC architecture that leverages the functional components of RDMA to perform reservations and completes congestion control scheduling within the scheduling process of the RDMA protocol. COER streamlines the development of offload strategies for congestion control techniques, particularly proactive congestion control, on RDMA NICs. We use COER to design offloading schemes for eleven congestion control protocols, which we implement and evaluate using a network emulator with a cycle-accurate RDMA NIC model that can load MPI programs. The evaluation demonstrates that the COER architecture does not compromise the original characteristics of the congestion control protocols. Compared to a layered protocol stack approach, COER enables the performance of RDMA networks to reach new heights.
{"title":"COER: A Network Interface Offloading Architecture for RDMA and Congestion Control Protocol Codesign","authors":"Ke Wu, Dezun Dong, Weixia Xu","doi":"10.1145/3660525","DOIUrl":"https://doi.org/10.1145/3660525","url":null,"abstract":"<p>RDMA (Remote Direct Memory Access) networks require efficient congestion control to maintain their high throughput and low latency characteristics. However, congestion control protocols deployed at the software layer suffer from slow response times due to the communication overhead between host hardware and software. This limitation has hindered their ability to meet the demands of high-speed networks and applications. Harnessing the capabilities of rapidly advancing Network Interface Card (NIC) can drive progress in congestion control. Some simple congestion control protocols have been offloaded to RDMA NIC to enable faster detection and processing of congestion. However, offloading congestion control to the RDMA NIC faces a significant challenge in integrating the RDMA transport protocol with advanced congestion control protocols that involve complex mechanisms. We have observed that reservation-based proactive congestion control protocols share strong similarities with RDMA transport protocols, allowing them to integrate seamlessly and combine the functionalities of the transport layer and network layer. In this paper, we present COER, an RDMA NIC architecture that leverages the functional components of RDMA to perform reservations and completes the scheduling of congestion control during the scheduling process of the RDMA protocol. COER facilitates the streamlined development of offload strategies for congestion control techniques, specifically proactive congestion control, on RDMA NIC. We use COER to design offloading schemes for eleven congestion control protocols, which we implement and evaluate using a network emulator with a cycle-accurate RDMA NIC model that can load MPI programs. The evaluation results demonstrate that the architecture of COER does not compromise the original characteristics of the congestion control protocols. Compared to a layered protocol stack approach, COER enables the performance of RDMA networks to reach new heights.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"20 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140636735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The increasing demand for computing power and the emergence of heterogeneous computing architectures have driven the exploration of innovative techniques to address current limitations in both the compute and memory subsystems. One such solution is the use of Accelerated Processing Units (APUs), processors that incorporate both a central processing unit (CPU) and an integrated graphics processing unit (iGPU).
However, the performance of both APU and CPU systems can be significantly hampered by address translation overhead, leading to a decline in overall performance, especially for cache-resident workloads. To address this issue, we propose the introduction of a new intermediate address space (IAS) in both APU and CPU systems. IAS serves as a bridge between virtual address (VA) spaces and physical address (PA) spaces, optimizing the address translation process. In the case of APU systems, our research indicates that the iGPU suffers from significant translation look-aside buffer (TLB) misses in certain workload situations. Using an IAS, we can divide the initial address translation into front- and back-end phases, effectively shifting the bottleneck in address translation from the cache side to the memory controller side, a technique that proves to be effective for cache-resident workloads. Our simulations demonstrate that implementing IAS in the CPU system can boost performance by up to 40% compared to conventional CPU systems. Furthermore, we evaluate the effectiveness of APU systems, comparing the performance of IAS-based systems with traditional systems, showing up to a 185% improvement in APU system performance with our proposed IAS implementation.
In addition, our analysis indicates that over 90% of TLB misses can be filtered by the cache, and employing a larger cache within the system could yield even greater improvements. The proposed IAS offers a promising and practical solution to enhance the performance of both APU and CPU systems, contributing to state-of-the-art research in computer architecture.
{"title":"Intermediate Address Space: virtual memory optimization of heterogeneous architectures for cache-resident workloads","authors":"Qunyou Liu, Darong Huang, Luis Costero, Marina Zapater, David Atienza","doi":"10.1145/3659207","DOIUrl":"https://doi.org/10.1145/3659207","url":null,"abstract":"<p>The increasing demand for computing power and the emergence of heterogeneous computing architectures have driven the exploration of innovative techniques to address current limitations in both the compute and memory subsystems. One such solution is the use of <i>Accelerated Processing Units</i> (APUs), processors that incorporate both a <i>central processing unit</i> (CPU) and an <i>integrated graphics processing unit</i> (iGPU). </p><p>However, the performance of both APU and CPU systems can be significantly hampered by address translation overhead, leading to a decline in overall performance, especially for cache-resident workloads. To address this issue, we propose the introduction of a new <i>intermediate address space</i> (IAS) in both APU and CPU systems. IAS serves as a bridge between <i>virtual address</i> (VA) spaces and <i>physical address</i> (PA) spaces, optimizing the address translation process. In the case of APU systems, our research indicates that the iGPU suffers from significant <i>translation look-aside buffer</i> (TLB) misses in certain workload situations. Using an IAS, we can divide the initial address translation into front- and back-end phases, effectively shifting the bottleneck in address translation from the cache side to the memory controller side, a technique that proves to be effective for cache-resident workloads. Our simulations demonstrate that implementing IAS in the CPU system can boost performance by up to 40% compared to conventional CPU systems. Furthermore, we evaluate the effectiveness of APU systems, comparing the performance of IAS-based systems with traditional systems, showing up to a 185% improvement in APU system performance with our proposed IAS implementation. </p><p>Furthermore, our analysis indicates that over 90% of TLB misses can be filtered by the cache, and employing a larger cache within the system could potentially result in even greater improvements. The proposed IAS offers a promising and practical solution to enhance the performance of both APU and CPU systems, contributing to state-of-the-art research in the field of computer architecture.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"9 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140625640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ReRAM-based Processing-In-Memory (PIM) architectures have been increasingly explored to accelerate various Deep Neural Network (DNN) applications because they can achieve extremely high performance and energy efficiency for in-situ analog Matrix-Vector Multiplication (MVM) operations. However, since the analog-to-digital converters (ADCs) in ReRAM crossbar arrays' peripheral circuits often feature high latency and low area efficiency, AD conversion has become a performance bottleneck of in-situ analog MVMs. Moreover, since each crossbar array is tightly coupled with very few ADCs in current ReRAM-based PIM architectures, the scarce ADC resource is often underutilized.
In this paper, we propose ReHarvest, an ADC-crossbar decoupled architecture that improves the utilization of ADC resources. In particular, we design a many-to-many mapping structure between crossbars and ADCs so that all ADCs in a tile are shared as a resource pool, and thus one crossbar array can harvest many more ADCs to parallelize the AD conversion for each MVM operation. Moreover, we propose a multi-tile matrix mapping (MTMM) scheme to further improve ADC utilization across multiple tiles by enhancing data parallelism. To support fine-grained data dispatching for the MTMM, we also design a bus-based interconnection network to multicast input vectors among multiple tiles, eliminating data redundancy and potential network congestion during multicasting. Extensive experimental results show that ReHarvest improves ADC utilization by 3.2×, and achieves a 3.5× performance speedup while reducing ReRAM resource consumption by 3.1× on average compared with the state-of-the-art PIM architecture, FORMS.
{"title":"ReHarvest: an ADC Resource-Harvesting Crossbar Architecture for ReRAM-Based DNN Accelerators","authors":"Jiahong Xu, Haikun Liu, Zhuohui Duan, Xiaofei Liao, Hai Jin, Xiaokang Yang, Huize Li, Cong Liu, Fubing Mao, Yu Zhang","doi":"10.1145/3659208","DOIUrl":"https://doi.org/10.1145/3659208","url":null,"abstract":"<p>ReRAM-based <i>Processing-In-Memory</i> (PIM) architectures have been increasingly explored to accelerate various <i>Deep Neural Network</i> (DNN) applications because they can achieve extremely high performance and energy-efficiency for in-situ analog <i>Matrix-Vector Multiplication</i> (MVM) operations. However, since ReRAM crossbar arrays’ peripheral circuits–<i>analog-to-digital converters</i> (ADCs) often feature high latency and low area efficiency, AD conversion has become a performance bottleneck of in-situ analog MVMs. Moreover, since each crossbar array is tightly coupled with very limited ADCs in current ReRAM-based PIM architectures, the scarce ADC resource is often underutilized. </p><p>In this paper, we propose ReHarvest, an ADC-crossbar decoupled architecture to improve the utilization of ADC resource. Particularly, we design a many-to-many mapping structure between crossbars and ADCs to share all ADCs in a tile as a resource pool, and thus one crossbar array can harvest much more ADCs to parallelize the AD conversion for each MVM operation. Moreover, we propose a <i>multi-tile matrix mapping</i> (MTMM) scheme to further improve the ADC utilization across multiple tiles by enhancing data parallelism. To support fine-grained data dispatching for the MTMM, we also design a bus-based interconnection network to multicast input vectors among multiple tiles, and thus eliminate data redundancy and potential network congestion during multicasting. Extensive experimental results show that ReHarvest can improve the ADC utilization by 3.2 ×, and achieve 3.5 × performance speedup while reducing the ReRAM resource consumption by 3.1 × on average compared with the state-of-the-art PIM architecture–FORMS.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"54 5 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140612507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hash-based signatures (HBS) are the most conservative and time-consuming among the many post-quantum cryptography (PQC) algorithms. Two HBS schemes, LMS and XMSS, are currently the only PQC algorithms standardised by the National Institute of Standards and Technology (NIST). Existing HBS schemes are designed around serial Merkle tree traversal, which prevents them from taking full advantage of the computing power of parallel architectures such as CPUs and GPUs. We propose a parallel Merkle tree traversal (PMTT), which we evaluate by implementing LMS on the GPU. This is the first work to accelerate LMS on the GPU, and it performs well even with over 10,000 cores. Considering the different scenarios of algorithmic parallelism and data parallelism, we implement corresponding variants of PMTT. The algorithmic-parallelism variant mainly targets the execution efficiency of a single task, while the data-parallelism variant aims at fully utilising GPU performance. In addition, we are the first to design a CPU-GPU collaborative processing solution for traversal algorithms to reduce the communication overhead between CPU and GPU. For algorithmic parallelism, our implementation is still 4.48× faster than the ideal time of the state-of-the-art traversal algorithm. For data parallelism, when the number of cores increases from 1 to 8192, the parallel efficiency is 78.39%. Overall, our LMS implementation outperforms most existing LMS and XMSS implementations.
{"title":"An Example of Parallel Merkle Tree Traversal: Post-Quantum Leighton-Micali Signature on the GPU","authors":"Ziheng Wang, Xiaoshe Dong, Yan Kang, Heng Chen, Qiang Wang","doi":"10.1145/3659209","DOIUrl":"https://doi.org/10.1145/3659209","url":null,"abstract":"<p>The hash-based signature (HBS) is the most conservative and time-consuming among many post-quantum cryptography (PQC) algorithms. Two HBSs, LMS and XMSS, are the only PQC algorithms standardised by the National Institute of Standards and Technology (NIST) now. Existing HBSs are designed based on serial Merkle tree traversal, which is not conducive to taking full advantage of the computing power of parallel architectures such as CPUs and GPUs. We propose a parallel Merkle tree traversal (PMTT), which is tested by implementing LMS on the GPU. This is the first work accelerating LMS on the GPU, which performs well even with over 10,000 cores. Considering different scenarios of algorithmic parallelism and data parallelism, we implement corresponding variants for PMTT. The design of PMTT for algorithmic parallelism mainly considers the execution efficiency of a single task, while that for data parallelism starts with the full utilisation of GPU performance. In addition, we are the first to design a CPU-GPU collaborative processing solution for traversal algorithms to reduce the communication overhead between CPU and GPU. For algorithmic parallelism, our implementation is still 4.48 × faster than the ideal time of the state-of-the-art traversal algorithm. For data parallelism, when the number of cores increases from 1 to 8192, the parallel efficiency is 78.39%. In comparison, our LMS implementation outperforms most existing LMS and XMSS implementations.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"67 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140588978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LSM-based key-value stores suffer from sub-optimal performance due to their slow and heavy background compactions, which impose severe CPU and network overhead on high-speed disaggregated storage. This paper further reveals that data-intensive compression in compaction consumes a significant portion of CPU power, and that multi-threaded compactions cause substantial CPU contention and network traffic during high-load periods. Based on these observations, we propose fine-grained dynamic compaction offloading that leverages the modern Data Processing Unit (DPU) to alleviate the CPU and network overhead. To achieve this, we first customize a file system to enable efficient data access from the DPU. We then leverage the Arm cores on the DPU to absorb burst CPU and network demands, reducing resource contention and data movement. We further employ dedicated hardware accelerators on the DPU to speed up the compression in compactions. We integrate our DPU-offloaded compaction with RocksDB and evaluate it with NVIDIA's latest BlueField-2 DPU on a real system. The evaluation shows that the DPU is an effective solution for removing the CPU bottleneck and reducing the data traffic of compaction. Compared with a fine-tuned CPU-only baseline, compaction performance is accelerated by 2.86 to 4.03 times, system write and read throughput is improved by up to 3.2 times and 1.4 times respectively, and host CPU contention and network traffic are effectively reduced.
{"title":"D2Comp: Efficient Offload of LSM-tree Compaction with Data Processing Units on Disaggregated Storage","authors":"Chen Ding, Jian Zhou, Kai Lu, Sicen Li, Yiqin Xiong, Jiguang Wan, Ling Zhan","doi":"10.1145/3656584","DOIUrl":"https://doi.org/10.1145/3656584","url":null,"abstract":"<p>LSM-based key-value stores suffer from sub-optimal performance due to their slow and heavy background compactions. The compaction brings severe CPU and network overhead on high-speed disaggregated storage. This paper further reveals that data-intensive compression in compaction consumes a significant portion of CPU power. Moreover, the multi-threaded compactions cause substantial CPU contention and network traffic during high-load periods. Based on the above observations, we propose fine-grained dynamical compaction offloading by leveraging the modern Data Processing Unit (DPU) to alleviate the CPU and network overhead. To achieve this, we first customized a file system to enable efficient data access for DPU. We then leverage the Arm cores on the DPU to meet the burst CPU and network requirements to reduce resource contention and data movement. We further employ dedicated hardware-based accelerators on the DPU to speed up the compression in compactions. We integrate our DPU-offloaded compaction with RocksDB and evaluate it with NVIDIA’s latest Bluefield-2 DPU on a real system. The evaluation shows that the DPU is an effective solution to solve the CPU bottleneck and reduce data traffic of compaction. The results show that compaction performance is accelerated by 2.86 to 4.03 times, system write and read throughput is improved by up to 3.2 times and 1.4 times respectively, and host CPU contention and network traffic are effectively reduced compared to the fine-tuned CPU-only baseline.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"16 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140589307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper proposes iSwap, a new memory page swap mechanism that reduces ineffective I/O swap operations and improves QoS for high-priority applications in cloud environments. iSwap works in the OS kernel. It accurately learns the reuse patterns of memory pages and makes swap decisions accordingly to avoid ineffective operations. When memory pressure is high, iSwap compresses pages that belong to latency-critical (LC) applications (or high-priority applications) and keeps them in main memory, avoiding I/O operations for these LC applications to ensure QoS, while evicting low-priority applications' pages out of main memory. iSwap has low overhead and works well for cloud applications with large memory footprints. We evaluate iSwap on Intel x86 and ARM platforms. The experimental results show that, compared with the latest LRU-based approach widely used in modern OSes, iSwap significantly reduces ineffective swap operations (by 8.0% - 19.2%) and improves QoS for LC applications (by 36.8% - 91.3%) when memory pressure is high.
{"title":"iSwap: A New Memory Page Swap Mechanism for Reducing Ineffective I/O Operations in Cloud Environments","authors":"Zhuohao Wang, Lei Liu, Limin Xiao","doi":"10.1145/3653302","DOIUrl":"https://doi.org/10.1145/3653302","url":null,"abstract":"<p>This paper proposes iSwap, a new memory page swap mechanism that reduces the ineffective I/O swap operations and improves the QoS for applications with a high priority in the cloud environments. iSwap works in the OS kernel. iSwap accurately learns the reuse patterns for memory pages and makes the swap decisions accordingly to avoid ineffective operations. In the cases where memory pressure is high, iSwap compresses pages that belong to the latency-critical (LC) applications (or high-priority applications) and keeps them in main memory, avoiding I/O operations for these LC applications to ensure QoS; and iSwap evicts low-priority applications’ pages out of main memory. iSwap has a low overhead and works well for cloud applications with large memory footprints. We evaluate iSwap on Intel x86 and ARM platforms. The experimental results show that iSwap can significantly reduce ineffective swap operations (8.0% - 19.2%) and improve the QoS for LC applications (36.8% - 91.3%) in cases where memory pressure is high, compared with the latest LRU-based approach widely used in modern OSes.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"30 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140203908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Systolic array architectures have significantly accelerated deep neural networks (DNNs). A systolic array comprises multiple processing elements (PEs) that perform multiply-accumulate (MAC) operations. Traditionally, a systolic array executes, in each cycle, an amount of tensor data that matches the array's size. However, the hyper-parameters of DNN models differ across layers and result in different tensor sizes in each layer. Mapping these irregular tensors onto the systolic array while fully utilizing all of its PEs is challenging. Furthermore, modern DNN systolic accelerators typically employ a single dataflow, which is not optimal for every DNN model.
This work proposes ReSA, a reconfigurable dataflow architecture that aims to minimize the execution time of a DNN model by mapping tiny tensors onto a spatially partitioned systolic array. Unlike conventional systolic array architectures, the ReSA data-path controller enables input-, weight-, and output-stationary dataflows to execute on the PEs. ReSA also decomposes the coarse-grained systolic array into multiple small ones to reduce fragmentation in tensor mapping. Each small systolic sub-array unit relies on our data arbiter to dispatch tensors to the other units through a simple interconnection network. Furthermore, ReSA reorders memory accesses to overlap the memory-load and execution stages, hiding memory latency when tackling tiny tensors. Finally, ReSA splits the tensors of each layer into multiple small ones and searches for the best dataflow for each tensor on the host side. ReSA then encodes the selected dataflow in our proposed instruction to notify the systolic array to switch dataflows correctly. As a result, our optimizations on the systolic array architecture achieve a geometric-mean speedup of 1.87× over the weight-stationary systolic array architecture across nine different DNN models.
{"title":"ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors","authors":"Ching-Jui Lee, Tsung Tai Yeh","doi":"10.1145/3653363","DOIUrl":"https://doi.org/10.1145/3653363","url":null,"abstract":"<p>Systolic array architecture has significantly accelerated deep neural networks (DNNs). A systolic array comprises multiple processing elements (PEs) that can perform multiply-accumulate (MAC). Traditionally, the systolic array can execute a certain amount of tensor data that matches the size of the systolic array simultaneously at each cycle. However, hyper-parameters of DNN models differ across each layer and result in various tensor sizes in each layer. Mapping these irregular tensors to the systolic array while fully utilizing the entire PEs in a systolic array is challenging. Furthermore, modern DNN systolic accelerators typically employ a single dataflow. However, such a dataflow isn’t optimal for every DNN model. </p><p>This work proposes ReSA, a reconfigurable dataflow architecture that aims to minimize the execution time of a DNN model by mapping tiny tensors on the spatially partitioned systolic array. Unlike conventional systolic array architectures, the ReSA data path controller enables the execution of the input, weight, and output-stationary dataflow on PEs. ReSA also decomposes the coarse-grain systolic array into multiple small ones to reduce the fragmentation issue on the tensor mapping. Each small systolic sub-array unit relies on our data arbiter to dispatch tensors to each other through the simple interconnected network. Furthermore, ReSA reorders the memory access to overlap the memory load and execution stages to hide the memory latency when tackling tiny tensors. Finally, ReSA splits tensors of each layer into multiple small ones and searches for the best dataflow for each tensor on the host side. Then, ReSA encodes the predefined dataflow in our proposed instruction to notify the systolic array to switch the dataflow correctly. As a result, our optimization on the systolic array architecture achieves a geometric mean speedup of 1.87X over the weight-stationary systolic array architecture across 9 different DNN models.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"19 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140203911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graphics Processing Units (GPUs) are the accelerator of choice in a variety of application domains because they can accelerate massively parallel workloads and can be easily programmed using general-purpose frameworks such as CUDA and OpenCL. Each Streaming Multiprocessor (SM) contains an L1 data cache (L1D) to exploit locality in data accesses. L1D misses are costly for GPUs for two reasons. First, they consume substantial energy, since they must access the L2 cache (L2) via an on-chip network and, on an L2 miss, the off-chip DRAM. Second, they impose a performance overhead if the GPU does not have enough active warps to hide the long memory access latency. We observe that threads running on different SMs share 55% of the data they read from memory. Unfortunately, because the L1Ds are in the non-coherent memory domain, each SM independently fetches data from the L2 or the off-chip memory into its L1D, even though the data may currently be available in the L1D of another SM. Our goal is to service L1D read misses via other SMs as much as possible, to cut down costly accesses to the L2 or the off-chip DRAM. To this end, we propose a new data-sharing mechanism called Cross-Core Data Sharing (CCDS). CCDS employs a predictor to estimate whether the required cache block exists in another SM. If the block is predicted to exist in another SM's L1D, CCDS fetches the data from the L1D that contains the block. Our experiments on a suite of 26 workloads show that CCDS improves average energy and performance by 1.30× and 1.20×, respectively, compared to the baseline GPU, and by 1.37× and 1.11×, respectively, compared to the state-of-the-art data-sharing mechanism.
{"title":"Cross-Core Data Sharing for Energy-Efficient GPUs","authors":"Hajar Falahati, Mohammad Sadrosadati, Qiumin Xu, Juan Gómez-Luna, Banafsheh Saber Latibari, Hyeran Jeon, Shaahin Hesaabi, Hamid Sarbazi-Azad, Onur Mutlu, Murali Annavaram, Masoud Pedram","doi":"10.1145/3653019","DOIUrl":"https://doi.org/10.1145/3653019","url":null,"abstract":"<p>Graphics Processing Units (GPUs) are the accelerator of choice in a variety of application domains because they can accelerate massively parallel workloads and can be easily programmed using general-purpose programming frameworks such as CUDA and OpenCL. Each Streaming Multiprocessor (SM) contains an L1 data cache (L1D) to exploit the locality in data accesses. L1D misses are costly for GPUs due to two reasons. First, L1D misses consume a lot of energy as they need to access the L2 cache (L2) via an on-chip network and the off-chip DRAM in case of L2 misses. Second, L1D misses impose performance overhead if the GPU does not have enough active warps to hide the long memory access latency. We observe that threads running on different SMs share 55% of the data they read from the memory. Unfortunately, as the L1Ds are in the non-coherent memory domain, each SM independently fetches data from the L2 or the off-chip memory into its L1D, even though the data may be currently available in the L1D of another SM. Our goal is to service L1D read misses via other SMs, as much as possible, to cut down costly accesses to the L2 or the off-chip DRAM. To this end, we propose a new data sharing mechanism, called <i>Cross-Core Data Sharing (CCDS)</i>. <i>CCDS</i> employs a predictor to estimate whether or not the required cache block exists in another SM. If the block is predicted to exist in another SM’s L1D, <i>CCDS</i> fetches the data from the L1D that contains the block. Our experiments on a suite of 26 workloads show that <i>CCDS</i> improves average energy and performance by 1.30 × and 1.20 ×, respectively, compared to the baseline GPU. Compared to the state-of-the-art data-sharing mechanism, <i>CCDS</i> improves average energy and performance by 1.37 × and 1.11 ×, respectively.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"23 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140156603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sparse matrix-vector multiplication (SpMV) is one of the most widely used kernels in high-performance computing as well as in machine learning acceleration for sparse neural networks. The design space of SpMV accelerators has two axes: algorithm and matrix representation. Two algorithms, scalar multiplication and dot product, can be combined with two sparse data representations, compressed sparse and bitmap formats, for the matrix and vector. Although prior accelerators each adopted one of the possible designs, it has not been investigated which design is the best across different hardware resources and workload characteristics. This paper first investigates the impact of design choices with respect to the algorithm and data representation. Our evaluation shows that no single design always outperforms the others across different workloads, but the two best designs (i.e., the compressed sparse format and the bitmap format with dot product) have complementary performance with trade-offs incurred by the matrix characteristics. Based on this analysis, this study proposes Cerberus, a triple-mode accelerator supporting two sparse operation modes in addition to the base dense mode. To allow such multi-mode operation, it uses a prediction model based on matrix characteristics under a given hardware configuration, which statically selects the best mode for a given sparse matrix using its dimension and density information. Our experimental results show that Cerberus provides a 12.1× performance improvement over a dense-only accelerator, and a 1.5× improvement over a fixed best SpMV design.
{"title":"Cerberus: Triple Mode Acceleration of Sparse Matrix and Vector Multiplication","authors":"Soojin Hwang, Daehyeon Baek, Jongse Park, Jaehyuk Huh","doi":"10.1145/3653020","DOIUrl":"https://doi.org/10.1145/3653020","url":null,"abstract":"<p>The multiplication of sparse matrix and vector (SpMV) is one of the most widely used kernels in high-performance computing as well as machine learning acceleration for sparse neural networks. The design space of SpMV accelerators has two axes: algorithm and matrix representation. There have been two widely used algorithms and data representations. Two algorithms, scalar multiplication and dot product, can be combined with two sparse data representations, compressed sparse and bitmap formats for the matrix and vector. Although the prior accelerators adopted one of the possible designs, it is yet to be investigated which design is the best one across different hardware resources and workload characteristics. This paper first investigates the impact of design choices with respect to the algorithm and data representation. Our evaluation shows that no single design always outperforms the others across different workloads, but the two best designs (i.e. compressed sparse format and bitmap format with dot product) have complementary performance with trade-offs incurred by the matrix characteristics. Based on the analysis, this study proposes Cerberus, a triple-mode accelerator supporting two sparse operation modes in addition to the base dense mode. To allow such multi-mode operation, it proposes a prediction model based on matrix characteristics under a given hardware configuration, which statically selects the best mode for a given sparse matrix with its dimension and density information. Our experimental results show that Cerberus provides 12.1 × performance improvements from a dense-only accelerator, and 1.5 × improvements from a fixed best SpMV design.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"28 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140151823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}