Verification of hardware design code is crucial to the quality assurance of hardware products. As an indispensable part of verification, localizing bugs in hardware design code is significant for hardware development but is widely regarded as a notoriously difficult and time-consuming task. Automated bug localization techniques that can assist manual debugging have therefore attracted much attention in the hardware community. However, existing approaches struggle to achieve both high localization accuracy and easy automation in a single method. Simulation-based methods are fully automated but have limited localization accuracy, slice-based techniques can only narrow bugs down to an approximate range, and spectrum-based techniques only yield a reference value for the likelihood that a statement is buggy. Formula-based bug localization techniques, in turn, suffer from combinatorial explosion when applied automatically to large-scale industrial hardware designs. In this work, we propose Kummel, a Knowledge-augmented mutation-based bug localization approach for hardware design code, to address these limitations. Kummel unites precise bug localization with full automation by augmenting localization knowledge through mutation analysis. To evaluate its effectiveness, we conduct large-scale experiments on 76 versions of 17 hardware projects against seven state-of-the-art bug localization techniques. The experimental results show that Kummel is statistically more effective than the baselines; for example, it improves the seven original methods by 64.48% on average under the RImp metric. It brings fresh insights into hardware bug localization to the community.
{"title":"Knowledge-Augmented Mutation-Based Bug Localization for Hardware Design Code","authors":"Jiang Wu, Zhuo Zhang, Deheng Yang, Jianjun Xu, Jiayu He, Xiaoguang Mao","doi":"10.1145/3660526","DOIUrl":"https://doi.org/10.1145/3660526","url":null,"abstract":"<p>Verification of hardware design code is crucial for the quality assurance of hardware products. Being an indispensable part of verification, localizing bugs in the hardware design code is significant for hardware development but is often regarded as a notoriously difficult and time-consuming task. Thus, automated bug localization techniques that could assist manual debugging have attracted much attention in the hardware community. However, existing approaches are hampered by the challenge of achieving both demanding bug localization accuracy and facile automation in a single method. Simulation-based methods are fully automated but have limited localization accuracy, slice-based techniques can only give an approximate range of the presence of bugs, and spectrum-based techniques can also only yield a reference value for the likelihood that a statement is buggy. Furthermore, formula-based bug localization techniques suffer from the complexity of combinatorial explosion for automated application in industrial large-scale hardware designs. In this work, we propose Kummel, a <underline>K</underline>nowledge-a<underline>u</underline>g<underline>m</underline>ented <underline>m</underline>utation-bas<underline>e</underline>d bug loca<underline>l</underline>ization for hardware design code to address these limitations. Kummel achieves the unity of precise bug localization and full automation by utilizing the knowledge augmentation through mutation analysis. To evaluate the effectiveness of Kummel, we conduct large-scale experiments on 76 versions of 17 hardware projects by seven state-of-the-art bug localization techniques. The experimental results clearly show that Kummel is statistically more effective than baselines, e.g., our approach can improve the seven original methods by 64.48% on average under the RImp metric. It brings fresh insights of hardware bug localization to the community.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"115 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140636851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RDMA (Remote Direct Memory Access) networks require efficient congestion control to maintain their high-throughput, low-latency characteristics. However, congestion control protocols deployed at the software layer suffer from slow response times due to the communication overhead between host hardware and software, which limits their ability to meet the demands of high-speed networks and applications. Harnessing the capabilities of rapidly advancing Network Interface Cards (NICs) can drive progress in congestion control, and some simple congestion control protocols have already been offloaded to RDMA NICs to enable faster detection and handling of congestion. However, offloading congestion control to the RDMA NIC faces a significant challenge: integrating the RDMA transport protocol with advanced congestion control protocols that involve complex mechanisms. We observe that reservation-based proactive congestion control protocols share strong similarities with RDMA transport protocols, allowing them to integrate seamlessly and combine the functionalities of the transport layer and the network layer. In this paper, we present COER, an RDMA NIC architecture that leverages the functional components of RDMA to perform reservations and completes congestion control scheduling within the scheduling process of the RDMA protocol. COER streamlines the development of offload strategies for congestion control techniques, particularly proactive congestion control, on RDMA NICs. We use COER to design offloading schemes for eleven congestion control protocols, which we implement and evaluate using a network emulator with a cycle-accurate RDMA NIC model that can load MPI programs. The evaluation demonstrates that the COER architecture does not compromise the original characteristics of the congestion control protocols. Compared to a layered protocol stack approach, COER enables the performance of RDMA networks to reach new heights.
{"title":"COER: A Network Interface Offloading Architecture for RDMA and Congestion Control Protocol Codesign","authors":"Ke Wu, Dezun Dong, Weixia Xu","doi":"10.1145/3660525","DOIUrl":"https://doi.org/10.1145/3660525","url":null,"abstract":"<p>RDMA (Remote Direct Memory Access) networks require efficient congestion control to maintain their high throughput and low latency characteristics. However, congestion control protocols deployed at the software layer suffer from slow response times due to the communication overhead between host hardware and software. This limitation has hindered their ability to meet the demands of high-speed networks and applications. Harnessing the capabilities of rapidly advancing Network Interface Card (NIC) can drive progress in congestion control. Some simple congestion control protocols have been offloaded to RDMA NIC to enable faster detection and processing of congestion. However, offloading congestion control to the RDMA NIC faces a significant challenge in integrating the RDMA transport protocol with advanced congestion control protocols that involve complex mechanisms. We have observed that reservation-based proactive congestion control protocols share strong similarities with RDMA transport protocols, allowing them to integrate seamlessly and combine the functionalities of the transport layer and network layer. In this paper, we present COER, an RDMA NIC architecture that leverages the functional components of RDMA to perform reservations and completes the scheduling of congestion control during the scheduling process of the RDMA protocol. COER facilitates the streamlined development of offload strategies for congestion control techniques, specifically proactive congestion control, on RDMA NIC. We use COER to design offloading schemes for eleven congestion control protocols, which we implement and evaluate using a network emulator with a cycle-accurate RDMA NIC model that can load MPI programs. The evaluation results demonstrate that the architecture of COER does not compromise the original characteristics of the congestion control protocols. Compared to a layered protocol stack approach, COER enables the performance of RDMA networks to reach new heights.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"20 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140636735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The increasing demand for computing power and the emergence of heterogeneous computing architectures have driven the exploration of innovative techniques to address current limitations in both the compute and memory subsystems. One such solution is the use of Accelerated Processing Units (APUs), processors that incorporate both a central processing unit (CPU) and an integrated graphics processing unit (iGPU).
However, the performance of both APU and CPU systems can be significantly hampered by address translation overhead, leading to a decline in overall performance, especially for cache-resident workloads. To address this issue, we propose the introduction of a new intermediate address space (IAS) in both APU and CPU systems. IAS serves as a bridge between virtual address (VA) spaces and physical address (PA) spaces, optimizing the address translation process. In the case of APU systems, our research indicates that the iGPU suffers from significant translation look-aside buffer (TLB) misses in certain workload situations. Using an IAS, we can divide the initial address translation into front- and back-end phases, effectively shifting the bottleneck in address translation from the cache side to the memory controller side, a technique that proves to be effective for cache-resident workloads. Our simulations demonstrate that implementing IAS in the CPU system can boost performance by up to 40% compared to conventional CPU systems. Furthermore, we evaluate the effectiveness of APU systems, comparing the performance of IAS-based systems with traditional systems, showing up to a 185% improvement in APU system performance with our proposed IAS implementation.
In addition, our analysis indicates that over 90% of TLB misses can be filtered by the cache, and employing a larger cache within the system could yield even greater improvements. The proposed IAS offers a promising and practical solution to enhance the performance of both APU and CPU systems, contributing to state-of-the-art research in computer architecture.
{"title":"Intermediate Address Space: virtual memory optimization of heterogeneous architectures for cache-resident workloads","authors":"Qunyou Liu, Darong Huang, Luis Costero, Marina Zapater, David Atienza","doi":"10.1145/3659207","DOIUrl":"https://doi.org/10.1145/3659207","url":null,"abstract":"<p>The increasing demand for computing power and the emergence of heterogeneous computing architectures have driven the exploration of innovative techniques to address current limitations in both the compute and memory subsystems. One such solution is the use of <i>Accelerated Processing Units</i> (APUs), processors that incorporate both a <i>central processing unit</i> (CPU) and an <i>integrated graphics processing unit</i> (iGPU). </p><p>However, the performance of both APU and CPU systems can be significantly hampered by address translation overhead, leading to a decline in overall performance, especially for cache-resident workloads. To address this issue, we propose the introduction of a new <i>intermediate address space</i> (IAS) in both APU and CPU systems. IAS serves as a bridge between <i>virtual address</i> (VA) spaces and <i>physical address</i> (PA) spaces, optimizing the address translation process. In the case of APU systems, our research indicates that the iGPU suffers from significant <i>translation look-aside buffer</i> (TLB) misses in certain workload situations. Using an IAS, we can divide the initial address translation into front- and back-end phases, effectively shifting the bottleneck in address translation from the cache side to the memory controller side, a technique that proves to be effective for cache-resident workloads. Our simulations demonstrate that implementing IAS in the CPU system can boost performance by up to 40% compared to conventional CPU systems. Furthermore, we evaluate the effectiveness of APU systems, comparing the performance of IAS-based systems with traditional systems, showing up to a 185% improvement in APU system performance with our proposed IAS implementation. </p><p>Furthermore, our analysis indicates that over 90% of TLB misses can be filtered by the cache, and employing a larger cache within the system could potentially result in even greater improvements. The proposed IAS offers a promising and practical solution to enhance the performance of both APU and CPU systems, contributing to state-of-the-art research in the field of computer architecture.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"9 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140625640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ReRAM-based Processing-In-Memory (PIM) architectures have been increasingly explored to accelerate various Deep Neural Network (DNN) applications because they can achieve extremely high performance and energy efficiency for in-situ analog Matrix-Vector Multiplication (MVM) operations. However, since the analog-to-digital converters (ADCs) in ReRAM crossbar arrays' peripheral circuits often feature high latency and low area efficiency, AD conversion has become a performance bottleneck of in-situ analog MVMs. Moreover, since each crossbar array is tightly coupled with very few ADCs in current ReRAM-based PIM architectures, the scarce ADC resource is often underutilized.
In this paper, we propose ReHarvest, an ADC-crossbar decoupled architecture that improves the utilization of ADC resources. In particular, we design a many-to-many mapping structure between crossbars and ADCs so that all ADCs in a tile are shared as a resource pool, and thus one crossbar array can harvest many more ADCs to parallelize the AD conversion for each MVM operation. Moreover, we propose a multi-tile matrix mapping (MTMM) scheme to further improve ADC utilization across multiple tiles by enhancing data parallelism. To support fine-grained data dispatching for the MTMM, we also design a bus-based interconnection network to multicast input vectors among multiple tiles, eliminating data redundancy and potential network congestion during multicasting. Extensive experimental results show that ReHarvest improves ADC utilization by 3.2×, and achieves a 3.5× performance speedup while reducing ReRAM resource consumption by 3.1× on average compared with the state-of-the-art PIM architecture, FORMS.
{"title":"ReHarvest: an ADC Resource-Harvesting Crossbar Architecture for ReRAM-Based DNN Accelerators","authors":"Jiahong Xu, Haikun Liu, Zhuohui Duan, Xiaofei Liao, Hai Jin, Xiaokang Yang, Huize Li, Cong Liu, Fubing Mao, Yu Zhang","doi":"10.1145/3659208","DOIUrl":"https://doi.org/10.1145/3659208","url":null,"abstract":"<p>ReRAM-based <i>Processing-In-Memory</i> (PIM) architectures have been increasingly explored to accelerate various <i>Deep Neural Network</i> (DNN) applications because they can achieve extremely high performance and energy-efficiency for in-situ analog <i>Matrix-Vector Multiplication</i> (MVM) operations. However, since ReRAM crossbar arrays’ peripheral circuits–<i>analog-to-digital converters</i> (ADCs) often feature high latency and low area efficiency, AD conversion has become a performance bottleneck of in-situ analog MVMs. Moreover, since each crossbar array is tightly coupled with very limited ADCs in current ReRAM-based PIM architectures, the scarce ADC resource is often underutilized. </p><p>In this paper, we propose ReHarvest, an ADC-crossbar decoupled architecture to improve the utilization of ADC resource. Particularly, we design a many-to-many mapping structure between crossbars and ADCs to share all ADCs in a tile as a resource pool, and thus one crossbar array can harvest much more ADCs to parallelize the AD conversion for each MVM operation. Moreover, we propose a <i>multi-tile matrix mapping</i> (MTMM) scheme to further improve the ADC utilization across multiple tiles by enhancing data parallelism. To support fine-grained data dispatching for the MTMM, we also design a bus-based interconnection network to multicast input vectors among multiple tiles, and thus eliminate data redundancy and potential network congestion during multicasting. Extensive experimental results show that ReHarvest can improve the ADC utilization by 3.2 ×, and achieve 3.5 × performance speedup while reducing the ReRAM resource consumption by 3.1 × on average compared with the state-of-the-art PIM architecture–FORMS.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"54 5 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140612507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hash-based signatures (HBS) are the most conservative and time-consuming among the many post-quantum cryptography (PQC) algorithms. Two HBS schemes, LMS and XMSS, are currently the only PQC algorithms standardised by the National Institute of Standards and Technology (NIST). Existing HBS schemes are designed around serial Merkle tree traversal, which prevents them from taking full advantage of the computing power of parallel architectures such as CPUs and GPUs. We propose a parallel Merkle tree traversal (PMTT), which we evaluate by implementing LMS on the GPU. This is the first work to accelerate LMS on the GPU, and it performs well even with over 10,000 cores. Considering the different scenarios of algorithmic parallelism and data parallelism, we implement corresponding variants of PMTT. The algorithmic-parallelism variant mainly targets the execution efficiency of a single task, while the data-parallelism variant aims at fully utilising GPU performance. In addition, we are the first to design a CPU-GPU collaborative processing solution for traversal algorithms to reduce the communication overhead between CPU and GPU. For algorithmic parallelism, our implementation is still 4.48× faster than the ideal time of the state-of-the-art traversal algorithm. For data parallelism, when the number of cores increases from 1 to 8192, the parallel efficiency is 78.39%. Overall, our LMS implementation outperforms most existing LMS and XMSS implementations.
{"title":"An Example of Parallel Merkle Tree Traversal: Post-Quantum Leighton-Micali Signature on the GPU","authors":"Ziheng Wang, Xiaoshe Dong, Yan Kang, Heng Chen, Qiang Wang","doi":"10.1145/3659209","DOIUrl":"https://doi.org/10.1145/3659209","url":null,"abstract":"<p>The hash-based signature (HBS) is the most conservative and time-consuming among many post-quantum cryptography (PQC) algorithms. Two HBSs, LMS and XMSS, are the only PQC algorithms standardised by the National Institute of Standards and Technology (NIST) now. Existing HBSs are designed based on serial Merkle tree traversal, which is not conducive to taking full advantage of the computing power of parallel architectures such as CPUs and GPUs. We propose a parallel Merkle tree traversal (PMTT), which is tested by implementing LMS on the GPU. This is the first work accelerating LMS on the GPU, which performs well even with over 10,000 cores. Considering different scenarios of algorithmic parallelism and data parallelism, we implement corresponding variants for PMTT. The design of PMTT for algorithmic parallelism mainly considers the execution efficiency of a single task, while that for data parallelism starts with the full utilisation of GPU performance. In addition, we are the first to design a CPU-GPU collaborative processing solution for traversal algorithms to reduce the communication overhead between CPU and GPU. For algorithmic parallelism, our implementation is still 4.48 × faster than the ideal time of the state-of-the-art traversal algorithm. For data parallelism, when the number of cores increases from 1 to 8192, the parallel efficiency is 78.39%. In comparison, our LMS implementation outperforms most existing LMS and XMSS implementations.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"67 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140588978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LSM-based key-value stores suffer from sub-optimal performance due to their slow and heavy background compactions, which impose severe CPU and network overhead on high-speed disaggregated storage. This paper further reveals that data-intensive compression in compaction consumes a significant portion of CPU power, and that multi-threaded compactions cause substantial CPU contention and network traffic during high-load periods. Based on these observations, we propose fine-grained dynamic compaction offloading that leverages the modern Data Processing Unit (DPU) to alleviate the CPU and network overhead. To achieve this, we first customize a file system to enable efficient data access from the DPU. We then leverage the Arm cores on the DPU to absorb burst CPU and network demands, reducing resource contention and data movement. We further employ dedicated hardware accelerators on the DPU to speed up the compression in compactions. We integrate our DPU-offloaded compaction with RocksDB and evaluate it with NVIDIA's latest BlueField-2 DPU on a real system. The evaluation shows that the DPU is an effective solution for removing the CPU bottleneck and reducing the data traffic of compaction. Compared with a fine-tuned CPU-only baseline, compaction performance is accelerated by 2.86 to 4.03 times, system write and read throughput is improved by up to 3.2 times and 1.4 times respectively, and host CPU contention and network traffic are effectively reduced.
{"title":"D2Comp: Efficient Offload of LSM-tree Compaction with Data Processing Units on Disaggregated Storage","authors":"Chen Ding, Jian Zhou, Kai Lu, Sicen Li, Yiqin Xiong, Jiguang Wan, Ling Zhan","doi":"10.1145/3656584","DOIUrl":"https://doi.org/10.1145/3656584","url":null,"abstract":"<p>LSM-based key-value stores suffer from sub-optimal performance due to their slow and heavy background compactions. The compaction brings severe CPU and network overhead on high-speed disaggregated storage. This paper further reveals that data-intensive compression in compaction consumes a significant portion of CPU power. Moreover, the multi-threaded compactions cause substantial CPU contention and network traffic during high-load periods. Based on the above observations, we propose fine-grained dynamical compaction offloading by leveraging the modern Data Processing Unit (DPU) to alleviate the CPU and network overhead. To achieve this, we first customized a file system to enable efficient data access for DPU. We then leverage the Arm cores on the DPU to meet the burst CPU and network requirements to reduce resource contention and data movement. We further employ dedicated hardware-based accelerators on the DPU to speed up the compression in compactions. We integrate our DPU-offloaded compaction with RocksDB and evaluate it with NVIDIA’s latest Bluefield-2 DPU on a real system. The evaluation shows that the DPU is an effective solution to solve the CPU bottleneck and reduce data traffic of compaction. The results show that compaction performance is accelerated by 2.86 to 4.03 times, system write and read throughput is improved by up to 3.2 times and 1.4 times respectively, and host CPU contention and network traffic are effectively reduced compared to the fine-tuned CPU-only baseline.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"16 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140589307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper proposes iSwap, a new memory page swap mechanism that reduces ineffective I/O swap operations and improves QoS for high-priority applications in cloud environments. iSwap works in the OS kernel. It accurately learns the reuse patterns of memory pages and makes swap decisions accordingly to avoid ineffective operations. When memory pressure is high, iSwap compresses pages that belong to latency-critical (LC) applications (or high-priority applications) and keeps them in main memory, avoiding I/O operations for these LC applications to ensure QoS, while evicting low-priority applications' pages out of main memory. iSwap has low overhead and works well for cloud applications with large memory footprints. We evaluate iSwap on Intel x86 and ARM platforms. The experimental results show that, compared with the latest LRU-based approach widely used in modern OSes, iSwap significantly reduces ineffective swap operations (by 8.0% - 19.2%) and improves QoS for LC applications (by 36.8% - 91.3%) when memory pressure is high.
{"title":"iSwap: A New Memory Page Swap Mechanism for Reducing Ineffective I/O Operations in Cloud Environments","authors":"Zhuohao Wang, Lei Liu, Limin Xiao","doi":"10.1145/3653302","DOIUrl":"https://doi.org/10.1145/3653302","url":null,"abstract":"<p>This paper proposes iSwap, a new memory page swap mechanism that reduces the ineffective I/O swap operations and improves the QoS for applications with a high priority in the cloud environments. iSwap works in the OS kernel. iSwap accurately learns the reuse patterns for memory pages and makes the swap decisions accordingly to avoid ineffective operations. In the cases where memory pressure is high, iSwap compresses pages that belong to the latency-critical (LC) applications (or high-priority applications) and keeps them in main memory, avoiding I/O operations for these LC applications to ensure QoS; and iSwap evicts low-priority applications’ pages out of main memory. iSwap has a low overhead and works well for cloud applications with large memory footprints. We evaluate iSwap on Intel x86 and ARM platforms. The experimental results show that iSwap can significantly reduce ineffective swap operations (8.0% - 19.2%) and improve the QoS for LC applications (36.8% - 91.3%) in cases where memory pressure is high, compared with the latest LRU-based approach widely used in modern OSes.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"30 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140203908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Systolic array architectures have significantly accelerated deep neural networks (DNNs). A systolic array comprises multiple processing elements (PEs) that perform multiply-accumulate (MAC) operations. Traditionally, a systolic array executes, in each cycle, an amount of tensor data that matches the array's size. However, the hyper-parameters of DNN models differ across layers and result in different tensor sizes in each layer. Mapping these irregular tensors onto the systolic array while fully utilizing all of its PEs is challenging. Furthermore, modern DNN systolic accelerators typically employ a single dataflow, which is not optimal for every DNN model.
This work proposes ReSA, a reconfigurable dataflow architecture that aims to minimize the execution time of a DNN model by mapping tiny tensors onto a spatially partitioned systolic array. Unlike conventional systolic array architectures, the ReSA data-path controller enables input-, weight-, and output-stationary dataflows to execute on the PEs. ReSA also decomposes the coarse-grained systolic array into multiple small ones to reduce fragmentation in tensor mapping. Each small systolic sub-array unit relies on our data arbiter to dispatch tensors to the other units through a simple interconnection network. Furthermore, ReSA reorders memory accesses to overlap the memory-load and execution stages, hiding memory latency when tackling tiny tensors. Finally, ReSA splits the tensors of each layer into multiple small ones and searches for the best dataflow for each tensor on the host side. ReSA then encodes the selected dataflow in our proposed instruction to notify the systolic array to switch dataflows correctly. As a result, our optimizations on the systolic array architecture achieve a geometric-mean speedup of 1.87× over the weight-stationary systolic array architecture across nine different DNN models.
{"title":"ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors","authors":"Ching-Jui Lee, Tsung Tai Yeh","doi":"10.1145/3653363","DOIUrl":"https://doi.org/10.1145/3653363","url":null,"abstract":"<p>Systolic array architecture has significantly accelerated deep neural networks (DNNs). A systolic array comprises multiple processing elements (PEs) that can perform multiply-accumulate (MAC). Traditionally, the systolic array can execute a certain amount of tensor data that matches the size of the systolic array simultaneously at each cycle. However, hyper-parameters of DNN models differ across each layer and result in various tensor sizes in each layer. Mapping these irregular tensors to the systolic array while fully utilizing the entire PEs in a systolic array is challenging. Furthermore, modern DNN systolic accelerators typically employ a single dataflow. However, such a dataflow isn’t optimal for every DNN model. </p><p>This work proposes ReSA, a reconfigurable dataflow architecture that aims to minimize the execution time of a DNN model by mapping tiny tensors on the spatially partitioned systolic array. Unlike conventional systolic array architectures, the ReSA data path controller enables the execution of the input, weight, and output-stationary dataflow on PEs. ReSA also decomposes the coarse-grain systolic array into multiple small ones to reduce the fragmentation issue on the tensor mapping. Each small systolic sub-array unit relies on our data arbiter to dispatch tensors to each other through the simple interconnected network. Furthermore, ReSA reorders the memory access to overlap the memory load and execution stages to hide the memory latency when tackling tiny tensors. Finally, ReSA splits tensors of each layer into multiple small ones and searches for the best dataflow for each tensor on the host side. Then, ReSA encodes the predefined dataflow in our proposed instruction to notify the systolic array to switch the dataflow correctly. As a result, our optimization on the systolic array architecture achieves a geometric mean speedup of 1.87X over the weight-stationary systolic array architecture across 9 different DNN models.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"19 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140203911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graphics Processing Units (GPUs) are the accelerator of choice in a variety of application domains because they can accelerate massively parallel workloads and can be easily programmed using general-purpose frameworks such as CUDA and OpenCL. Each Streaming Multiprocessor (SM) contains an L1 data cache (L1D) to exploit locality in data accesses. L1D misses are costly for GPUs for two reasons. First, they consume substantial energy, since they must access the L2 cache (L2) via an on-chip network and, on an L2 miss, the off-chip DRAM. Second, they impose a performance overhead if the GPU does not have enough active warps to hide the long memory access latency. We observe that threads running on different SMs share 55% of the data they read from memory. Unfortunately, because the L1Ds are in the non-coherent memory domain, each SM independently fetches data from the L2 or the off-chip memory into its L1D, even though the data may currently be available in the L1D of another SM. Our goal is to service L1D read misses via other SMs as much as possible, to cut down costly accesses to the L2 or the off-chip DRAM. To this end, we propose a new data-sharing mechanism called Cross-Core Data Sharing (CCDS). CCDS employs a predictor to estimate whether the required cache block exists in another SM. If the block is predicted to exist in another SM's L1D, CCDS fetches the data from the L1D that contains the block. Our experiments on a suite of 26 workloads show that CCDS improves average energy and performance by 1.30× and 1.20×, respectively, compared to the baseline GPU, and by 1.37× and 1.11×, respectively, compared to the state-of-the-art data-sharing mechanism.
{"title":"Cross-Core Data Sharing for Energy-Efficient GPUs","authors":"Hajar Falahati, Mohammad Sadrosadati, Qiumin Xu, Juan Gómez-Luna, Banafsheh Saber Latibari, Hyeran Jeon, Shaahin Hesaabi, Hamid Sarbazi-Azad, Onur Mutlu, Murali Annavaram, Masoud Pedram","doi":"10.1145/3653019","DOIUrl":"https://doi.org/10.1145/3653019","url":null,"abstract":"<p>Graphics Processing Units (GPUs) are the accelerator of choice in a variety of application domains because they can accelerate massively parallel workloads and can be easily programmed using general-purpose programming frameworks such as CUDA and OpenCL. Each Streaming Multiprocessor (SM) contains an L1 data cache (L1D) to exploit the locality in data accesses. L1D misses are costly for GPUs due to two reasons. First, L1D misses consume a lot of energy as they need to access the L2 cache (L2) via an on-chip network and the off-chip DRAM in case of L2 misses. Second, L1D misses impose performance overhead if the GPU does not have enough active warps to hide the long memory access latency. We observe that threads running on different SMs share 55% of the data they read from the memory. Unfortunately, as the L1Ds are in the non-coherent memory domain, each SM independently fetches data from the L2 or the off-chip memory into its L1D, even though the data may be currently available in the L1D of another SM. Our goal is to service L1D read misses via other SMs, as much as possible, to cut down costly accesses to the L2 or the off-chip DRAM. To this end, we propose a new data sharing mechanism, called <i>Cross-Core Data Sharing (CCDS)</i>. <i>CCDS</i> employs a predictor to estimate whether or not the required cache block exists in another SM. If the block is predicted to exist in another SM’s L1D, <i>CCDS</i> fetches the data from the L1D that contains the block. Our experiments on a suite of 26 workloads show that <i>CCDS</i> improves average energy and performance by 1.30 × and 1.20 ×, respectively, compared to the baseline GPU. Compared to the state-of-the-art data-sharing mechanism, <i>CCDS</i> improves average energy and performance by 1.37 × and 1.11 ×, respectively.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"23 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140156603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sparse matrix-vector multiplication (SpMV) is one of the most widely used kernels in high-performance computing as well as in machine learning acceleration for sparse neural networks. The design space of SpMV accelerators has two axes: algorithm and matrix representation. Two algorithms, scalar multiplication and dot product, can be combined with two sparse data representations, compressed sparse and bitmap formats, for the matrix and vector. Although prior accelerators each adopted one of the possible designs, it has not been investigated which design is the best across different hardware resources and workload characteristics. This paper first investigates the impact of design choices with respect to the algorithm and data representation. Our evaluation shows that no single design always outperforms the others across different workloads, but the two best designs (i.e., the compressed sparse format and the bitmap format with dot product) have complementary performance with trade-offs incurred by the matrix characteristics. Based on this analysis, this study proposes Cerberus, a triple-mode accelerator supporting two sparse operation modes in addition to the base dense mode. To allow such multi-mode operation, it uses a prediction model based on matrix characteristics under a given hardware configuration, which statically selects the best mode for a given sparse matrix using its dimension and density information. Our experimental results show that Cerberus provides a 12.1× performance improvement over a dense-only accelerator, and a 1.5× improvement over a fixed best SpMV design.
{"title":"Cerberus: Triple Mode Acceleration of Sparse Matrix and Vector Multiplication","authors":"Soojin Hwang, Daehyeon Baek, Jongse Park, Jaehyuk Huh","doi":"10.1145/3653020","DOIUrl":"https://doi.org/10.1145/3653020","url":null,"abstract":"<p>The multiplication of sparse matrix and vector (SpMV) is one of the most widely used kernels in high-performance computing as well as machine learning acceleration for sparse neural networks. The design space of SpMV accelerators has two axes: algorithm and matrix representation. There have been two widely used algorithms and data representations. Two algorithms, scalar multiplication and dot product, can be combined with two sparse data representations, compressed sparse and bitmap formats for the matrix and vector. Although the prior accelerators adopted one of the possible designs, it is yet to be investigated which design is the best one across different hardware resources and workload characteristics. This paper first investigates the impact of design choices with respect to the algorithm and data representation. Our evaluation shows that no single design always outperforms the others across different workloads, but the two best designs (i.e. compressed sparse format and bitmap format with dot product) have complementary performance with trade-offs incurred by the matrix characteristics. Based on the analysis, this study proposes Cerberus, a triple-mode accelerator supporting two sparse operation modes in addition to the base dense mode. To allow such multi-mode operation, it proposes a prediction model based on matrix characteristics under a given hardware configuration, which statically selects the best mode for a given sparse matrix with its dimension and density information. Our experimental results show that Cerberus provides 12.1 × performance improvements from a dense-only accelerator, and 1.5 × improvements from a fixed best SpMV design.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"28 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140151823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}