{"title":"D2Comp: Efficient Offload of LSM-tree Compaction with Data Processing Units on Disaggregated Storage","authors":"Chen Ding, Jian Zhou, Kai Lu, Sicen Li, Yiqin Xiong, Jiguang Wan, Ling Zhan","doi":"10.1145/3656584","DOIUrl":null,"url":null,"abstract":"<p>LSM-based key-value stores suffer from sub-optimal performance due to their slow and heavy background compactions. The compaction brings severe CPU and network overhead on high-speed disaggregated storage. This paper further reveals that data-intensive compression in compaction consumes a significant portion of CPU power. Moreover, the multi-threaded compactions cause substantial CPU contention and network traffic during high-load periods. Based on the above observations, we propose fine-grained dynamical compaction offloading by leveraging the modern Data Processing Unit (DPU) to alleviate the CPU and network overhead. To achieve this, we first customized a file system to enable efficient data access for DPU. We then leverage the Arm cores on the DPU to meet the burst CPU and network requirements to reduce resource contention and data movement. We further employ dedicated hardware-based accelerators on the DPU to speed up the compression in compactions. We integrate our DPU-offloaded compaction with RocksDB and evaluate it with NVIDIA’s latest Bluefield-2 DPU on a real system. The evaluation shows that the DPU is an effective solution to solve the CPU bottleneck and reduce data traffic of compaction. The results show that compaction performance is accelerated by 2.86 to 4.03 times, system write and read throughput is improved by up to 3.2 times and 1.4 times respectively, and host CPU contention and network traffic are effectively reduced compared to the fine-tuned CPU-only baseline.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"16 1","pages":""},"PeriodicalIF":1.5000,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Architecture and Code Optimization","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3656584","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citation count: 0
Abstract
LSM-based key-value stores suffer from sub-optimal performance due to their slow and heavy background compactions. Compaction incurs severe CPU and network overhead on high-speed disaggregated storage. This paper further reveals that the data-intensive compression within compaction consumes a significant share of CPU cycles. Moreover, multi-threaded compactions cause substantial CPU contention and network traffic during high-load periods. Based on these observations, we propose fine-grained, dynamic compaction offloading that leverages a modern Data Processing Unit (DPU) to alleviate the CPU and network overhead. To achieve this, we first customize a file system to enable efficient data access from the DPU. We then leverage the Arm cores on the DPU to absorb burst CPU and network demand, reducing resource contention and data movement. We further employ the DPU's dedicated hardware accelerators to speed up compression during compaction. We integrate our DPU-offloaded compaction with RocksDB and evaluate it on a real system with NVIDIA's latest BlueField-2 DPU. The evaluation shows that the DPU is an effective solution for relieving the CPU bottleneck and reducing the data traffic of compaction. Compared to a fine-tuned CPU-only baseline, compaction performance is accelerated by 2.86 to 4.03 times, system write and read throughput improve by up to 3.2 times and 1.4 times respectively, and host CPU contention and network traffic are effectively reduced.
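The abstract describes a fine-grained, dynamic policy: route compaction work to the DPU during host CPU bursts and when hardware compression pays off, otherwise run it on the host. The paper does not publish its decision logic, so the following is only a minimal illustrative sketch; the function, job fields, thresholds, and cost heuristics are all hypothetical, not the authors' implementation.

```python
# Hypothetical sketch of a dynamic compaction-offload policy (not D2Comp's
# actual code). All names and thresholds here are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class CompactionJob:
    input_bytes: int   # total size of the SSTables being merged
    compress: bool     # compression-heavy jobs benefit most from DPU accelerators


def choose_executor(job: CompactionJob,
                    host_cpu_util: float,
                    dpu_queue_depth: int,
                    cpu_busy_threshold: float = 0.75,
                    dpu_queue_limit: int = 8) -> str:
    """Return "dpu" or "host" for one compaction job.

    Mirrors the abstract's ideas: offload during host CPU bursts to cut
    contention and network traffic, and offload large compression-heavy
    jobs to the DPU's hardware compression engine.
    """
    if dpu_queue_depth >= dpu_queue_limit:
        return "host"   # DPU Arm cores saturated: avoid queueing delay
    if host_cpu_util >= cpu_busy_threshold:
        return "dpu"    # high-load period: relieve host CPU contention
    if job.compress and job.input_bytes >= 64 << 20:
        return "dpu"    # large compressed merge: hardware accelerator wins
    return "host"       # light load, small job: host CPU is cheapest
```

A scheduler would call `choose_executor` per job with live utilization and queue-depth samples; the thresholds would in practice be tuned (or adapted online) against the measured DPU and host compaction throughput.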
About the journal:
ACM Transactions on Architecture and Code Optimization (TACO) focuses on hardware, software, and system research spanning the fields of computer architecture and code optimization. Articles that appear in TACO will either present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to architects, hardware or software developers, designers, builders, and users will be emphasized.