Rcmp: Reconstructing RDMA-Based Memory Disaggregation via CXL

ACM Transactions on Architecture and Code Optimization · Pub Date: 2024-01-19 · DOI: 10.1145/3634916 · IF 1.8 · Zone 3 (Computer Science) · JCR Q4 (Computer Science, Hardware & Architecture)

Zhonghua Wang, Yixing Guo, Kai Lu, Jiguang Wan, Daohui Wang, Ting Yao, Huatao Wu


Citations: 0

Abstract



Memory disaggregation is a promising architecture for modern datacenters that separates compute and memory resources into independent pools connected by ultra-fast networks, which can improve memory utilization, reduce cost, and enable elastic scaling of compute and memory resources. However, existing memory disaggregation solutions based on remote direct memory access (RDMA) suffer from high latency and additional overheads, including page faults and code refactoring. Emerging cache-coherent interconnects such as CXL offer opportunities to reconstruct high-performance memory disaggregation. However, existing CXL-based approaches are limited by physical distance and cannot be deployed across racks.

In this article, we propose Rcmp, a novel low-latency and highly scalable memory disaggregation system based on RDMA and CXL. Its distinguishing feature is that Rcmp improves the performance of RDMA-based systems via CXL, and leverages RDMA to overcome CXL’s distance limitation. To address the mismatch between RDMA and CXL in terms of granularity, communication, and performance, Rcmp (1) provides global page-based memory space management and enables fine-grained data access, (2) designs an efficient communication mechanism to avoid communication blocking issues, (3) proposes a hot-page identification and swapping strategy to reduce RDMA communications, and (4) designs an RDMA-optimized RPC framework to accelerate RDMA transfers. We implement a prototype of Rcmp and evaluate its performance using micro-benchmarks and a key-value store running YCSB benchmarks. The results show that Rcmp can achieve 5.2× lower latency and 3.8× higher throughput than RDMA-based systems. We also demonstrate that Rcmp scales well with an increasing number of nodes without compromising performance.
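To make item (3) more concrete, the following is a minimal sketch of how hot-page identification might reduce remote traffic. It is not Rcmp's actual algorithm, which the abstract does not describe: the class name `HotPageTracker`, the 4 KiB page size, the threshold of three accesses, and the simplified "swap" (which only marks a page as local rather than exchanging it with another page) are all assumptions made for illustration.

```python
from collections import Counter

PAGE_SIZE = 4096        # assumed page size; the abstract does not state one
HOT_THRESHOLD = 3       # assumed: remote accesses before a page counts as hot


class HotPageTracker:
    """Toy model: counts remote accesses per page and marks hot pages as local."""

    def __init__(self, threshold=HOT_THRESHOLD):
        self.threshold = threshold
        self.remote_hits = Counter()   # page id -> remote accesses seen so far
        self.local_pages = set()       # pages already moved into the local pool
        self.rdma_accesses = 0         # accesses that had to cross the network

    def access(self, addr):
        """Record one access to a global address; return 'local' or 'remote'."""
        page = addr // PAGE_SIZE       # map a fine-grained address to its page
        if page in self.local_pages:
            return "local"
        self.rdma_accesses += 1
        self.remote_hits[page] += 1
        if self.remote_hits[page] >= self.threshold:
            # Hypothetical swap: later accesses to this page stay local.
            self.local_pages.add(page)
            del self.remote_hits[page]
        return "remote"


tracker = HotPageTracker()
trace = [0, 64, 128, 192, 8192, 256]   # five accesses to page 0, one to page 2
results = [tracker.access(a) for a in trace]
print(results, tracker.rdma_accesses)
```

In this toy trace, page 0 becomes hot on its third remote access, so its later accesses are served locally, while the single access to page 2 stays remote. A real system would also need to choose a victim page, move data, and keep the page directory consistent, none of which is modeled here.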


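Source journal

ACM Transactions on Architecture and Code Optimization (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore: 3.60

Self-citation rate: 6.20%

Publication volume: (no value given)

Review time: 6-12 weeks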

About the journal: ACM Transactions on Architecture and Code Optimization (TACO) focuses on hardware, software, and system research spanning the fields of computer architecture and code optimization. Articles that appear in TACO will either present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to architects, hardware or software developers, designers, builders, and users will be emphasized.