Birds of a Feather Flock Together: Scaling RDMA RPCs with Flock

Operating Systems Review (ACM) (JCR Q3, Computer Science) | Pub Date: 2021-10-26 | DOI: 10.1145/3477132.3483576

S. Monga, Sanidhya Kashyap, Changwoo Min


Citations: 16

Abstract

RDMA-capable networks are gaining traction in datacenter deployments due to their high throughput, low latency, CPU efficiency, and advanced features such as remote memory operations. However, efficiently utilizing RDMA capability in the common setting of a high fan-in, fan-out asymmetric network topology is challenging. For instance, using RDMA programming features comes at the cost of connection scalability as the cluster size increases. To address this, several works forgo some RDMA features and focus only on conventional RPC APIs. In this work, we strive to exploit the full capability of RDMA while scaling the number of connections regardless of the cluster size. We present Flock, a communication framework for RDMA networks that uses hardware-provided reliable connections. Using a partially shared model, Flock departs from the conventional RDMA design by enabling connection sharing among threads, which provides significant performance improvements, contrary to the widely held belief that connection sharing deteriorates performance. At its core, Flock uses a connection handle abstraction for connection multiplexing; a new coalescing-based synchronization approach for efficient network utilization; and a load-control mechanism for connections with symbiotic send-recv scheduling, which reduces the synchronization overheads associated with connection sharing while ensuring fair utilization of network connections. We demonstrate the benefits for a distributed transaction processing system and an in-memory index, where Flock outperforms other RPC systems by up to 88% and 50%, respectively, with significant reductions in median and tail latency.
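To make the idea of coalescing requests on a shared connection more concrete, the following is a minimal sketch of the general technique, not Flock's actual design or API. It assumes a hypothetical `SharedConnection` class in which threads enqueue requests and whichever thread finds no sender active becomes a temporary leader that drains the queue in batches. The network is simulated with an in-memory list (`wire_log`), and all names and the `max_batch` parameter are illustrative assumptions.

```python
import threading
from collections import deque


class SharedConnection:
    """Hypothetical stand-in for one reliable connection shared by threads."""

    def __init__(self, max_batch=16):
        self.max_batch = max_batch
        self.wire_log = []               # one entry per simulated network send
        self._pending = deque()          # requests waiting to be sent
        self._lock = threading.Lock()    # guards _pending and _leader_active
        self._leader_active = False

    def _post_send(self, batch):
        # Placeholder for a real network send; here we only record the batch.
        self.wire_log.append(list(batch))

    def send(self, request):
        """Enqueue a request; send it ourselves only if nobody else is sending."""
        with self._lock:
            self._pending.append(request)
            if self._leader_active:
                return                   # the current leader will send it for us
            self._leader_active = True   # become the leader

        # Leader path: drain the queue in batches of at most max_batch.
        while True:
            with self._lock:
                if not self._pending:
                    self._leader_active = False
                    return
                count = min(self.max_batch, len(self._pending))
                batch = [self._pending.popleft() for _ in range(count)]
            self._post_send(batch)       # one send covers many threads' requests


def run_demo(num_threads=4, per_thread=50, max_batch=4):
    """Drive the shared connection from several threads and return it."""
    conn = SharedConnection(max_batch=max_batch)

    def worker(tid):
        for i in range(per_thread):
            conn.send((tid, i))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return conn
```

In this sketch a thread that finds a leader already active returns immediately after enqueueing, so only one thread touches the connection at a time and contended sends are merged into fewer, larger batches. How many batches result depends on thread timing, so the sketch makes no claim about the achieved batching ratio.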




Source journal

Operating Systems Review (ACM) (Computer Science: Computer Networks and Communications)

CiteScore: 2.80

Self-citation rate: 0.00%


About the journal: Operating Systems Review (OSR) is a publication of the ACM Special Interest Group on Operating Systems (SIGOPS), whose scope of interest includes: computer operating systems and architecture for multiprogramming, multiprocessing, and time sharing; resource management; evaluation and simulation; reliability, integrity, and security of data; communications among computing processors; and computer system modeling and analysis.

Latest articles from the journal

Disaggregated GPU Acceleration for Serverless Applications
Navigating Performance-Efficiency Tradeoffs in Serverless Computing: Deduplication to the Rescue!
Using Local Cache Coherence for Disaggregated Memory Systems
Make It Real: An End-to-End Implementation of A Physically Disaggregated Data Center
Memory disaggregation: why now and what are the challenges