FastWake:重新访问中断模式RDMA的主机网络堆栈

Bojie Li, Zihao Xiang, Xiaoliang Wang, Hang Ruan, Jingbin Zhou, Kun Tan
{"title":"FastWake:重新访问中断模式RDMA的主机网络堆栈","authors":"Bojie Li, Zihao Xiang, Xiaoliang Wang, Hang Ruan, Jingbin Zhou, Kun Tan","doi":"10.1145/3600061.3600063","DOIUrl":null,"url":null,"abstract":"Polling and interrupt has long been a trade-off in RDMA systems. Polling has lower latency but each CPU core can only run one thread. Interrupt enables time sharing among multiple threads but has higher latency. Many applications such as databases have hundreds of threads, which is much larger than the number of cores. So, they have to use interrupt mode to share cores among threads, and the resulting RDMA latency is much higher than the hardware limits. In this paper, we analyze the root cause of high costs in RDMA interrupt delivery, and present FastWake, a practical redesign of interrupt-mode RDMA host network stack using commodity RDMA hardware, Linux OS, and unmodified applications. Our first approach to fast thread wake-up completely removes interrupts. We design a per-core dispatcher thread to poll all the completion queues of the application threads on the same core, and utilize a kernel fast path to context switch to the thread with an incoming completion event. The approach above would keep CPUs running at 100% utilization, so we design an interrupt-based approach for scenarios with power constraints. Observing that waking up a thread on the same core as the interrupt is much faster than threads on other cores, we dynamically adjust RDMA event queue mappings to improve interrupt core affinity. In addition, we revisit the kernel path of thread wake-up, and remove the overheads in virtual file system (VFS), locking, and process scheduling. Experiments show that FastWake can reduce RDMA latency by 80% on x86 and 77% on ARM at the cost of < 30% higher power utilization than traditional interrupts, and the latency is only 0.3 ∼ 0.4 μ s higher than the limits of underlying hardware. When power saving is desired, our interrupt-based approach can still reduce interrupt-mode RDMA latency by 59% on x86 and 52% on ARM.","PeriodicalId":228934,"journal":{"name":"Proceedings of the 7th Asia-Pacific Workshop on Networking","volume":"56 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"FastWake: Revisiting Host Network Stack for Interrupt-mode RDMA\",\"authors\":\"Bojie Li, Zihao Xiang, Xiaoliang Wang, Hang Ruan, Jingbin Zhou, Kun Tan\",\"doi\":\"10.1145/3600061.3600063\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Polling and interrupt has long been a trade-off in RDMA systems. Polling has lower latency but each CPU core can only run one thread. Interrupt enables time sharing among multiple threads but has higher latency. Many applications such as databases have hundreds of threads, which is much larger than the number of cores. So, they have to use interrupt mode to share cores among threads, and the resulting RDMA latency is much higher than the hardware limits. In this paper, we analyze the root cause of high costs in RDMA interrupt delivery, and present FastWake, a practical redesign of interrupt-mode RDMA host network stack using commodity RDMA hardware, Linux OS, and unmodified applications. Our first approach to fast thread wake-up completely removes interrupts. We design a per-core dispatcher thread to poll all the completion queues of the application threads on the same core, and utilize a kernel fast path to context switch to the thread with an incoming completion event. The approach above would keep CPUs running at 100% utilization, so we design an interrupt-based approach for scenarios with power constraints. Observing that waking up a thread on the same core as the interrupt is much faster than threads on other cores, we dynamically adjust RDMA event queue mappings to improve interrupt core affinity. In addition, we revisit the kernel path of thread wake-up, and remove the overheads in virtual file system (VFS), locking, and process scheduling. Experiments show that FastWake can reduce RDMA latency by 80% on x86 and 77% on ARM at the cost of < 30% higher power utilization than traditional interrupts, and the latency is only 0.3 ∼ 0.4 μ s higher than the limits of underlying hardware. When power saving is desired, our interrupt-based approach can still reduce interrupt-mode RDMA latency by 59% on x86 and 52% on ARM.\",\"PeriodicalId\":228934,\"journal\":{\"name\":\"Proceedings of the 7th Asia-Pacific Workshop on Networking\",\"volume\":\"56 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-06-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 7th Asia-Pacific Workshop on Networking\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3600061.3600063\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 7th Asia-Pacific Workshop on Networking","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3600061.3600063","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

在RDMA系统中,轮询和中断一直是一种权衡。轮询具有较低的延迟,但每个CPU核心只能运行一个线程。中断可以在多个线程之间共享时间,但具有更高的延迟。许多应用程序(如数据库)有数百个线程,这比内核的数量要大得多。因此,它们必须使用中断模式在线程之间共享内核,由此产生的RDMA延迟远远高于硬件限制。在本文中,我们分析了RDMA中断交付高成本的根本原因,并提出了FastWake,这是一种使用商用RDMA硬件,Linux操作系统和未经修改的应用程序对中断模式RDMA主机网络堆栈进行重新设计的实用方法。我们的第一种快速线程唤醒方法完全消除了中断。我们设计了一个每核调度线程来轮询同一核上应用程序线程的所有完成队列,并利用内核快速路径将上下文切换到具有传入完成事件的线程。上述方法将使cpu以100%的利用率运行,因此我们为具有功率限制的场景设计了基于中断的方法。观察到唤醒与中断相同核心上的线程要比唤醒其他核心上的线程快得多,我们动态调整RDMA事件队列映射以改善中断核心的亲和性。此外,我们将重新审视线程唤醒的内核路径,并删除虚拟文件系统(VFS)、锁定和进程调度中的开销。实验表明,FastWake可以在x86上减少80%的RDMA延迟,在ARM上减少77%的RDMA延迟,而功耗比传统中断高30%,延迟仅比底层硬件的极限高0.3 ~ 0.4 μ s。当需要节省电力时,我们基于中断的方法仍然可以在x86上减少59%的中断模式RDMA延迟,在ARM上减少52%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
FastWake: Revisiting Host Network Stack for Interrupt-mode RDMA
Polling and interrupt has long been a trade-off in RDMA systems. Polling has lower latency but each CPU core can only run one thread. Interrupt enables time sharing among multiple threads but has higher latency. Many applications such as databases have hundreds of threads, which is much larger than the number of cores. So, they have to use interrupt mode to share cores among threads, and the resulting RDMA latency is much higher than the hardware limits. In this paper, we analyze the root cause of high costs in RDMA interrupt delivery, and present FastWake, a practical redesign of interrupt-mode RDMA host network stack using commodity RDMA hardware, Linux OS, and unmodified applications. Our first approach to fast thread wake-up completely removes interrupts. We design a per-core dispatcher thread to poll all the completion queues of the application threads on the same core, and utilize a kernel fast path to context switch to the thread with an incoming completion event. The approach above would keep CPUs running at 100% utilization, so we design an interrupt-based approach for scenarios with power constraints. Observing that waking up a thread on the same core as the interrupt is much faster than threads on other cores, we dynamically adjust RDMA event queue mappings to improve interrupt core affinity. In addition, we revisit the kernel path of thread wake-up, and remove the overheads in virtual file system (VFS), locking, and process scheduling. Experiments show that FastWake can reduce RDMA latency by 80% on x86 and 77% on ARM at the cost of < 30% higher power utilization than traditional interrupts, and the latency is only 0.3 ∼ 0.4 μ s higher than the limits of underlying hardware. When power saving is desired, our interrupt-based approach can still reduce interrupt-mode RDMA latency by 59% on x86 and 52% on ARM.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Deadline Enables In-Order Flowlet Switching for Load Balancing Online Detection of 1D and 2D Hierarchical Super-Spreaders in High-Speed Networks ABC: Adaptive Bitrate Algorithm Commander for Multi-Client Video Streaming Bamboo: Boosting Training Efficiency for Real-Time Video Streaming via Online Grouped Federated Transfer Learning Improving Cloud Storage Network Bandwidth Utilization of Scientific Applications
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1