Scalable Tail Latency Estimation for Data Center Networks

Kevin Zhao, Prateesh Goyal, Mohammad Alizadeh, Thomas E. Anderson
Symposium on Networked Systems Design and Implementation
DOI: 10.48550/arXiv.2205.01234
Published: 2022-05-02
Citations: 6

Abstract

In this paper, we consider how to provide fast estimates of flow-level tail latency performance for very large-scale data center networks. Network tail latency is often a crucial metric for cloud application performance that can be affected by a wide variety of factors, including network load, inter-rack traffic skew, traffic burstiness, flow size distributions, oversubscription, and topology asymmetry. Network simulators such as ns-3 and OMNeT++ can provide accurate answers, but are very hard to parallelize, taking hours or days to answer what-if questions for a single configuration at even moderate scale. Recent work with MimicNet has shown how to use machine learning to improve simulation performance, but at the cost of a long training step per configuration, and with assumptions about workload and topology uniformity that typically do not hold in practice. We address this gap by developing a set of techniques to provide fast performance estimates for large-scale networks with general traffic matrices and topologies. A key step is to decompose the problem into a large number of parallel, independent single-link simulations; we carefully combine these link-level simulations to produce accurate estimates of end-to-end flow-level performance distributions for the entire network. Like MimicNet, we exploit symmetry where possible to gain additional speedups, but without relying on machine learning, so there is no training delay. On large-scale networks where ns-3 takes 11 to 27 hours to simulate five seconds of network behavior, our techniques run in one to two minutes with 99th-percentile accuracy within 9% for flow completion times.
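The core decomposition the abstract describes can be illustrated with a small Monte Carlo sketch: if independent single-link simulations each yield a per-link delay distribution, an end-to-end estimate for a path can be formed by sampling one delay per link and summing along the path. This is only a toy illustration of the idea under an independence assumption; the link names, exponential delay model, and sample counts below are hypothetical, not taken from the paper.

```python
import random

random.seed(0)

# Hypothetical per-link delay samples (microseconds), standing in for the
# output of independent single-link simulations. Distributions are illustrative.
N = 10_000
link_delay_samples = {
    "host->tor": [random.expovariate(1 / 10.0) for _ in range(N)],
    "tor->agg":  [random.expovariate(1 / 20.0) for _ in range(N)],
    "agg->tor":  [random.expovariate(1 / 20.0) for _ in range(N)],
    "tor->host": [random.expovariate(1 / 10.0) for _ in range(N)],
}

def sample_path_delay(path, n=N):
    """Monte Carlo estimate of the end-to-end delay distribution for a path,
    assuming per-link delays are independent (the decomposition step)."""
    return [sum(random.choice(link_delay_samples[link]) for link in path)
            for _ in range(n)]

def percentile(samples, p):
    """Empirical p-th percentile of a list of samples."""
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

path = ["host->tor", "tor->agg", "agg->tor", "tor->host"]
e2e = sample_path_delay(path)
print(f"p50 = {percentile(e2e, 50):.1f} us, p99 = {percentile(e2e, 99):.1f} us")
```

Because each link is simulated independently, the per-link step parallelizes trivially; only the cheap combination step touches whole paths, which is what makes the approach fast relative to packet-level whole-network simulation.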