Scalable Tail Latency Estimation for Data Center Networks

Kevin Zhao, Prateesh Goyal, Mohammad Alizadeh, Thomas E. Anderson
Symposium on Networked Systems Design and Implementation
DOI: 10.48550/arXiv.2205.01234
Published: 2022-05-02
Citations: 6

Abstract

In this paper, we consider how to provide fast estimates of flow-level tail latency performance for very large-scale data center networks. Network tail latency is often a crucial metric for cloud application performance that can be affected by a wide variety of factors, including network load, inter-rack traffic skew, traffic burstiness, flow size distributions, oversubscription, and topology asymmetry. Network simulators such as ns-3 and OMNeT++ can provide accurate answers, but are very hard to parallelize, taking hours or days to answer what-if questions for a single configuration at even moderate scale. Recent work with MimicNet has shown how to use machine learning to improve simulation performance, but at the cost of a long training step per configuration, and with assumptions about workload and topology uniformity that typically do not hold in practice. We address this gap by developing a set of techniques to provide fast performance estimates for large-scale networks with general traffic matrices and topologies. A key step is to decompose the problem into a large number of parallel, independent single-link simulations; we carefully combine these link-level simulations to produce accurate estimates of end-to-end flow-level performance distributions for the entire network. Like MimicNet, we exploit symmetry where possible to gain additional speedups, but without relying on machine learning, so there is no training delay. On large-scale networks where ns-3 takes 11 to 27 hours to simulate five seconds of network behavior, our techniques run in one to two minutes with 99th-percentile accuracy within 9% for flow completion times.
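The core decomposition the abstract describes can be illustrated with a small Monte Carlo sketch: if independent single-link simulations each yield a per-link delay distribution, an end-to-end estimate for a path can be formed by sampling one delay per link and summing along the path. This is only a toy illustration of the idea under an independence assumption; the link names, exponential delay model, and sample counts below are hypothetical, not taken from the paper.

```python
import random

random.seed(0)

# Hypothetical per-link delay samples (microseconds), standing in for the
# output of independent single-link simulations. Distributions are illustrative.
N = 10_000
link_delay_samples = {
    "host->tor": [random.expovariate(1 / 10.0) for _ in range(N)],
    "tor->agg":  [random.expovariate(1 / 20.0) for _ in range(N)],
    "agg->tor":  [random.expovariate(1 / 20.0) for _ in range(N)],
    "tor->host": [random.expovariate(1 / 10.0) for _ in range(N)],
}

def sample_path_delay(path, n=N):
    """Monte Carlo estimate of the end-to-end delay distribution for a path,
    assuming per-link delays are independent (the decomposition step)."""
    return [sum(random.choice(link_delay_samples[link]) for link in path)
            for _ in range(n)]

def percentile(samples, p):
    """Empirical p-th percentile of a list of samples."""
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

path = ["host->tor", "tor->agg", "agg->tor", "tor->host"]
e2e = sample_path_delay(path)
print(f"p50 = {percentile(e2e, 50):.1f} us, p99 = {percentile(e2e, 99):.1f} us")
```

Because each link is simulated independently, the per-link step parallelizes trivially; only the cheap combination step touches whole paths, which is what makes the approach fast relative to packet-level whole-network simulation.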