Fabsim-X: A Simulation Framework for the Analysis of Large-Scale Topologies and Congestion Control Protocols in Data Center Networks

Malek Musleh, Roberto Peñaranda, Allister Alemania, P. Yébenes, Gene Y. Wu, Jan Zielinski, K. Raszkowski, N. Ni, Scott Diesing, Anupama Kurpad, R. Huggahalli, Curt E. Bruns, Steven Miller, Sujoy Sen
Published in: 2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)
Publication date: 2020-11-17
DOI: 10.1109/MASCOTS50786.2020.9285933 (https://doi.org/10.1109/MASCOTS50786.2020.9285933)

Abstract

The explosive growth of cloud computing and of data center systems overall has placed unprecedented demand on system architects and designers to continuously develop more complex system networks that satisfy the insatiable appetite to process, move, and store large amounts of data. Nonlinear system behavior caused by emerging workloads and use cases, varying end-to-end congestion protocols, and heterogeneity in the compute and storage capabilities of custom-designed accelerators further compounds the design problem. Modern simulation methodologies lack a cohesive and efficient framework to address the interoperability of the intersecting layers at scale. In this paper, we present a simulation framework for evaluating congestion control protocols. Furthermore, we present a set of optimizations that enable analysis over longer simulated times and at network scales of up to 128K nodes, which is vital for proper analysis of workloads that require long run times (e.g., AI training) or that are known to have scaling issues (e.g., RDMA). Specifically, we evaluate congestion control performance at various scales and study both the implications of topology scaling on congestion and the performance impact of running heterogeneous protocols simultaneously.