Accelerating Geo-Distributed Machine Learning With Network-Aware Adaptive Tree and Auxiliary Route

IF 3.6 | Zone 3 (Computer Science) | JCR Q2, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | IEEE/ACM Transactions on Networking | Pub Date: 2024-06-12 | DOI: 10.1109/TNET.2024.3412429
Zonghang Li;Wenjiao Feng;Weibo Cai;Hongfang Yu;Long Luo;Gang Sun;Hongyang Du;Dusit Niyato
IEEE/ACM Transactions on Networking, vol. 32, no. 5, pp. 4238-4253, 2024. https://ieeexplore.ieee.org/document/10555207/
Citations: 0

Abstract

Distributed machine learning is becoming increasingly popular for geo-distributed data analytics, facilitating the collaborative analysis of data scattered across data centers in different regions. This paradigm eliminates the need to centralize sensitive raw data in one location but faces the significant challenge of high parameter synchronization delays, which stem from the constraints of bandwidth-limited, heterogeneous, and fluctuating wide-area networks. Prior research has focused on optimizing the synchronization topology, evolving from star-like to tree-based structures. However, these solutions typically depend on regular tree structures and lack an adequate topology metric, resulting in limited improvements. This paper proposes NetStorm, an adaptive and highly efficient communication scheduler designed to speed up parameter synchronization across geo-distributed data centers. First, it establishes an effective metric for optimizing a multi-root FAPT synchronization topology. Second, a network awareness module is developed to acquire network knowledge, aiding in topology decisions. Third, a multipath auxiliary transmission mechanism is introduced to enhance network awareness and facilitate multipath transmissions. Lastly, we design policy consistency protocols to guarantee seamless updates of transmission policies. Empirical results demonstrate that NetStorm significantly outperforms distributed training systems such as MXNET, MLNET, and TSEngine, with a speedup of 6.5 to 9.2 times over MXNET.
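To make the contrast between a star-like and a network-aware tree topology concrete, the following is a minimal sketch, not NetStorm's algorithm or its FAPT metric, which the abstract does not specify. It assumes a hypothetical symmetric bandwidth matrix between four data centers and uses a Prim-style maximum spanning tree as a stand-in topology rule, with the slowest link in the topology as a deliberately crude quality measure. The names `widest_tree`, `star`, `bottleneck`, and `bw` are illustrative only.

```python
from typing import Dict, List


def widest_tree(bandwidth: List[List[float]], root: int = 0) -> Dict[int, int]:
    """Build a maximum spanning tree (Prim-style) and return a child -> parent map.

    Assumption: bandwidth[u][v] is the measured link bandwidth between
    data centers u and v. This is a stand-in rule, not the paper's metric.
    """
    n = len(bandwidth)
    in_tree = {root}
    parent: Dict[int, int] = {}
    while len(in_tree) < n:
        # Pick the highest-bandwidth link that connects the tree to a new node.
        _, u, v = max(
            ((bandwidth[u][v], u, v)
             for u in in_tree
             for v in range(n) if v not in in_tree),
            key=lambda link: link[0],
        )
        parent[v] = u
        in_tree.add(v)
    return parent


def star(n: int, root: int = 0) -> Dict[int, int]:
    """Star topology: every node sends directly to the root."""
    return {v: root for v in range(n) if v != root}


def bottleneck(bandwidth: List[List[float]], parent: Dict[int, int]) -> float:
    """Slowest link used by the topology (a crude proxy for sync delay)."""
    return min(bandwidth[p][c] for c, p in parent.items())


# Hypothetical bandwidths in Mbps between four data centers.
bw = [
    [0, 100, 10, 5],
    [100, 0, 80, 20],
    [10, 80, 0, 60],
    [5, 20, 60, 0],
]

tree = widest_tree(bw)
print("tree:", tree, "bottleneck:", bottleneck(bw, tree))
print("star:", star(len(bw)), "bottleneck:", bottleneck(bw, star(len(bw))))
```

With these made-up numbers the star topology is limited by a 5 Mbps link, while the bandwidth-aware tree routes around it and is limited by a 60 Mbps link. A real scheduler would also have to account for aggregation cost at intermediate nodes, multiple roots, and bandwidth that changes over time, none of which this sketch models.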
Source journal
IEEE/ACM Transactions on Networking (category: Engineering & Technology, Telecommunications)
CiteScore: 8.20
Self-citation rate: 5.40%
Articles published: 246
Review time: 4-8 weeks
Journal description: The IEEE/ACM Transactions on Networking's high-level objective is to publish high-quality, original research results derived from theoretical or experimental exploration of the area of communication/computer networking, covering all sorts of information transport networks over all sorts of physical layer technologies, both wireline (all kinds of guided media: e.g., copper, optical) and wireless (e.g., radio-frequency, acoustic (e.g., underwater), infra-red), or hybrids of these. The journal welcomes applied contributions reporting on novel experiences and experiments with actual systems.