Accelerating Geo-Distributed Machine Learning With Network-Aware Adaptive Tree and Auxiliary Route

IF 3.6 | Region 3, Computer Science | JCR Q2, Computer Science, Hardware & Architecture | IEEE/ACM Transactions on Networking | Pub Date: 2024-06-12 | DOI: 10.1109/TNET.2024.3412429

Zonghang Li;Wenjiao Feng;Weibo Cai;Hongfang Yu;Long Luo;Gang Sun;Hongyang Du;Dusit Niyato

IEEE/ACM Transactions on Networking, vol. 32, no. 5, pp. 4238-4253. Journal Article. https://ieeexplore.ieee.org/document/10555207/

Citations: 0

Abstract

Distributed machine learning is becoming increasingly popular for geo-distributed data analytics, facilitating the collaborative analysis of data scattered across data centers in different regions. This paradigm eliminates the need to centralize sensitive raw data in one location, but it faces the significant challenge of high parameter synchronization delays, which stem from the constraints of bandwidth-limited, heterogeneous, and fluctuating wide-area networks. Prior research has focused on optimizing the synchronization topology, evolving from star-like to tree-based structures. However, these solutions typically depend on regular tree structures and lack an adequate topology metric, resulting in limited improvements. This paper proposes NetStorm, an adaptive and highly efficient communication scheduler designed to speed up parameter synchronization across geo-distributed data centers. First, it establishes an effective metric for optimizing a multi-root FAPT synchronization topology. Second, a network awareness module is developed to acquire network knowledge, aiding topology decisions. Third, a multipath auxiliary transmission mechanism is introduced to enhance network awareness and facilitate multipath transmissions. Lastly, policy consistency protocols are designed to guarantee seamless updates of transmission policies. Empirical results demonstrate that NetStorm significantly outperforms distributed training systems such as MXNET, MLNET, and TSEngine, with a speedup of 6.5 to 9.2 times over MXNET.
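The abstract's point about moving from star-like to network-aware tree topologies can be illustrated with a toy calculation. The sketch below is not NetStorm's metric or its FAPT construction, neither of which the abstract specifies. It assumes a deliberately simple model in which each node forwards its aggregated parameters to its parent once all of its children have delivered theirs, and each transfer takes payload size divided by link bandwidth, with no contention between flows. The function name `aggregation_time`, the bandwidth matrix, and the payload size are all hypothetical.

```python
from typing import List, Optional, Sequence


def aggregation_time(
    parent: Sequence[Optional[int]],
    bw_mbps: Sequence[Sequence[float]],
    size_mb: float,
) -> float:
    """Seconds until the root holds the aggregate, under the toy model.

    parent[v] is the node that v sends to (None marks the root).
    bw_mbps[a][b] is the assumed bandwidth of the link a -> b in Mbit/s.
    size_mb is the parameter payload in megabytes.
    """
    children: List[List[int]] = [[] for _ in parent]
    root = None
    for node, p in enumerate(parent):
        if p is None:
            root = node
        else:
            children[p].append(node)
    if root is None:
        raise ValueError("topology has no root")

    size_mbit = size_mb * 8  # megabytes -> megabits

    def ready(node: int) -> float:
        # A node is ready once its slowest child has become ready
        # and finished sending over the child -> node link.
        return max(
            (ready(c) + size_mbit / bw_mbps[c][node] for c in children[node]),
            default=0.0,
        )

    return ready(root)


# Three data centers; the direct link between node 2 and node 0 is slow.
bw = [
    [0, 100, 10],
    [100, 0, 100],
    [10, 100, 0],
]
star = [None, 0, 0]  # nodes 1 and 2 both send straight to root 0
tree = [None, 0, 1]  # node 2 relays through node 1

print(aggregation_time(star, bw, 100))  # 80.0
print(aggregation_time(tree, bw, 100))  # 16.0
```

In this hypothetical setup the star is bottlenecked by the single 10 Mbit/s link, while the relay tree avoids it at the cost of one extra hop. A real scheduler would also have to account for link contention, fluctuating bandwidth, and the downstream broadcast of updated parameters, which this sketch ignores.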


Source journal

IEEE/ACM Transactions on Networking (Engineering and Technology: Telecommunications)

CiteScore: 8.20

Self-citation rate: 5.40%

Articles published: 246

Review time: 4-8 weeks

Journal description: The high-level objective of the IEEE/ACM Transactions on Networking is to publish high-quality, original research results derived from theoretical or experimental exploration of communication and computer networking. It covers all sorts of information transport networks over all sorts of physical layer technologies, both wireline (all kinds of guided media, e.g., copper and optical) and wireless (e.g., radio-frequency, acoustic such as underwater, and infra-red), or hybrids of these. The journal welcomes applied contributions reporting on novel experiences and experiments with actual systems.
