A Fine-Grained Packet Loss Tolerance Transmission Algorithm for Communication Optimization in Distributed Deep Learning

IF 4.7 | CAS Zone 2 (Computer Science) | Q1 (Computer Science, Information Systems) | IEEE Transactions on Network and Service Management | Pub Date: 2024-09-16 | DOI: 10.1109/TNSM.2024.3461875
Yifei Lu; Jingqi Li; Shuren Li; Chanying Huang
{"title":"基于分布式深度学习的细粒度容错传输算法","authors":"Yifei Lu;Jingqi Li;Shuren Li;Chanying Huang","doi":"10.1109/TNSM.2024.3461875","DOIUrl":null,"url":null,"abstract":"Communication overhead is a significant challenge in distributed deep learning (DDL) training, often hindering efficiency. While existing solutions like gradient compression, compute/communication overlap, and layer-wise flow scheduling have been proposed, they are often coarse-grained and insufficient, especially under network congestion. These congestion-unaware methods can lead to long flow completion times, known as the tail latency, resulting in extended training time. In this paper, we argue that packet loss tolerance methods can mitigate the tail latency issue without sacrificing training accuracy, with the tolerance bound varying across different DDL model layers. We introduce PLOT, a fine-grained packet loss tolerance algorithm, which optimizes communication overhead by leveraging the layer-specific loss tolerance of the DNN model. PLOT employs a UDP-based transmission mechanism for gradient transfer, addressing the tail latency issue and maintaining training accuracy through packet loss tolerance. Our evaluations on both small-scale testbeds and large-scale simulations show that PLOT outperforms other congestion algorithms, effectively reducing tail latency and DDL training time.","PeriodicalId":13423,"journal":{"name":"IEEE Transactions on Network and Service Management","volume":"21 6","pages":"6112-6125"},"PeriodicalIF":4.7000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Fine-Grained Packet Loss Tolerance Transmission Algorithm for Communication Optimization in Distributed Deep Learning\",\"authors\":\"Yifei Lu;Jingqi Li;Shuren Li;Chanying Huang\",\"doi\":\"10.1109/TNSM.2024.3461875\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Communication overhead is a significant challenge in distributed deep learning (DDL) training, often hindering efficiency. While existing solutions like gradient compression, compute/communication overlap, and layer-wise flow scheduling have been proposed, they are often coarse-grained and insufficient, especially under network congestion. These congestion-unaware methods can lead to long flow completion times, known as the tail latency, resulting in extended training time. In this paper, we argue that packet loss tolerance methods can mitigate the tail latency issue without sacrificing training accuracy, with the tolerance bound varying across different DDL model layers. We introduce PLOT, a fine-grained packet loss tolerance algorithm, which optimizes communication overhead by leveraging the layer-specific loss tolerance of the DNN model. PLOT employs a UDP-based transmission mechanism for gradient transfer, addressing the tail latency issue and maintaining training accuracy through packet loss tolerance. 
Our evaluations on both small-scale testbeds and large-scale simulations show that PLOT outperforms other congestion algorithms, effectively reducing tail latency and DDL training time.\",\"PeriodicalId\":13423,\"journal\":{\"name\":\"IEEE Transactions on Network and Service Management\",\"volume\":\"21 6\",\"pages\":\"6112-6125\"},\"PeriodicalIF\":4.7000,\"publicationDate\":\"2024-09-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Network and Service Management\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10681145/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Network and Service Management","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10681145/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Communication overhead is a significant challenge in distributed deep learning (DDL) training, often hindering efficiency. While existing solutions like gradient compression, compute/communication overlap, and layer-wise flow scheduling have been proposed, they are often coarse-grained and insufficient, especially under network congestion. These congestion-unaware methods can lead to long flow completion times, known as the tail latency, resulting in extended training time. In this paper, we argue that packet loss tolerance methods can mitigate the tail latency issue without sacrificing training accuracy, with the tolerance bound varying across different DDL model layers. We introduce PLOT, a fine-grained packet loss tolerance algorithm, which optimizes communication overhead by leveraging the layer-specific loss tolerance of the DNN model. PLOT employs a UDP-based transmission mechanism for gradient transfer, addressing the tail latency issue and maintaining training accuracy through packet loss tolerance. Our evaluations on both small-scale testbeds and large-scale simulations show that PLOT outperforms other congestion algorithms, effectively reducing tail latency and DDL training time.
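The abstract describes PLOT only at a high level; the concrete packet format, tolerance bounds, and retransmission policy are not given here. The sketch below is therefore purely illustrative, not the authors' implementation: the per-layer tolerance values (LOSS_TOLERANCE), the packet size, and the simulated lossy channel standing in for a real UDP socket are all hypothetical. It shows only the core idea the abstract states, namely accepting a layer's gradient with zero-filled gaps when packet loss stays within that layer's tolerance bound and falling back to retransmission otherwise.

```python
# Illustrative sketch only: all names, thresholds, and the lossy-channel
# simulation below are hypothetical placeholders, not the PLOT protocol itself.
import numpy as np

# Hypothetical per-layer loss-tolerance bounds (fraction of gradient packets
# that may be dropped without retransmission); values are made up for the demo.
LOSS_TOLERANCE = {"conv1": 0.01, "conv2": 0.05, "fc": 0.10}

PACKET_SIZE = 256  # gradient elements per packet (illustrative)


def send_gradient(grad, drop_prob):
    """Split a layer's gradient into packets and simulate lossy UDP delivery."""
    flat = grad.ravel()
    n_packets = int(np.ceil(flat.size / PACKET_SIZE))
    received = {}
    for i in range(n_packets):
        if np.random.rand() >= drop_prob:  # packet survived the lossy link
            received[i] = flat[i * PACKET_SIZE:(i + 1) * PACKET_SIZE]
    return received, n_packets


def reassemble(layer, received, n_packets, shape):
    """Zero-fill missing packets if loss is within the layer's tolerance bound."""
    loss_rate = 1.0 - len(received) / n_packets
    if loss_rate > LOSS_TOLERANCE[layer]:
        return None  # too much loss for this layer: request retransmission
    flat = np.zeros(int(np.prod(shape)), dtype=np.float32)
    for i, chunk in received.items():
        flat[i * PACKET_SIZE:i * PACKET_SIZE + chunk.size] = chunk
    return flat.reshape(shape)


if __name__ == "__main__":
    grad = np.random.randn(64, 128).astype(np.float32)
    packets, total = send_gradient(grad, drop_prob=0.05)
    result = reassemble("fc", packets, total, grad.shape)
    print("gradient accepted" if result is not None else "retransmission needed")
```

In a real system the tolerance bounds would presumably be derived from each layer's sensitivity to gradient perturbation, which is the layer-specific, fine-grained behavior the abstract attributes to PLOT.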
Source Journal
IEEE Transactions on Network and Service Management
Computer Science - Computer Networks and Communications
CiteScore: 9.30
Self-citation rate: 15.10%
Articles published: 325
Journal description: IEEE Transactions on Network and Service Management will publish (online only) peer-reviewed, archival-quality papers that advance the state-of-the-art and practical applications of network and service management. Theoretical research contributions (presenting new concepts and techniques) and applied contributions (reporting on experiences and experiments with actual systems) will be encouraged. These transactions will focus on the key technical issues related to: Management Models, Architectures and Frameworks; Service Provisioning, Reliability and Quality Assurance; Management Functions; Enabling Technologies; Information and Communication Models; Policies; Applications and Case Studies; Emerging Technologies and Standards.
Latest articles from this journal
Table of Contents
Guest Editors’ Introduction: Special Issue on Robust and Resilient Future Communication Networks
A Novel Adaptive Device-Free Passive Indoor Fingerprinting Localization Under Dynamic Environment
HSS: A Memory-Efficient, Accurate, and Fast Network Measurement Framework in Sliding Windows