{"title":"基于分布式深度学习的细粒度容错传输算法","authors":"Yifei Lu;Jingqi Li;Shuren Li;Chanying Huang","doi":"10.1109/TNSM.2024.3461875","DOIUrl":null,"url":null,"abstract":"Communication overhead is a significant challenge in distributed deep learning (DDL) training, often hindering efficiency. While existing solutions like gradient compression, compute/communication overlap, and layer-wise flow scheduling have been proposed, they are often coarse-grained and insufficient, especially under network congestion. These congestion-unaware methods can lead to long flow completion times, known as the tail latency, resulting in extended training time. In this paper, we argue that packet loss tolerance methods can mitigate the tail latency issue without sacrificing training accuracy, with the tolerance bound varying across different DDL model layers. We introduce PLOT, a fine-grained packet loss tolerance algorithm, which optimizes communication overhead by leveraging the layer-specific loss tolerance of the DNN model. PLOT employs a UDP-based transmission mechanism for gradient transfer, addressing the tail latency issue and maintaining training accuracy through packet loss tolerance. Our evaluations on both small-scale testbeds and large-scale simulations show that PLOT outperforms other congestion algorithms, effectively reducing tail latency and DDL training time.","PeriodicalId":13423,"journal":{"name":"IEEE Transactions on Network and Service Management","volume":"21 6","pages":"6112-6125"},"PeriodicalIF":4.7000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Fine-Grained Packet Loss Tolerance Transmission Algorithm for Communication Optimization in Distributed Deep Learning\",\"authors\":\"Yifei Lu;Jingqi Li;Shuren Li;Chanying Huang\",\"doi\":\"10.1109/TNSM.2024.3461875\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Communication overhead is a significant challenge in distributed deep learning (DDL) training, often hindering efficiency. While existing solutions like gradient compression, compute/communication overlap, and layer-wise flow scheduling have been proposed, they are often coarse-grained and insufficient, especially under network congestion. These congestion-unaware methods can lead to long flow completion times, known as the tail latency, resulting in extended training time. In this paper, we argue that packet loss tolerance methods can mitigate the tail latency issue without sacrificing training accuracy, with the tolerance bound varying across different DDL model layers. We introduce PLOT, a fine-grained packet loss tolerance algorithm, which optimizes communication overhead by leveraging the layer-specific loss tolerance of the DNN model. PLOT employs a UDP-based transmission mechanism for gradient transfer, addressing the tail latency issue and maintaining training accuracy through packet loss tolerance. 
Our evaluations on both small-scale testbeds and large-scale simulations show that PLOT outperforms other congestion algorithms, effectively reducing tail latency and DDL training time.\",\"PeriodicalId\":13423,\"journal\":{\"name\":\"IEEE Transactions on Network and Service Management\",\"volume\":\"21 6\",\"pages\":\"6112-6125\"},\"PeriodicalIF\":4.7000,\"publicationDate\":\"2024-09-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Network and Service Management\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10681145/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Network and Service Management","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10681145/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Communication overhead is a significant challenge in distributed deep learning (DDL) training, often hindering efficiency. While existing solutions such as gradient compression, compute/communication overlap, and layer-wise flow scheduling have been proposed, they are often coarse-grained and insufficient, especially under network congestion. These congestion-unaware methods can lead to long flow completion times, known as tail latency, resulting in extended training time. In this paper, we argue that packet loss tolerance methods can mitigate the tail latency issue without sacrificing training accuracy, with the tolerance bound varying across different DDL model layers. We introduce PLOT, a fine-grained packet loss tolerance algorithm, which optimizes communication overhead by leveraging the layer-specific loss tolerance of the DNN model. PLOT employs a UDP-based transmission mechanism for gradient transfer, addressing the tail latency issue and maintaining training accuracy through packet loss tolerance. Our evaluations on both small-scale testbeds and large-scale simulations show that PLOT outperforms other congestion control algorithms, effectively reducing tail latency and DDL training time.
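The abstract only names the mechanism at a high level. As a rough illustration of the underlying idea, the Python sketch below models layer-specific packet loss tolerance for gradient transfer: it is not the paper's actual PLOT implementation, and the layer names, tolerance bounds, chunk size, and zero-fill recovery policy are assumptions made purely for demonstration.

```python
# Illustrative sketch only: a toy model of layer-specific packet loss tolerance
# for gradient transfer. NOT the paper's PLOT algorithm; all constants below
# are assumptions for demonstration.
import numpy as np

CHUNK = 256  # hypothetical number of gradient values carried per UDP packet

# Assumed per-layer loss-tolerance bounds (illustrative numbers only): early
# layers are treated as more tolerant to lost gradient packets than the final
# classifier layer.
LOSS_TOLERANCE = {"conv1": 0.10, "conv2": 0.05, "fc": 0.01}


def packetize(grad, chunk=CHUNK):
    """Split a flattened gradient tensor into fixed-size chunks ("packets")."""
    flat = grad.ravel()
    return [flat[i:i + chunk] for i in range(0, flat.size, chunk)]


def simulate_udp_transfer(packets, loss_rate, rng):
    """Drop each packet independently with probability loss_rate (no retransmission)."""
    return {i: p for i, p in enumerate(packets) if rng.random() >= loss_rate}


def reassemble(received, n_packets, grad_shape, chunk, tolerance):
    """Accept the gradient if enough packets arrived, zero-filling the gaps.

    Returns None when the loss fraction exceeds the layer's tolerance bound,
    signalling that the sender should fall back to reliable retransmission.
    """
    loss_frac = 1.0 - len(received) / n_packets
    if loss_frac > tolerance:
        return None
    flat = np.zeros(int(np.prod(grad_shape)), dtype=np.float32)
    for i, pkt in received.items():
        flat[i * chunk:i * chunk + pkt.size] = pkt
    return flat.reshape(grad_shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for layer, tol in LOSS_TOLERANCE.items():
        grad = rng.standard_normal((64, 128)).astype(np.float32)  # fake gradient
        pkts = packetize(grad)
        got = simulate_udp_transfer(pkts, loss_rate=0.04, rng=rng)
        out = reassemble(got, len(pkts), grad.shape, CHUNK, tol)
        status = "accepted with zero-filled gaps" if out is not None else "retransmit needed"
        print(f"{layer}: {len(got)}/{len(pkts)} packets received -> {status}")
```

In a real deployment, the per-layer tolerance bounds would be derived from the model's loss sensitivity and the gradient packets would travel over actual UDP sockets; the in-memory drop simulation above only demonstrates the accept-or-retransmit decision that such a tolerance bound enables.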
Journal Introduction:
IEEE Transactions on Network and Service Management will publish (online only) peer-reviewed, archival-quality papers that advance the state of the art and practical applications of network and service management. Theoretical research contributions (presenting new concepts and techniques) and applied contributions (reporting on experiences and experiments with actual systems) will be encouraged. These transactions will focus on the key technical issues related to: Management Models, Architectures and Frameworks; Service Provisioning, Reliability and Quality Assurance; Management Functions; Enabling Technologies; Information and Communication Models; Policies; Applications and Case Studies; Emerging Technologies and Standards.