{"title":"基于分布式深度学习的细粒度容错传输算法","authors":"Yifei Lu;Jingqi Li;Shuren Li;Chanying Huang","doi":"10.1109/TNSM.2024.3461875","DOIUrl":null,"url":null,"abstract":"Communication overhead is a significant challenge in distributed deep learning (DDL) training, often hindering efficiency. While existing solutions like gradient compression, compute/communication overlap, and layer-wise flow scheduling have been proposed, they are often coarse-grained and insufficient, especially under network congestion. These congestion-unaware methods can lead to long flow completion times, known as the tail latency, resulting in extended training time. In this paper, we argue that packet loss tolerance methods can mitigate the tail latency issue without sacrificing training accuracy, with the tolerance bound varying across different DDL model layers. We introduce PLOT, a fine-grained packet loss tolerance algorithm, which optimizes communication overhead by leveraging the layer-specific loss tolerance of the DNN model. PLOT employs a UDP-based transmission mechanism for gradient transfer, addressing the tail latency issue and maintaining training accuracy through packet loss tolerance. Our evaluations on both small-scale testbeds and large-scale simulations show that PLOT outperforms other congestion algorithms, effectively reducing tail latency and DDL training time.","PeriodicalId":13423,"journal":{"name":"IEEE Transactions on Network and Service Management","volume":"21 6","pages":"6112-6125"},"PeriodicalIF":4.7000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Fine-Grained Packet Loss Tolerance Transmission Algorithm for Communication Optimization in Distributed Deep Learning\",\"authors\":\"Yifei Lu;Jingqi Li;Shuren Li;Chanying Huang\",\"doi\":\"10.1109/TNSM.2024.3461875\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Communication overhead is a significant challenge in distributed deep learning (DDL) training, often hindering efficiency. While existing solutions like gradient compression, compute/communication overlap, and layer-wise flow scheduling have been proposed, they are often coarse-grained and insufficient, especially under network congestion. These congestion-unaware methods can lead to long flow completion times, known as the tail latency, resulting in extended training time. In this paper, we argue that packet loss tolerance methods can mitigate the tail latency issue without sacrificing training accuracy, with the tolerance bound varying across different DDL model layers. We introduce PLOT, a fine-grained packet loss tolerance algorithm, which optimizes communication overhead by leveraging the layer-specific loss tolerance of the DNN model. PLOT employs a UDP-based transmission mechanism for gradient transfer, addressing the tail latency issue and maintaining training accuracy through packet loss tolerance. 
Our evaluations on both small-scale testbeds and large-scale simulations show that PLOT outperforms other congestion algorithms, effectively reducing tail latency and DDL training time.\",\"PeriodicalId\":13423,\"journal\":{\"name\":\"IEEE Transactions on Network and Service Management\",\"volume\":\"21 6\",\"pages\":\"6112-6125\"},\"PeriodicalIF\":4.7000,\"publicationDate\":\"2024-09-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Network and Service Management\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10681145/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Network and Service Management","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10681145/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Communication overhead is a significant challenge in distributed deep learning (DDL) training, often hindering efficiency. While existing solutions such as gradient compression, compute/communication overlap, and layer-wise flow scheduling have been proposed, they are often coarse-grained and insufficient, especially under network congestion. These congestion-unaware methods can lead to long flow completion times, known as tail latency, resulting in extended training time. In this paper, we argue that packet loss tolerance methods can mitigate the tail latency issue without sacrificing training accuracy, with the tolerance bound varying across different DDL model layers. We introduce PLOT, a fine-grained packet loss tolerance algorithm, which optimizes communication overhead by leveraging the layer-specific loss tolerance of the DNN model. PLOT employs a UDP-based transmission mechanism for gradient transfer, addressing the tail latency issue and maintaining training accuracy through packet loss tolerance. Our evaluations on both small-scale testbeds and large-scale simulations show that PLOT outperforms other congestion control algorithms, effectively reducing tail latency and DDL training time.
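The abstract only names the mechanism at a high level. As a rough illustration of the underlying idea, the Python sketch below models layer-specific packet loss tolerance for gradient transfer: it is not the paper's actual PLOT implementation, and the layer names, tolerance bounds, chunk size, and zero-fill recovery policy are assumptions made purely for demonstration.

```python
# Illustrative sketch only: a toy model of layer-specific packet loss tolerance
# for gradient transfer. NOT the paper's PLOT algorithm; all constants below
# are assumptions for demonstration.
import numpy as np

CHUNK = 256  # hypothetical number of gradient values carried per UDP packet

# Assumed per-layer loss-tolerance bounds (illustrative numbers only): early
# layers are treated as more tolerant to lost gradient packets than the final
# classifier layer.
LOSS_TOLERANCE = {"conv1": 0.10, "conv2": 0.05, "fc": 0.01}


def packetize(grad, chunk=CHUNK):
    """Split a flattened gradient tensor into fixed-size chunks ("packets")."""
    flat = grad.ravel()
    return [flat[i:i + chunk] for i in range(0, flat.size, chunk)]


def simulate_udp_transfer(packets, loss_rate, rng):
    """Drop each packet independently with probability loss_rate (no retransmission)."""
    return {i: p for i, p in enumerate(packets) if rng.random() >= loss_rate}


def reassemble(received, n_packets, grad_shape, chunk, tolerance):
    """Accept the gradient if enough packets arrived, zero-filling the gaps.

    Returns None when the loss fraction exceeds the layer's tolerance bound,
    signalling that the sender should fall back to reliable retransmission.
    """
    loss_frac = 1.0 - len(received) / n_packets
    if loss_frac > tolerance:
        return None
    flat = np.zeros(int(np.prod(grad_shape)), dtype=np.float32)
    for i, pkt in received.items():
        flat[i * chunk:i * chunk + pkt.size] = pkt
    return flat.reshape(grad_shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for layer, tol in LOSS_TOLERANCE.items():
        grad = rng.standard_normal((64, 128)).astype(np.float32)  # fake gradient
        pkts = packetize(grad)
        got = simulate_udp_transfer(pkts, loss_rate=0.04, rng=rng)
        out = reassemble(got, len(pkts), grad.shape, CHUNK, tol)
        status = "accepted with zero-filled gaps" if out is not None else "retransmit needed"
        print(f"{layer}: {len(got)}/{len(pkts)} packets received -> {status}")
```

In a real deployment, the per-layer tolerance bounds would be derived from the model's loss sensitivity and the gradient packets would travel over actual UDP sockets; the in-memory drop simulation above only demonstrates the accept-or-retransmit decision that such a tolerance bound enables.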
Journal Introduction:
IEEE Transactions on Network and Service Management will publish (online only) peer-reviewed, archival-quality papers that advance the state of the art and practical applications of network and service management. Theoretical research contributions (presenting new concepts and techniques) and applied contributions (reporting on experiences and experiments with actual systems) will be encouraged. These transactions will focus on the key technical issues related to: Management Models, Architectures and Frameworks; Service Provisioning, Reliability and Quality Assurance; Management Functions; Enabling Technologies; Information and Communication Models; Policies; Applications and Case Studies; Emerging Technologies and Standards.