Xin Jin, Zhen Zhang, Yunshan Jia, Yun Ma, Xuanzhe Liu
Title: SDCC: software-defined collective communication for distributed training
Journal: Science China Information Sciences, vol. 25, no. 1
DOI: 10.1007/s11432-023-3894-4
Publication date: 2024-07-31 (Journal Article)
Impact factor: 7.3; JCR Q1 (Computer Science, Information Systems); CAS Region 2 (Computer Science)
Citations: 0
Abstract
Communication is crucial to the performance of distributed training. Today’s solutions tightly couple the control and data planes and lack flexibility, generality, and performance. In this study, we present SDCC, a software-defined collective communication framework for distributed training. SDCC is based on the principle of modern systems design to effectively decouple the control plane from the data plane. SDCC abstracts the operations for collective communication in distributed training with dataflow operations and unifies computing and communication with a single dataflow graph. The abstraction, together with the unification, is powerful: it enables users to easily express new and existing collective communication algorithms and optimizations, simplifies the integration with different computing engines (e.g., PyTorch and TensorFlow) and network transports (e.g., Linux TCP and kernel bypass), and allows the system to improve performance by exploiting parallelism exposed by the dataflow graph. We further demonstrate the benefits of SDCC in four use cases.
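To make the dataflow idea concrete, here is a minimal sketch (not SDCC's actual API, which the abstract does not specify) of how a ring all-reduce — a standard collective communication algorithm for distributed training — decomposes into explicit per-step send/reduce operations, the kind of fine-grained operations a dataflow graph could schedule and parallelize. The simulation runs in-process; `buffers`, `ring_allreduce`, and the chunk layout are illustrative names, not from the paper.

```python
def ring_allreduce(buffers):
    """Simulated ring all-reduce over n workers.

    buffers[i] is worker i's vector, pre-split into n equal chunks
    (each chunk a list of numbers). Returns each worker's flattened
    vector after the collective: the elementwise sum over workers.
    """
    n = len(buffers)
    # Deep-copy so each worker's state is independent, as on real nodes.
    chunks = [[list(c) for c in b] for b in buffers]

    # Phase 1: reduce-scatter. At step s, worker i sends chunk
    # (i - s) mod n to its ring neighbor, which reduces (adds) it in.
    # After n-1 steps, worker i holds the fully reduced chunk (i+1) mod n.
    for step in range(n - 1):
        for i in range(n):
            src = (i - step) % n          # chunk index worker i sends
            dst = (i + 1) % n             # ring neighbor
            chunks[dst][src] = [a + b for a, b in
                                zip(chunks[dst][src], chunks[i][src])]

    # Phase 2: all-gather. At step s, worker i forwards chunk
    # (i + 1 - s) mod n; the neighbor overwrites its stale copy.
    for step in range(n - 1):
        for i in range(n):
            src = (i + 1 - step) % n
            dst = (i + 1) % n
            chunks[dst][src] = list(chunks[i][src])

    # Flatten each worker's chunks back into one vector.
    return [[x for c in b for x in c] for b in chunks]
```

Each send/reduce pair above is a node in a dataflow graph; independent pairs within a step have no data dependence on one another, which is the kind of parallelism a unified computation-plus-communication graph exposes to the scheduler.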
Journal introduction:
Science China Information Sciences is a dedicated journal that showcases high-quality, original research across various domains of information sciences. It encompasses Computer Science & Technologies, Control Science & Engineering, Information & Communication Engineering, Microelectronics & Solid-State Electronics, and Quantum Information, providing a platform for the dissemination of significant contributions in these fields.