{"title":"R-FAST: Robust Fully-Asynchronous Stochastic Gradient Tracking Over General Topology","authors":"Zehan Zhu;Ye Tian;Yan Huang;Jinming Xu;Shibo He","doi":"10.1109/TSIPN.2024.3444484","DOIUrl":null,"url":null,"abstract":"We propose a Robust Fully-Asynchronous Stochastic Gradient Tracking method (R-FAST) for distributed machine learning problems over a network of nodes, where each node performs local computation and communication at its own pace without any form of synchronization. Different from existing asynchronous distributed algorithms, R-FAST can eliminate the impact of data heterogeneity across nodes on convergence performance and allow for packet losses by employing a robust gradient tracking strategy that relies on properly designed auxiliary variables for tracking and buffering the overall gradient vector. Moreover, the proposed method utilizes two spanning-tree graphs for communication so long as both share at least one common root, enabling flexible designs in communication topologies. We show that R-FAST converges in expectation to a neighborhood of the optimum with a geometric rate for smooth and strongly convex objectives; and to a stationary point with a sublinear rate for general non-convex problems. Extensive experiments demonstrate that R-FAST runs 1.5-2 times faster than synchronous benchmark algorithms, such as Ring-AllReduce and D-PSGD, while still achieving comparable accuracy, and outperforms the existing well-known asynchronous algorithms, such as AD-PSGD and OSGP, especially in the presence of stragglers.","PeriodicalId":56268,"journal":{"name":"IEEE Transactions on Signal and Information Processing over Networks","volume":"10 ","pages":"665-678"},"PeriodicalIF":3.0000,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Signal and Information Processing over Networks","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10660468/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0
Abstract
We propose a Robust Fully-Asynchronous Stochastic Gradient Tracking method (R-FAST) for distributed machine learning problems over a network of nodes, where each node performs local computation and communication at its own pace without any form of synchronization. Different from existing asynchronous distributed algorithms, R-FAST can eliminate the impact of data heterogeneity across nodes on convergence performance and allow for packet losses by employing a robust gradient tracking strategy that relies on properly designed auxiliary variables for tracking and buffering the overall gradient vector. Moreover, the proposed method utilizes two spanning-tree graphs for communication so long as both share at least one common root, enabling flexible designs in communication topologies. We show that R-FAST converges in expectation to a neighborhood of the optimum with a geometric rate for smooth and strongly convex objectives; and to a stationary point with a sublinear rate for general non-convex problems. Extensive experiments demonstrate that R-FAST runs 1.5-2 times faster than synchronous benchmark algorithms, such as Ring-AllReduce and D-PSGD, while still achieving comparable accuracy, and outperforms the existing well-known asynchronous algorithms, such as AD-PSGD and OSGP, especially in the presence of stragglers.
期刊介绍:
The IEEE Transactions on Signal and Information Processing over Networks publishes high-quality papers that extend the classical notions of processing of signals defined over vector spaces (e.g. time and space) to processing of signals and information (data) defined over networks, potentially dynamically varying. In signal processing over networks, the topology of the network may define structural relationships in the data, or may constrain processing of the data. Topics include distributed algorithms for filtering, detection, estimation, adaptation and learning, model selection, data fusion, and diffusion or evolution of information over such networks, and applications of distributed signal processing.