{"title":"A Dynamic Weighting Strategy to Mitigate Worker Node Failure in Distributed Deep Learning","authors":"Yuesheng Xu, Arielle Carr","doi":"arxiv-2409.09242","DOIUrl":null,"url":null,"abstract":"The increasing complexity of deep learning models and the demand for\nprocessing vast amounts of data make the utilization of large-scale distributed\nsystems for efficient training essential. These systems, however, face\nsignificant challenges such as communication overhead, hardware limitations,\nand node failure. This paper investigates various optimization techniques in\ndistributed deep learning, including Elastic Averaging SGD (EASGD) and the\nsecond-order method AdaHessian. We propose a dynamic weighting strategy to\nmitigate the problem of straggler nodes due to failure, enhancing the\nperformance and efficiency of the overall training process. We conduct\nexperiments with different numbers of workers and communication periods to\ndemonstrate improved convergence rates and test performance using our strategy.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"6 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09242","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
The increasing complexity of deep learning models and the demand for processing vast amounts of data make large-scale distributed systems essential for efficient training. These systems, however, face significant challenges such as communication overhead, hardware limitations, and node failure. This paper investigates optimization techniques for distributed deep learning, including Elastic Averaging SGD (EASGD) and the second-order method AdaHessian. We propose a dynamic weighting strategy to mitigate the problem of straggler nodes caused by worker failure, enhancing the performance and efficiency of the overall training process. We conduct experiments with varying numbers of workers and communication periods to demonstrate that our strategy improves convergence rates and test performance.
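
To make the abstract's setup concrete, the sketch below shows an EASGD-style training loop: each worker takes local gradient steps with an elastic pull toward a shared center variable, and the center is updated once per communication period. The toy least-squares objective, the simulated failing worker, and the weight decay/recovery rule are illustrative assumptions for this sketch only, not the paper's exact dynamic weighting strategy; with uniform weights the center update reduces to standard EASGD.

    import numpy as np

    # Toy least-squares objective so the sketch is self-contained and runnable.
    rng = np.random.default_rng(0)
    A, b = rng.normal(size=(200, 10)), rng.normal(size=200)

    def grad(x):
        """Gradient of 0.5 * ||A x - b||^2."""
        return A.T @ (A @ x - b)

    n_workers, eta, rho, tau, steps = 4, 1e-3, 0.05, 5, 200
    workers = [np.zeros(10) for _ in range(n_workers)]   # local parameters x_i
    center = np.zeros(10)                                # center variable x~
    weights = np.ones(n_workers)                         # per-worker weights (hypothetical)

    for t in range(steps):
        # Simulate a straggling/failed worker: worker 0 misses every other round.
        alive = [i != 0 or t % 2 == 0 for i in range(n_workers)]

        for i in range(n_workers):
            if alive[i]:
                # EASGD local step: gradient descent plus an elastic pull toward the center.
                workers[i] -= eta * (grad(workers[i]) + rho * (workers[i] - center))

        if t % tau == 0:  # communication period
            # Hypothetical dynamic weighting: decay the weight of a worker that missed
            # the round, recover it otherwise, then renormalize.
            for i in range(n_workers):
                weights[i] = weights[i] * 0.5 if not alive[i] else min(weights[i] + 0.1, 1.0)
            w = weights / weights.sum()

            # Weighted elastic averaging: the center moves toward a weighted mean of the
            # workers; uniform weights recover the usual EASGD center update.
            center += eta * rho * sum(n_workers * w[i] * (workers[i] - center)
                                      for i in range(n_workers))

    print("final objective:", 0.5 * np.linalg.norm(A @ center - b) ** 2)

In this sketch, down-weighting a failed worker limits how strongly its stale parameters pull on the center variable, which is the intuition behind mitigating stragglers through weighting rather than waiting for or discarding slow nodes.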