The Effectiveness of Local Updates for Decentralized Learning Under Data Heterogeneity

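IEEE Transactions on Signal Processing, vol. 73, pp. 751-765 (IF 5.8; Zone 2, Engineering & Technology; JCR Q1, ENGINEERING, ELECTRICAL & ELECTRONIC). Pub Date: 2025-01-24. DOI: 10.1109/TSP.2025.3533208. URL: https://ieeexplore.ieee.org/document/10852183/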

Tongle Wu, Zhize Li, Ying Sun


Citations: 0

Abstract



We revisit two fundamental decentralized optimization methods, Decentralized Gradient Tracking (DGT) and Decentralized Gradient Descent (DGD), with multiple local updates. We consider two settings and demonstrate that incorporating local update steps can reduce communication complexity. Specifically, for $\mu$-strongly convex and $L$-smooth loss functions, we prove that local DGT achieves communication complexity $\tilde{\mathcal{O}}\Big(\frac{L}{\mu(K+1)}+\frac{\delta+\mu}{\mu(1-\rho)}+\frac{\rho}{(1-\rho)^{2}}\cdot\frac{L+\delta}{\mu}\Big)$, where $K$ is the number of additional local updates, $\rho$ measures the network connectivity, and $\delta$ measures the second-order heterogeneity of the local losses. Our results reveal the tradeoff between communication and computation and show that increasing $K$ can effectively reduce communication costs when the data heterogeneity is low and the network is well-connected. We then consider the over-parameterization regime, where the local losses share the same minima. We prove that employing local updates in DGD, even without gradient correction, achieves exact linear convergence under the Polyak-Łojasiewicz (PL) condition, which can yield an effect similar to that of DGT in reducing communication complexity. We further specialize the result to linear models, obtaining an improved rate expression. Numerical experiments validate our theoretical results.
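To make the communication/computation tradeoff in the local DGT bound concrete, the following minimal sketch evaluates the three terms inside the $\tilde{\mathcal{O}}(\cdot)$ expression for several values of $K$. It is only an illustration: the function name `local_dgt_comm_terms` and all numeric values of $L$, $\mu$, $\delta$, $\rho$ are hypothetical choices, not taken from the paper, and constants and logarithmic factors hidden by $\tilde{\mathcal{O}}$ are ignored.

```python
def local_dgt_comm_terms(L, mu, delta, rho, K):
    """Return the three terms inside the O-tilde bound from the abstract.

    Constants and log factors are dropped, so only relative trends
    across K are meaningful. The name and signature are illustrative.
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1)")
    if K < 0:
        raise ValueError("K must be non-negative")
    local_term = L / (mu * (K + 1))                   # shrinks as K grows
    hetero_term = (delta + mu) / (mu * (1.0 - rho))   # independent of K
    network_term = rho / (1.0 - rho) ** 2 * (L + delta) / mu  # independent of K
    return local_term, hetero_term, network_term


# Hypothetical problem constants (assumptions for illustration only).
L, mu = 100.0, 1.0
scenarios = {
    "low heterogeneity, well-connected": (1.0, 0.1),    # (delta, rho)
    "high heterogeneity, poorly connected": (50.0, 0.9),
}

for name, (delta, rho) in scenarios.items():
    print(name)
    for K in (0, 1, 4, 9, 99):
        terms = local_dgt_comm_terms(L, mu, delta, rho, K)
        print(f"  K={K:3d}  local={terms[0]:8.2f}  total={sum(terms):10.2f}")
```

Under these assumed values, only the first term depends on $K$; the other two set a floor that additional local updates cannot lower, which is why the benefit is largest when $\delta$ is small and $\rho$ is close to 0.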


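Source journal: IEEE Transactions on Signal Processing (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 11.20

Self-citation rate: 9.30%

Articles published: 310

Review time: 3.0 months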

Journal description: The IEEE Transactions on Signal Processing covers novel theory, algorithms, performance analyses and applications of techniques for the processing, understanding, learning, retrieval, mining, and extraction of information from signals. The term “signal” includes, among others, audio, video, speech, image, communication, geophysical, sonar, radar, medical and musical signals. Examples of topics of interest include, but are not limited to, information processing and the theory and application of filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals.