Low-Rank Gradient Descent

Romain Cosson;Ali Jadbabaie;Anuran Makur;Amirhossein Reisizadeh;Devavrat Shah
{"title":"低阶梯度下降","authors":"Romain Cosson;Ali Jadbabaie;Anuran Makur;Amirhossein Reisizadeh;Devavrat Shah","doi":"10.1109/OJCSYS.2023.3315088","DOIUrl":null,"url":null,"abstract":"Several recent empirical studies demonstrate that important machine learning tasks such as training deep neural networks, exhibit a low-rank structure, where most of the variation in the loss function occurs only in a few directions of the input space. In this article, we leverage such low-rank structure to reduce the high computational cost of canonical gradient-based methods such as gradient descent (\n<monospace>GD</monospace>\n). Our proposed \n<italic>Low-Rank Gradient Descent</i>\n (\n<monospace>LRGD</monospace>\n) algorithm finds an \n<inline-formula><tex-math>$\\epsilon$</tex-math></inline-formula>\n-approximate stationary point of a \n<inline-formula><tex-math>$p$</tex-math></inline-formula>\n-dimensional function by first identifying \n<inline-formula><tex-math>$r \\leq p$</tex-math></inline-formula>\n significant directions, and then estimating the true \n<inline-formula><tex-math>$p$</tex-math></inline-formula>\n-dimensional gradient at every iteration by computing directional derivatives only along those \n<inline-formula><tex-math>$r$</tex-math></inline-formula>\n directions. We establish that the “directional oracle complexities” of \n<monospace>LRGD</monospace>\n for strongly convex and non-convex objective functions are \n<inline-formula><tex-math>${\\mathcal {O}}(r \\log (1/\\epsilon) + rp)$</tex-math></inline-formula>\n and \n<inline-formula><tex-math>${\\mathcal {O}}(r/\\epsilon ^{2} + rp)$</tex-math></inline-formula>\n, respectively. Therefore, when \n<inline-formula><tex-math>$r \\ll p$</tex-math></inline-formula>\n, \n<monospace>LRGD</monospace>\n provides significant improvement over the known complexities of \n<inline-formula><tex-math>${\\mathcal {O}}(p \\log (1/\\epsilon))$</tex-math></inline-formula>\n and \n<inline-formula><tex-math>${\\mathcal {O}}(p/\\epsilon ^{2})$</tex-math></inline-formula>\n of \n<monospace>GD</monospace>\n in the strongly convex and non-convex settings, respectively. Furthermore, we formally characterize the classes of exactly and approximately low-rank functions. Empirically, using real and synthetic data, \n<monospace>LRGD</monospace>\n provides significant gains over \n<monospace>GD</monospace>\n when the data has low-rank structure, and in the absence of such structure, \n<monospace>LRGD</monospace>\n does not degrade performance compared to \n<monospace>GD</monospace>\n. This suggests that \n<monospace>LRGD</monospace>\n could be used in practice in any setting in place of \n<monospace>GD</monospace>\n.","PeriodicalId":73299,"journal":{"name":"IEEE open journal of control systems","volume":"2 ","pages":"380-395"},"PeriodicalIF":0.0000,"publicationDate":"2023-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/9552933/9973428/10250907.pdf","citationCount":"0","resultStr":"{\"title\":\"Low-Rank Gradient Descent\",\"authors\":\"Romain Cosson;Ali Jadbabaie;Anuran Makur;Amirhossein Reisizadeh;Devavrat Shah\",\"doi\":\"10.1109/OJCSYS.2023.3315088\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Several recent empirical studies demonstrate that important machine learning tasks such as training deep neural networks, exhibit a low-rank structure, where most of the variation in the loss function occurs only in a few directions of the input space. 
In this article, we leverage such low-rank structure to reduce the high computational cost of canonical gradient-based methods such as gradient descent (\\n<monospace>GD</monospace>\\n). Our proposed \\n<italic>Low-Rank Gradient Descent</i>\\n (\\n<monospace>LRGD</monospace>\\n) algorithm finds an \\n<inline-formula><tex-math>$\\\\epsilon$</tex-math></inline-formula>\\n-approximate stationary point of a \\n<inline-formula><tex-math>$p$</tex-math></inline-formula>\\n-dimensional function by first identifying \\n<inline-formula><tex-math>$r \\\\leq p$</tex-math></inline-formula>\\n significant directions, and then estimating the true \\n<inline-formula><tex-math>$p$</tex-math></inline-formula>\\n-dimensional gradient at every iteration by computing directional derivatives only along those \\n<inline-formula><tex-math>$r$</tex-math></inline-formula>\\n directions. We establish that the “directional oracle complexities” of \\n<monospace>LRGD</monospace>\\n for strongly convex and non-convex objective functions are \\n<inline-formula><tex-math>${\\\\mathcal {O}}(r \\\\log (1/\\\\epsilon) + rp)$</tex-math></inline-formula>\\n and \\n<inline-formula><tex-math>${\\\\mathcal {O}}(r/\\\\epsilon ^{2} + rp)$</tex-math></inline-formula>\\n, respectively. Therefore, when \\n<inline-formula><tex-math>$r \\\\ll p$</tex-math></inline-formula>\\n, \\n<monospace>LRGD</monospace>\\n provides significant improvement over the known complexities of \\n<inline-formula><tex-math>${\\\\mathcal {O}}(p \\\\log (1/\\\\epsilon))$</tex-math></inline-formula>\\n and \\n<inline-formula><tex-math>${\\\\mathcal {O}}(p/\\\\epsilon ^{2})$</tex-math></inline-formula>\\n of \\n<monospace>GD</monospace>\\n in the strongly convex and non-convex settings, respectively. Furthermore, we formally characterize the classes of exactly and approximately low-rank functions. Empirically, using real and synthetic data, \\n<monospace>LRGD</monospace>\\n provides significant gains over \\n<monospace>GD</monospace>\\n when the data has low-rank structure, and in the absence of such structure, \\n<monospace>LRGD</monospace>\\n does not degrade performance compared to \\n<monospace>GD</monospace>\\n. This suggests that \\n<monospace>LRGD</monospace>\\n could be used in practice in any setting in place of \\n<monospace>GD</monospace>\\n.\",\"PeriodicalId\":73299,\"journal\":{\"name\":\"IEEE open journal of control systems\",\"volume\":\"2 \",\"pages\":\"380-395\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-09-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ieeexplore.ieee.org/iel7/9552933/9973428/10250907.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE open journal of control systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10250907/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE open journal of control systems","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10250907/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Several recent empirical studies demonstrate that important machine learning tasks, such as training deep neural networks, exhibit a low-rank structure, where most of the variation in the loss function occurs only in a few directions of the input space. In this article, we leverage such low-rank structure to reduce the high computational cost of canonical gradient-based methods such as gradient descent (GD). Our proposed Low-Rank Gradient Descent (LRGD) algorithm finds an $\epsilon$-approximate stationary point of a $p$-dimensional function by first identifying $r \leq p$ significant directions, and then estimating the true $p$-dimensional gradient at every iteration by computing directional derivatives only along those $r$ directions. We establish that the "directional oracle complexities" of LRGD for strongly convex and non-convex objective functions are ${\mathcal{O}}(r \log(1/\epsilon) + rp)$ and ${\mathcal{O}}(r/\epsilon^{2} + rp)$, respectively. Therefore, when $r \ll p$, LRGD provides a significant improvement over the known complexities of ${\mathcal{O}}(p \log(1/\epsilon))$ and ${\mathcal{O}}(p/\epsilon^{2})$ for GD in the strongly convex and non-convex settings, respectively. Furthermore, we formally characterize the classes of exactly and approximately low-rank functions. Empirically, on real and synthetic data, LRGD provides significant gains over GD when the data has low-rank structure, and in the absence of such structure, LRGD does not degrade performance compared to GD. This suggests that LRGD could be used in place of GD in practice in any setting.
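
The abstract describes LRGD as a two-phase procedure: identify $r$ significant directions, then run gradient descent using only $r$ directional-derivative queries per iteration. The Python sketch below is a hypothetical illustration of that structure, not the authors' reference implementation: the choice to estimate the significant directions from the top singular vectors of a few sampled finite-difference gradients, the forward-difference oracle, the function name `lrgd`, and the test quadratic are all assumptions made here for concreteness.

```python
import numpy as np

def lrgd(f, x0, r, step_size=0.1, tol=1e-5, max_iters=1000, fd_eps=1e-7, seed=0):
    """Hypothetical sketch of Low-Rank Gradient Descent (LRGD)."""
    p = x0.size
    rng = np.random.default_rng(seed)

    def directional_derivative(x, u):
        # Forward-difference directional-derivative oracle along unit direction u.
        return (f(x + fd_eps * u) - f(x)) / fd_eps

    def full_gradient(x):
        # Full finite-difference gradient: p oracle calls.
        return np.array([directional_derivative(x, e) for e in np.eye(p)])

    # Phase 1: identify r <= p significant directions from r sampled gradients
    # (about r*p oracle calls, matching the additive rp term in the stated complexity).
    G = np.stack([full_gradient(x0 + 0.1 * rng.standard_normal(p)) for _ in range(r)], axis=1)
    U, _, _ = np.linalg.svd(G, full_matrices=False)    # p x r orthonormal basis

    # Phase 2: each iteration uses only r directional derivatives instead of p.
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_iters):
        d = np.array([directional_derivative(x, U[:, j]) for j in range(r)])
        grad_est = U @ d                                # low-rank estimate of the gradient
        if np.linalg.norm(grad_est) < tol:
            break
        x -= step_size * grad_est
    return x

# Example: a 100-dimensional quadratic whose loss varies only in 2 directions.
if __name__ == "__main__":
    A = np.zeros((100, 100))
    A[0, 0], A[1, 1] = 1.0, 2.0
    f = lambda x: 0.5 * x @ A @ x
    x_star = lrgd(f, x0=np.ones(100), r=2)
    print(np.round(x_star[:3], 4))   # first two coordinates driven toward zero; the rest untouched
```

In this toy setting each LRGD iteration costs 2 directional-derivative queries rather than 100, which is the source of the $r$-versus-$p$ gap in the oracle complexities quoted above.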