Low-Rank Gradient Descent
Romain Cosson; Ali Jadbabaie; Anuran Makur; Amirhossein Reisizadeh; Devavrat Shah
IEEE Open Journal of Control Systems, vol. 2, pp. 380-395, published 2023-09-13.
DOI: 10.1109/OJCSYS.2023.3315088
PDF: https://ieeexplore.ieee.org/iel7/9552933/9973428/10250907.pdf
Citations: 0
Abstract
Several recent empirical studies demonstrate that important machine learning tasks, such as training deep neural networks, exhibit low-rank structure, where most of the variation in the loss function occurs in only a few directions of the input space. In this article, we leverage such low-rank structure to reduce the high computational cost of canonical gradient-based methods such as gradient descent (GD). Our proposed Low-Rank Gradient Descent (LRGD) algorithm finds an $\epsilon$-approximate stationary point of a $p$-dimensional function by first identifying $r \leq p$ significant directions, and then estimating the true $p$-dimensional gradient at every iteration by computing directional derivatives only along those $r$ directions. We establish that the "directional oracle complexities" of LRGD for strongly convex and non-convex objective functions are $\mathcal{O}(r \log(1/\epsilon) + rp)$ and $\mathcal{O}(r/\epsilon^{2} + rp)$, respectively. Therefore, when $r \ll p$, LRGD provides a significant improvement over the known complexities of $\mathcal{O}(p \log(1/\epsilon))$ and $\mathcal{O}(p/\epsilon^{2})$ for GD in the strongly convex and non-convex settings, respectively. Furthermore, we formally characterize the classes of exactly and approximately low-rank functions. Empirically, on real and synthetic data, LRGD provides significant gains over GD when the data has low-rank structure, and in the absence of such structure, LRGD does not degrade performance compared to GD. This suggests that LRGD could be used in practice in place of GD in any setting.
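To make the two-phase procedure described in the abstract concrete, below is a minimal Python sketch of LRGD. It is an illustration assembled only from the abstract's description, not the authors' implementation: the subspace-identification step (finite-difference gradients at a few probe points followed by a truncated SVD), the directional-derivative oracle, and all names and parameters (lrgd, finite_diff_directional, n_probe, step_size) are assumptions made here for clarity.

```python
import numpy as np

def finite_diff_directional(f, x, v, h=1e-6):
    """Directional derivative of f at x along unit vector v (one oracle call)."""
    return (f(x + h * v) - f(x)) / h

def lrgd(f, x0, r, step_size=0.1, n_probe=None, n_iters=1000, tol=1e-6, seed=0):
    """Hypothetical sketch of Low-Rank Gradient Descent (LRGD):
    (1) identify r significant directions, then
    (2) descend using gradient estimates built from r directional derivatives
    per iteration. The paper's actual subspace-identification step may differ."""
    p = x0.shape[0]
    n_probe = n_probe or r
    rng = np.random.default_rng(seed)
    I = np.eye(p)

    # Phase 1 (assumed): approximate full gradients at a few probe points using
    # p coordinate directional derivatives each (~ r*p oracle calls in total,
    # consistent with the additive rp term in the stated complexities), then
    # keep the top-r left singular vectors as the significant directions.
    probes = x0 + rng.standard_normal((n_probe, p))
    G = np.array([[finite_diff_directional(f, z, I[:, i]) for i in range(p)]
                  for z in probes])                         # (n_probe, p)
    U = np.linalg.svd(G.T, full_matrices=False)[0][:, :r]   # (p, r) orthonormal basis

    # Phase 2: gradient descent with only r directional derivatives per step.
    x = x0.astype(float)
    for _ in range(n_iters):
        d = np.array([finite_diff_directional(f, x, U[:, j]) for j in range(r)])
        g_hat = U @ d                                        # low-rank gradient estimate
        if np.linalg.norm(g_hat) <= tol:                     # approximate stationarity
            break
        x = x - step_size * g_hat
    return x

# Toy usage: a quadratic on R^100 that varies only along its first two coordinates.
A = np.diag([1.0, 2.0] + [0.0] * 98)
f = lambda x: 0.5 * x @ A @ x
x_hat = lrgd(f, x0=np.ones(100), r=2)  # drives the first two coordinates toward 0
```

In this sketch, Phase 1 accounts for the additive $rp$ term in the stated complexities, while Phase 2 spends only $r$ directional-derivative calls per iteration instead of the $p$ calls a full gradient would require, which is where the $r \log(1/\epsilon)$ versus $p \log(1/\epsilon)$ (and $r/\epsilon^2$ versus $p/\epsilon^2$) savings would come from.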