Gradient descent for deep matrix factorization: Dynamics and implicit bias towards low rank

Applied and Computational Harmonic Analysis (IF 3.2; Q1, Mathematics, Applied; Region 2, Mathematics) Pub Date: 2023-09-06 DOI: 10.1016/j.acha.2023.101595

Hung-Hsu Chou, Carsten Gieshoff, Johannes Maly, Holger Rauhut

Published in volume 68 as Article 101595. Article page: https://www.sciencedirect.com/science/article/pii/S1063520323000829

Citations: 0

Abstract

In deep learning, it is common to use more network parameters than training points. In such an over-parameterized scenario, there are usually multiple networks that achieve zero training error, so that the training algorithm induces an implicit bias on the computed solution. In practice, (stochastic) gradient descent tends to prefer solutions that generalize well, which provides a possible explanation of the success of deep learning. In this paper we analyze the dynamics of gradient descent in the simplified setting of linear networks and of an estimation problem. Although we are not in an over-parameterized scenario, our analysis nevertheless provides insights into the phenomenon of implicit bias. In fact, we derive a rigorous analysis of the dynamics of vanilla gradient descent and characterize the dynamical convergence of the spectrum. We are able to accurately locate time intervals where the effective rank of the iterates is close to the effective rank of a low-rank projection of the ground-truth matrix. In practice, those intervals can be used as criteria for early stopping if a certain regularity is desired. We also provide empirical evidence for implicit bias in more general scenarios, such as matrix sensing and random initialization. This suggests that deep learning prefers trajectories whose complexity (measured in terms of effective rank) is monotonically increasing, which we believe is a fundamental concept for the theoretical understanding of deep learning.
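To make the setting in the abstract concrete, the following is a minimal sketch of vanilla gradient descent on a deep matrix factorization, with the effective rank of the iterates recorded at every step. It is an illustration under stated assumptions, not the authors' code: the function names `effective_rank` and `deep_factorization_gd`, the scaled-identity initialization `alpha * I`, the squared Frobenius loss, the step size, and the toy ground-truth matrix are all hypothetical choices, and the effective rank is taken to be the ratio of nuclear norm to spectral norm, which is one common definition.

```python
import numpy as np


def effective_rank(W):
    """Effective rank as nuclear norm / spectral norm (assumed definition)."""
    s = np.linalg.svd(W, compute_uv=False)
    return s.sum() / s.max()


def chain_product(factors, n):
    """Return W_k ... W_1 for factors = [W_1, ..., W_k] (identity if empty)."""
    prod = np.eye(n)
    for F in factors:
        prod = F @ prod
    return prod


def deep_factorization_gd(W_hat, depth=3, alpha=0.1, step=0.05, iters=5000):
    """Gradient descent on 0.5 * ||W_N ... W_1 - W_hat||_F^2.

    All factors start at alpha * I (an assumption made for this sketch).
    Returns the final product and the effective rank of every iterate.
    """
    n = W_hat.shape[0]
    factors = [alpha * np.eye(n) for _ in range(depth)]
    ranks = []
    for _ in range(iters):
        prod = chain_product(factors, n)
        ranks.append(effective_rank(prod))
        residual = prod - W_hat
        grads = []
        for j in range(depth):
            left = chain_product(factors[j + 1:], n)   # W_N ... W_{j+1}
            right = chain_product(factors[:j], n)      # W_{j-1} ... W_1
            # Gradient of the loss with respect to W_j.
            grads.append(left.T @ residual @ right.T)
        # Simultaneous update of all factors.
        factors = [F - step * G for F, G in zip(factors, grads)]
    return chain_product(factors, n), ranks


# Toy ground truth (hypothetical): two nonzero singular values, 1.0 and 0.5.
W_hat = np.diag([1.0, 0.5, 0.0, 0.0])
W_final, ranks = deep_factorization_gd(W_hat)
print("initial effective rank:", ranks[0])
print("smallest effective rank along the trajectory:", min(ranks))
print("final effective rank:", ranks[-1])
```

In this toy run the scaled-identity start has effective rank 4, the trajectory then passes through a stretch where the effective rank is close to 1 (only the leading singular value has been fitted), and it finally settles near 1.5, the effective rank of the ground truth. Stopping inside the intermediate stretch would return an approximately rank-one estimate, which is the kind of early-stopping criterion the abstract describes.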


Source journal

Applied and Computational Harmonic Analysis (Physics; Physics, Mathematical)

CiteScore: 5.40

Self-citation rate: 4.00%

Articles published: (not given)

Review time: 22.9 weeks

About the journal: Applied and Computational Harmonic Analysis (ACHA) is an interdisciplinary journal that publishes high-quality papers in all areas of mathematical sciences related to the applied and computational aspects of harmonic analysis, with special emphasis on innovative theoretical development, methods, and algorithms for information processing, manipulation, understanding, and so forth. The objectives of the journal are to chronicle the important publications in the rapidly growing field of data representation and analysis, to stimulate research in relevant interdisciplinary areas, and to provide a common link among mathematical, physical, and life scientists, as well as engineers.