The Limiting Dynamics of SGD: Modified Loss, Phase-Space Oscillations, and Anomalous Diffusion

IF 2.7 4区 计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Neural Computation Pub Date : 2023-12-12 DOI:10.1162/neco_a_01626
Daniel Kunin;Javier Sagastuy-Brena;Lauren Gillespie;Eshed Margalit;Hidenori Tanaka;Surya Ganguli;Daniel L. K. Yamins
{"title":"The Limiting Dynamics of SGD: Modified Loss, Phase-Space Oscillations, and Anomalous Diffusion","authors":"Daniel Kunin;Javier Sagastuy-Brena;Lauren Gillespie;Eshed Margalit;Hidenori Tanaka;Surya Ganguli;Daniel L. K. Yamins","doi":"10.1162/neco_a_01626","DOIUrl":null,"url":null,"abstract":"In this work, we explore the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD). As observed previously, long after performance has converged, networks continue to move through parameter space by a process of anomalous diffusion in which distance traveled grows as a power law in the number of gradient updates with a nontrivial exponent. We reveal an intricate interaction among the hyperparameters of optimization, the structure in the gradient noise, and the Hessian matrix at the end of training that explains this anomalous diffusion. To build this understanding, we first derive a continuous-time model for SGD with finite learning rates and batch sizes as an underdamped Langevin equation. We study this equation in the setting of linear regression, where we can derive exact, analytic expressions for the phase-space dynamics of the parameters and their instantaneous velocities from initialization to stationarity. Using the Fokker-Planck equation, we show that the key ingredient driving these dynamics is not the original training loss but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents that cause oscillations in phase space. We identify qualitative and quantitative predictions of this theory in the dynamics of a ResNet-18 model trained on ImageNet. Through the lens of statistical physics, we uncover a mechanistic origin for the anomalous limiting dynamics of deep neural networks trained with SGD. Understanding the limiting dynamics of SGD, and its dependence on various important hyperparameters like batch size, learning rate, and momentum, can serve as a basis for future work that can turn these insights into algorithmic gains.","PeriodicalId":54731,"journal":{"name":"Neural Computation","volume":null,"pages":null},"PeriodicalIF":2.7000,"publicationDate":"2023-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Computation","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10535090/","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

In this work, we explore the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD). As observed previously, long after performance has converged, networks continue to move through parameter space by a process of anomalous diffusion in which distance traveled grows as a power law in the number of gradient updates with a nontrivial exponent. We reveal an intricate interaction among the hyperparameters of optimization, the structure in the gradient noise, and the Hessian matrix at the end of training that explains this anomalous diffusion. To build this understanding, we first derive a continuous-time model for SGD with finite learning rates and batch sizes as an underdamped Langevin equation. We study this equation in the setting of linear regression, where we can derive exact, analytic expressions for the phase-space dynamics of the parameters and their instantaneous velocities from initialization to stationarity. Using the Fokker-Planck equation, we show that the key ingredient driving these dynamics is not the original training loss but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents that cause oscillations in phase space. We identify qualitative and quantitative predictions of this theory in the dynamics of a ResNet-18 model trained on ImageNet. Through the lens of statistical physics, we uncover a mechanistic origin for the anomalous limiting dynamics of deep neural networks trained with SGD. Understanding the limiting dynamics of SGD, and its dependence on various important hyperparameters like batch size, learning rate, and momentum, can serve as a basis for future work that can turn these insights into algorithmic gains.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
SGD 的极限动力学:修正损失、相空间振荡和反常扩散。
在这项研究中,我们探索了使用随机梯度下降(SGD)训练的深度神经网络的极限动态。正如之前所观察到的,在性能收敛后的很长一段时间内,网络会通过异常扩散过程继续在参数空间中移动,在这个过程中,移动距离会随着梯度更新次数的幂律增长而增长,并具有一个非微妙的指数。我们揭示了优化的超参数、梯度噪声的结构以及训练结束时的黑森矩阵之间错综复杂的相互作用,从而解释了这种反常扩散。为了建立这种理解,我们首先推导出一个具有有限学习率和批量大小的 SGD 连续时间模型,即欠阻尼朗文方程。我们在线性回归的背景下研究这个方程,从而得出参数的相空间动态及其从初始化到静止的瞬时速度的精确解析表达式。利用福克-普朗克方程,我们证明了驱动这些动态变化的关键因素不是原始训练损失,而是修正损失(隐含正则化速度)与导致相空间振荡的概率电流的组合。我们在基于 ImageNet 训练的 ResNet-18 模型的动态过程中确定了这一理论的定性和定量预测。通过统计物理学的视角,我们发现了使用 SGD 训练的深度神经网络异常极限动力学的机理起源。了解 SGD 的极限动力学及其对批量大小、学习率和动量等各种重要超参数的依赖性,可以作为未来工作的基础,将这些见解转化为算法收益。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Neural Computation
Neural Computation 工程技术-计算机:人工智能
CiteScore
6.30
自引率
3.40%
发文量
83
审稿时长
3.0 months
期刊介绍: Neural Computation is uniquely positioned at the crossroads between neuroscience and TMCS and welcomes the submission of original papers from all areas of TMCS, including: Advanced experimental design; Analysis of chemical sensor data; Connectomic reconstructions; Analysis of multielectrode and optical recordings; Genetic data for cell identity; Analysis of behavioral data; Multiscale models; Analysis of molecular mechanisms; Neuroinformatics; Analysis of brain imaging data; Neuromorphic engineering; Principles of neural coding, computation, circuit dynamics, and plasticity; Theories of brain function.
期刊最新文献
Associative Learning and Active Inference. Deep Nonnegative Matrix Factorization with Beta Divergences. KLIF: An Optimized Spiking Neuron Unit for Tuning Surrogate Gradient Function. ℓ 1 -Regularized ICA: A Novel Method for Analysis of Task-Related fMRI Data. Latent Space Bayesian Optimization With Latent Data Augmentation for Enhanced Exploration.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1