Geometric Dynamics of Signal Propagation Predict Trainability of Transformers

arXiv - PHYS - Disordered Systems and Neural Networks. Pub Date: 2024-03-05. DOI: arxiv-2403.02579

Aditya Cowsik, Tamra Nebabu, Xiao-Liang Qi, Surya Ganguli



Abstract

We investigate forward signal propagation and gradient backpropagation in deep, randomly initialized transformers, yielding simple necessary and sufficient conditions on initialization hyperparameters that ensure trainability of deep transformers. Our approach treats the evolution of the representations of $n$ tokens, as they propagate through the transformer layers, as a discrete-time dynamical system of $n$ interacting particles. We derive simple update equations for the evolving geometry of this particle system, starting from a permutation-symmetric simplex. Our update equations show that without MLP layers, this system will collapse to a line, consistent with prior work on rank collapse in transformers. However, unlike prior work, our evolution equations can quantitatively track particle geometry in the additional presence of nonlinear MLP layers, and they reveal an order-chaos phase transition as a function of initialization hyperparameters, such as the strength of attentional and MLP residual connections and the weight variances. In the ordered phase the particles are attractive and collapse to a line, while in the chaotic phase the particles are repulsive and converge to a regular $n$-simplex. We analytically derive two Lyapunov exponents: an angle exponent that governs departures from the edge of chaos in this particle system, and a gradient exponent that governs the rate of exponential growth or decay of backpropagated gradients. We show through experiments that, remarkably, the final test loss at the end of training is well predicted just by these two exponents at the beginning of training, and that the simultaneous vanishing of these two exponents yields a simple necessary and sufficient condition to achieve minimal test loss.
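
The collapse-to-a-line behavior without MLP layers can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the authors' update equations: it uses a single-head, pre-LayerNorm, attention-only residual block with fresh Gaussian weights at every layer, and tracks the mean pairwise cosine similarity between token representations as a proxy for the particle geometry. The function names and the parameters `alpha` (residual strength), `sigma_qk`, and `sigma_v` (weight scales) are hypothetical choices made for illustration; random Gaussian tokens stand in for the permutation-symmetric starting simplex.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each token (row) to zero mean and unit variance across features."""
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)


def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def mean_pairwise_cosine(x):
    """Average cosine similarity over all ordered pairs of distinct tokens."""
    unit = x / np.linalg.norm(x, axis=-1, keepdims=True)
    gram = unit @ unit.T
    n = x.shape[0]
    # The diagonal of gram is all ones, so subtracting n removes the self-pairs.
    return (gram.sum() - n) / (n * (n - 1))


def attention_only_trajectory(n=8, d=64, depth=30, alpha=1.0,
                              sigma_qk=0.5, sigma_v=1.0, seed=0):
    """Propagate n random tokens through `depth` attention-only residual layers.

    Assumed (illustrative) layer: x <- x + alpha * softmax(q k^T / sqrt(d)) v,
    with q, k, v computed from layer_norm(x). No MLP block is applied.
    Returns the mean pairwise cosine similarity at the input and after each layer.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))  # nearly orthogonal tokens when d >> n
    sims = [mean_pairwise_cosine(x)]
    for _ in range(depth):
        # Fresh random weights per layer, with entries of variance sigma^2 / d.
        wq = rng.standard_normal((d, d)) * sigma_qk / np.sqrt(d)
        wk = rng.standard_normal((d, d)) * sigma_qk / np.sqrt(d)
        wv = rng.standard_normal((d, d)) * sigma_v / np.sqrt(d)
        h = layer_norm(x)
        scores = (h @ wq) @ (h @ wk).T / np.sqrt(d)
        x = x + alpha * softmax(scores) @ (h @ wv)
        sims.append(mean_pairwise_cosine(x))
    return sims


if __name__ == "__main__":
    sims = attention_only_trajectory()
    print(f"mean pairwise cosine at input:      {sims[0]:.3f}")
    print(f"mean pairwise cosine after 30 layers: {sims[-1]:.3f}")
```

In this sketch every attention layer adds a nearly identical vector to all tokens, so the shared component of the representations grows with depth while the differences between tokens stay comparatively small, and the pairwise cosine similarity drifts toward 1. Probing the chaotic phase described in the abstract would require adding a nonlinear MLP block, which this sketch deliberately omits.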
