Dynamical Mean-Field Theory of Self-Attention Neural Networks

Ángel Poc-López, Miguel Aguilera
{"title":"Dynamical Mean-Field Theory of Self-Attention Neural Networks","authors":"Ángel Poc-López, Miguel Aguilera","doi":"arxiv-2406.07247","DOIUrl":null,"url":null,"abstract":"Transformer-based models have demonstrated exceptional performance across\ndiverse domains, becoming the state-of-the-art solution for addressing\nsequential machine learning problems. Even though we have a general\nunderstanding of the fundamental components in the transformer architecture,\nlittle is known about how they operate or what are their expected dynamics.\nRecently, there has been an increasing interest in exploring the relationship\nbetween attention mechanisms and Hopfield networks, promising to shed light on\nthe statistical physics of transformer networks. However, to date, the\ndynamical regimes of transformer-like models have not been studied in depth. In\nthis paper, we address this gap by using methods for the study of asymmetric\nHopfield networks in nonequilibrium regimes --namely path integral methods over\ngenerating functionals, yielding dynamics governed by concurrent mean-field\nvariables. Assuming 1-bit tokens and weights, we derive analytical\napproximations for the behavior of large self-attention neural networks coupled\nto a softmax output, which become exact in the large limit size. Our findings\nreveal nontrivial dynamical phenomena, including nonequilibrium phase\ntransitions associated with chaotic bifurcations, even for very simple\nconfigurations with a few encoded features and a very short context window.\nFinally, we discuss the potential of our analytic approach to improve our\nunderstanding of the inner workings of transformer models, potentially reducing\ncomputational training costs and enhancing model interpretability.","PeriodicalId":501066,"journal":{"name":"arXiv - PHYS - Disordered Systems and Neural Networks","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - PHYS - Disordered Systems and Neural Networks","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2406.07247","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Transformer-based models have demonstrated exceptional performance across diverse domains, becoming the state-of-the-art solution for addressing sequential machine learning problems. Even though we have a general understanding of the fundamental components of the transformer architecture, little is known about how they operate or what their expected dynamics are. Recently, there has been increasing interest in exploring the relationship between attention mechanisms and Hopfield networks, promising to shed light on the statistical physics of transformer networks. However, to date, the dynamical regimes of transformer-like models have not been studied in depth. In this paper, we address this gap by using methods developed for the study of asymmetric Hopfield networks in nonequilibrium regimes, namely path integral methods over generating functionals, yielding dynamics governed by concurrent mean-field variables. Assuming 1-bit tokens and weights, we derive analytical approximations for the behavior of large self-attention neural networks coupled to a softmax output, which become exact in the large size limit. Our findings reveal nontrivial dynamical phenomena, including nonequilibrium phase transitions associated with chaotic bifurcations, even for very simple configurations with only a few encoded features and a very short context window. Finally, we discuss the potential of our analytic approach to improve our understanding of the inner workings of transformer models, potentially reducing computational training costs and enhancing model interpretability.
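
The setup described in the abstract (1-bit tokens and weights, a self-attention readout coupled to a softmax, and a short context window) can be illustrated with a small toy simulation. The sketch below is an assumption-laden illustration, not the paper's model: the network size N, context length, inverse temperature beta, the sliding-window stochastic update of the next token, and the magnetization-like order parameter are all choices made here for illustration. The paper's analytical mean-field treatment via generating functionals is not reproduced.

```python
# Toy sketch (illustrative only): self-attention dynamics with 1-bit (+/-1) tokens
# and weights, a softmax over the context, and a stochastic 1-bit output update.
import numpy as np

rng = np.random.default_rng(0)

N = 500      # units per token (assumed)
T_ctx = 3    # short context window, as in the abstract (length assumed)
beta = 1.5   # inverse temperature of the stochastic update (assumed)
steps = 50   # number of generation steps (assumed)

# 1-bit query/key/value weight matrices
W_q = rng.choice([-1, 1], size=(N, N))
W_k = rng.choice([-1, 1], size=(N, N))
W_v = rng.choice([-1, 1], size=(N, N))

# initial context of 1-bit tokens
x = rng.choice([-1, 1], size=(T_ctx, N))

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

m_trace = []
for _ in range(steps):
    q = W_q @ x[-1] / np.sqrt(N)    # query from the most recent token
    k = x @ W_k.T / np.sqrt(N)      # keys from the whole context
    v = x @ W_v.T / np.sqrt(N)      # values from the whole context
    attn = softmax(k @ q)           # attention weights over the context
    h = attn @ v                    # attention readout: a field on each unit
    p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * h))          # Glauber-like flip probability
    new_token = np.where(rng.random(N) < p_up, 1, -1)      # stochastic 1-bit next token
    x = np.vstack([x[1:], new_token])                      # slide the context window
    m_trace.append(new_token.mean())                       # magnetization-like order parameter

print("mean magnetization over final 10 steps:", float(np.mean(m_trace[-10:])))
```

Sweeping beta (or the weight statistics) in such a toy simulation and watching how the order-parameter trajectory changes is one empirical way to probe the kind of qualitative regime changes the abstract refers to; the quantitative phase diagram and the chaotic bifurcations reported in the paper come from its path-integral mean-field analysis, not from this sketch.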