SOAP: Improving and Stabilizing Shampoo using Adam

Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, Sham Kakade
{"title":"SOAP:使用亚当改进和稳定洗发水","authors":"Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, Sham Kakade","doi":"arxiv-2409.11321","DOIUrl":null,"url":null,"abstract":"There is growing evidence of the effectiveness of Shampoo, a higher-order\npreconditioning method, over Adam in deep learning optimization tasks. However,\nShampoo's drawbacks include additional hyperparameters and computational\noverhead when compared to Adam, which only updates running averages of first-\nand second-moment quantities. This work establishes a formal connection between\nShampoo (implemented with the 1/2 power) and Adafactor -- a memory-efficient\napproximation of Adam -- showing that Shampoo is equivalent to running\nAdafactor in the eigenbasis of Shampoo's preconditioner. This insight leads to\nthe design of a simpler and computationally efficient algorithm:\n$\\textbf{S}$hampo$\\textbf{O}$ with $\\textbf{A}$dam in the\n$\\textbf{P}$reconditioner's eigenbasis (SOAP). With regards to improving Shampoo's computational efficiency, the most\nstraightforward approach would be to simply compute Shampoo's\neigendecomposition less frequently. Unfortunately, as our empirical results\nshow, this leads to performance degradation that worsens with this frequency.\nSOAP mitigates this degradation by continually updating the running average of\nthe second moment, just as Adam does, but in the current (slowly changing)\ncoordinate basis. Furthermore, since SOAP is equivalent to running Adam in a\nrotated space, it introduces only one additional hyperparameter (the\npreconditioning frequency) compared to Adam. We empirically evaluate SOAP on\nlanguage model pre-training with 360m and 660m sized models. In the large batch\nregime, SOAP reduces the number of iterations by over 40% and wall clock time\nby over 35% compared to AdamW, with approximately 20% improvements in both\nmetrics compared to Shampoo. An implementation of SOAP is available at\nhttps://github.com/nikhilvyas/SOAP.","PeriodicalId":501301,"journal":{"name":"arXiv - CS - Machine Learning","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SOAP: Improving and Stabilizing Shampoo using Adam\",\"authors\":\"Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, Sham Kakade\",\"doi\":\"arxiv-2409.11321\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"There is growing evidence of the effectiveness of Shampoo, a higher-order\\npreconditioning method, over Adam in deep learning optimization tasks. However,\\nShampoo's drawbacks include additional hyperparameters and computational\\noverhead when compared to Adam, which only updates running averages of first-\\nand second-moment quantities. This work establishes a formal connection between\\nShampoo (implemented with the 1/2 power) and Adafactor -- a memory-efficient\\napproximation of Adam -- showing that Shampoo is equivalent to running\\nAdafactor in the eigenbasis of Shampoo's preconditioner. This insight leads to\\nthe design of a simpler and computationally efficient algorithm:\\n$\\\\textbf{S}$hampo$\\\\textbf{O}$ with $\\\\textbf{A}$dam in the\\n$\\\\textbf{P}$reconditioner's eigenbasis (SOAP). With regards to improving Shampoo's computational efficiency, the most\\nstraightforward approach would be to simply compute Shampoo's\\neigendecomposition less frequently. 
Unfortunately, as our empirical results\\nshow, this leads to performance degradation that worsens with this frequency.\\nSOAP mitigates this degradation by continually updating the running average of\\nthe second moment, just as Adam does, but in the current (slowly changing)\\ncoordinate basis. Furthermore, since SOAP is equivalent to running Adam in a\\nrotated space, it introduces only one additional hyperparameter (the\\npreconditioning frequency) compared to Adam. We empirically evaluate SOAP on\\nlanguage model pre-training with 360m and 660m sized models. In the large batch\\nregime, SOAP reduces the number of iterations by over 40% and wall clock time\\nby over 35% compared to AdamW, with approximately 20% improvements in both\\nmetrics compared to Shampoo. An implementation of SOAP is available at\\nhttps://github.com/nikhilvyas/SOAP.\",\"PeriodicalId\":501301,\"journal\":{\"name\":\"arXiv - CS - Machine Learning\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Machine Learning\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.11321\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11321","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks include additional hyperparameters and computational overhead when compared to Adam, which only updates running averages of first- and second-moment quantities. This work establishes a formal connection between Shampoo (implemented with the 1/2 power) and Adafactor -- a memory-efficient approximation of Adam -- showing that Shampoo is equivalent to running Adafactor in the eigenbasis of Shampoo's preconditioner. This insight leads to the design of a simpler and computationally efficient algorithm: $\textbf{S}$hampo$\textbf{O}$ with $\textbf{A}$dam in the $\textbf{P}$reconditioner's eigenbasis (SOAP). With regard to improving Shampoo's computational efficiency, the most straightforward approach would be to simply compute Shampoo's eigendecomposition less frequently. Unfortunately, as our empirical results show, this leads to performance degradation that worsens as the eigendecomposition is computed less frequently. SOAP mitigates this degradation by continually updating the running average of the second moment, just as Adam does, but in the current (slowly changing) coordinate basis. Furthermore, since SOAP is equivalent to running Adam in a rotated space, it introduces only one additional hyperparameter (the preconditioning frequency) compared to Adam. We empirically evaluate SOAP on language model pre-training with models of 360M and 660M parameters. In the large batch regime, SOAP reduces the number of iterations by over 40% and wall clock time by over 35% compared to AdamW, with approximately 20% improvements in both metrics compared to Shampoo. An implementation of SOAP is available at https://github.com/nikhilvyas/SOAP.
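The eigenbasis connection described above rests on a simple diagonalization identity. Writing the Kronecker factors of Shampoo's preconditioner as $L = Q_L \Lambda_L Q_L^\top$ and $R = Q_R \Lambda_R Q_R^\top$, a Shampoo-style preconditioned step with factor power $-p$ (the exact exponent used for the Adafactor equivalence is the one stated in the paper) can be rewritten as

$$
L^{-p} \, G \, R^{-p} \;=\; Q_L \left( \Lambda_L^{-p} \, \widehat{G} \, \Lambda_R^{-p} \right) Q_R^\top, \qquad \widehat{G} = Q_L^\top G \, Q_R .
$$

In words: rotate the gradient into the eigenbasis, divide entry $(i,j)$ by $(\lambda^L_i \lambda^R_j)^{p}$ -- a factored, Adafactor-style denominator -- and rotate back. SOAP keeps the rotation but replaces the factored scaling with Adam's elementwise second-moment running average, accumulated directly in the rotated coordinates and refreshed only when the eigenbasis is recomputed.

Below is a minimal, illustrative NumPy sketch of that recipe for a single 2D parameter. The class name, hyperparameter defaults, the `precondition_frequency` value, the omission of weight decay, and the handling of only matrix-shaped parameters are all assumptions made for illustration; the reference implementation is the one linked in the abstract (https://github.com/nikhilvyas/SOAP).

```python
import numpy as np


class SoapSketch:
    """Illustrative sketch: Adam run in the periodically refreshed eigenbasis
    of Shampoo-style Kronecker factors, for a single 2D parameter."""

    def __init__(self, shape, lr=3e-3, betas=(0.95, 0.95), shampoo_beta=0.95,
                 eps=1e-8, precondition_frequency=10):
        m, n = shape
        self.lr, self.eps = lr, eps
        self.beta1, self.beta2 = betas
        self.shampoo_beta = shampoo_beta
        self.freq = precondition_frequency
        self.t = 0
        # Shampoo-style Kronecker factors (left/right second-moment matrices).
        self.L = np.zeros((m, m))
        self.R = np.zeros((n, n))
        # Adam moments, stored in the rotated (eigenbasis) coordinates.
        self.m1 = np.zeros(shape)
        self.m2 = np.zeros(shape)
        # Current eigenbases; the identity means plain Adam until first refresh.
        self.QL = np.eye(m)
        self.QR = np.eye(n)

    def update(self, grad):
        """Return the additive parameter update for gradient `grad`."""
        self.t += 1
        # Accumulate the Kronecker factors, as in Shampoo.
        self.L = self.shampoo_beta * self.L + (1 - self.shampoo_beta) * (grad @ grad.T)
        self.R = self.shampoo_beta * self.R + (1 - self.shampoo_beta) * (grad.T @ grad)
        # Refresh the eigenbasis only every `freq` steps -- the single extra
        # hyperparameter relative to Adam mentioned in the abstract.
        if (self.t - 1) % self.freq == 0:
            self.QL = np.linalg.eigh(self.L)[1]
            self.QR = np.linalg.eigh(self.R)[1]
        # Rotate the gradient into the current (slowly changing) eigenbasis
        # and run an ordinary Adam update there.
        g_rot = self.QL.T @ grad @ self.QR
        self.m1 = self.beta1 * self.m1 + (1 - self.beta1) * g_rot
        self.m2 = self.beta2 * self.m2 + (1 - self.beta2) * g_rot ** 2
        m_hat = self.m1 / (1 - self.beta1 ** self.t)
        v_hat = self.m2 / (1 - self.beta2 ** self.t)
        step_rot = m_hat / (np.sqrt(v_hat) + self.eps)
        # Rotate the Adam step back to the original coordinates.
        return -self.lr * (self.QL @ step_rot @ self.QR.T)


if __name__ == "__main__":
    # Toy usage: a few steps on the quadratic 0.5 * ||W||^2, whose gradient is W.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((4, 3))
    opt = SoapSketch(W.shape)
    for _ in range(200):
        W = W + opt.update(W)
    print(np.linalg.norm(W))  # should be close to zero
```

The only scheduling knob beyond Adam's in this sketch is `precondition_frequency`: between eigenbasis refreshes the optimizer simply keeps running Adam in whatever rotation it last computed, which is what the abstract credits for avoiding the degradation seen when Shampoo's eigendecomposition is merely computed less often.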