GRIN: GRadient-INformed MoE

Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen
{"title":"GRIN: GRadient-INformed MoE","authors":"Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen","doi":"arxiv-2409.12136","DOIUrl":null,"url":null,"abstract":"Mixture-of-Experts (MoE) models scale more effectively than dense models due\nto sparse computation through expert routing, selectively activating only a\nsmall subset of expert modules. However, sparse computation challenges\ntraditional training practices, as discrete expert routing hinders standard\nbackpropagation and thus gradient-based optimization, which are the cornerstone\nof deep learning. To better pursue the scaling power of MoE, we introduce GRIN\n(GRadient-INformed MoE training), which incorporates sparse gradient estimation\nfor expert routing and configures model parallelism to avoid token dropping.\nApplying GRIN to autoregressive language modeling, we develop a top-2\n16$\\times$3.8B MoE model. Our model, with only 6.6B activated parameters,\noutperforms a 7B dense model and matches the performance of a 14B dense model\ntrained on the same data. Extensive evaluations across diverse tasks\ndemonstrate the potential of GRIN to significantly enhance MoE efficacy,\nachieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.12136","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing, selectively activating only a small subset of expert modules. However, sparse computation challenges traditional training practices, as discrete expert routing hinders standard backpropagation and thus gradient-based optimization, which are the cornerstones of deep learning. To better pursue the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develop a top-2 16×3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data. Extensive evaluations across diverse tasks demonstrate the potential of GRIN to significantly enhance MoE efficacy, achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.
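
As a point of reference for the routing setup the abstract describes (top-2 routing over 16 experts), the sketch below implements a conventional top-2 MoE layer in PyTorch and marks, in comments, where the discrete routing decision blocks gradient flow; this is the gap that GRIN's sparse gradient estimation targets. The sketch is not the paper's estimator (SparseMixer-v2 is not reproduced here), and all layer sizes, names, and shapes are illustrative assumptions.

```python
# Minimal sketch of conventional top-2 expert routing (NOT the paper's
# SparseMixer-v2 estimator). Comments mark where the discrete routing
# decision blocks gradient flow. Sizes and names are illustrative.
import torch
import torch.nn as nn


class Top2MoELayer(nn.Module):
    def __init__(self, d_model: int = 64, d_ff: int = 256,
                 n_experts: int = 16, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        probs = self.router(x).softmax(dim=-1)           # (n_tokens, n_experts)
        # Discrete top-2 selection: the hard choice itself is not
        # differentiable, so no gradient flows through it.
        top_p, top_idx = probs.topk(self.top_k, dim=-1)  # (n_tokens, top_k)

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = top_idx[:, k] == e                # tokens routed to expert e
                if mask.any():
                    # The soft gate keeps the router trainable via the softmax
                    # probabilities of the chosen experts, but unselected
                    # experts never run and receive no gradient.
                    out[mask] += top_p[mask, k, None] * expert(x[mask])
        return out


# Usage sketch: 8 tokens of width 64; only 2 of the 16 experts run per token.
layer = Top2MoELayer()
y = layer(torch.randn(8, 64))
```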