GRIN: GRadient-INformed MoE

Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen

arXiv:2409.12136 (arXiv - CS - Computation and Language), 18 September 2024
Abstract
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing, selectively activating only a small subset of expert modules. However, sparse computation challenges traditional training practices, as discrete expert routing hinders standard backpropagation and thus gradient-based optimization, which are the cornerstone of deep learning. To better pursue the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develop a top-2 16×3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data. Extensive evaluations across diverse tasks demonstrate the potential of GRIN to significantly enhance MoE efficacy, achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.
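To make the routing problem the abstract describes concrete, the sketch below shows a toy top-2 MoE layer in PyTorch where a straight-through-style surrogate supplies the router with gradients despite the discrete expert selection. Everything here is an illustrative assumption: the class name, dimensions, and the straight-through trick are not from the paper, and this is not GRIN's sparse gradient estimator or its parallel dispatch; it only exhibits the non-differentiable routing step that such an estimator targets.

```python
# Minimal, assumed sketch of top-2 expert routing (not the paper's method):
# a straight-through-style surrogate lets gradients reach the router even
# though the top-2 expert selection itself is discrete.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top2MoE(nn.Module):
    """Toy mixture-of-experts layer with top-2 routing."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 16):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)          # (num_tokens, n_experts)
        _, top2_idx = probs.topk(2, dim=-1)                 # discrete top-2 choice
        mask = torch.zeros_like(probs).scatter_(-1, top2_idx, 1.0)

        # Straight-through surrogate: the forward pass uses sparse gates
        # (top-2 probabilities, zero elsewhere); the backward pass treats the
        # gates as the dense softmax, so the router receives a (biased)
        # gradient signal despite the hard selection.
        gates = probs + (mask * probs - probs).detach()

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            routed = gates[:, e] > 0                        # tokens sent to expert e
            if routed.any():
                out[routed] = out[routed] + gates[routed, e].unsqueeze(-1) * expert(x[routed])
        return out


if __name__ == "__main__":
    layer = Top2MoE(d_model=64, d_ff=256, n_experts=16)
    tokens = torch.randn(8, 64)
    y = layer(tokens)
    y.sum().backward()
    print(y.shape, layer.router.weight.grad.abs().sum() > 0)  # router got gradients
```

Per the abstract, GRIN's contribution is to replace this kind of ad-hoc surrogate with sparse gradient estimation for the routing decision, and to configure model parallelism so that no tokens are dropped; the per-expert Python loop above merely stands in for that parallel expert dispatch.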