{"title":"Partial Experts Checkpoint: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training","authors":"Weilin Cai, Le Qin, Jiayi Huang","doi":"arxiv-2408.04307","DOIUrl":null,"url":null,"abstract":"As large language models continue to scale up, the imperative for fault\ntolerance in distributed deep learning systems intensifies, becoming a focal\narea of AI infrastructure research. Checkpoint has emerged as the predominant\nfault tolerance strategy, with extensive studies dedicated to optimizing its\nefficiency. However, the advent of the sparse Mixture-of-Experts (MoE) model\npresents new challenges for traditional checkpoint techniques due to the\nsubstantial increase in model size, despite comparable computational demands to\ndense models. Breaking new ground in the realm of efficient fault tolerance for\nMoE model training, we introduce a novel Partial Experts Checkpoint (PEC)\nmechanism alongside a corresponding PEC fault-tolerant system. Our approach\nstrategically checkpoints a selected subset of experts, thereby significantly\nreducing the checkpoint size for MoE models to a level comparable with that of\ndense models. The empirical analysis on our 8-expert GPT-MoE model demonstrates\nthat the proposed PEC approach facilitates a substantial 54.2% decrease in the\nsize of non-redundant checkpoint (no data-parallel duplication), without\ncompromising the final model quality. Moreover, our PEC fault-tolerant system\nachieves a 76.9% reduction in checkpoint workload per data-parallel distributed\nrank, thereby correspondingly diminishing the checkpointing time and\nfacilitating complete overlap with the training process.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"18 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.04307","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
As large language models continue to scale up, the need for fault tolerance in distributed deep learning systems intensifies, making it a focal area of AI infrastructure research. Checkpointing has emerged as the predominant fault tolerance strategy, with extensive studies dedicated to optimizing its efficiency. However, the advent of the sparse Mixture-of-Experts (MoE) model presents new challenges for traditional checkpointing techniques: model size grows substantially even though computational demands remain comparable to dense models. To enable efficient fault tolerance for MoE model training, we introduce a novel Partial Experts Checkpoint (PEC) mechanism together with a corresponding PEC fault-tolerant system. Our approach checkpoints only a selected subset of experts at each checkpointing step, significantly reducing the checkpoint size of MoE models to a level comparable with that of dense models. Empirical analysis on our 8-expert GPT-MoE model demonstrates that the proposed PEC approach reduces the size of the non-redundant checkpoint (i.e., excluding data-parallel duplication) by 54.2% without compromising final model quality. Moreover, our PEC fault-tolerant system reduces the checkpoint workload per data-parallel rank by 76.9%, correspondingly shortening checkpointing time and enabling it to overlap completely with training.
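
The core idea described in the abstract is to save only a subset of experts per checkpoint while always saving the non-expert (dense) parameters in full. The sketch below illustrates that idea under stated assumptions; it is not the authors' implementation. The parameter-name pattern (".experts.<idx>."), the round-robin selection policy, and the helper name are hypothetical, and the paper's actual expert-selection strategy and system integration are not specified here.

```python
import torch

def partial_experts_checkpoint(model_state, num_experts, step,
                               experts_per_ckpt=2, path="pec_ckpt.pt"):
    """Minimal sketch of a partial-experts checkpoint.

    Assumptions (not from the paper): expert parameters are named with an
    '.experts.<idx>.' segment, and the experts to save are chosen
    round-robin across checkpointing steps. Non-expert parameters
    (attention, embeddings, router, etc.) are always saved in full.
    """
    # Round-robin choice of which experts to include this time (assumption).
    start = (step * experts_per_ckpt) % num_experts
    selected = {(start + i) % num_experts for i in range(experts_per_ckpt)}

    partial_state = {}
    for name, tensor in model_state.items():
        if ".experts." in name:
            # Keep only the parameters of the selected experts.
            expert_idx = int(name.split(".experts.")[1].split(".")[0])
            if expert_idx in selected:
                partial_state[name] = tensor
        else:
            # Dense (non-expert) parameters are checkpointed in full.
            partial_state[name] = tensor

    torch.save({"step": step,
                "selected_experts": sorted(selected),
                "state": partial_state}, path)
```

A caller might invoke `partial_experts_checkpoint(model.state_dict(), num_experts=8, step=global_step)` at each checkpoint interval; recovering a complete model would then combine the most recent partial checkpoints that together cover all experts, which is where the checkpoint-size and per-rank workload savings reported in the abstract come from.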