{"title":"部分专家检查点:稀疏专家混合模型训练的高效容错能力","authors":"Weilin Cai, Le Qin, Jiayi Huang","doi":"arxiv-2408.04307","DOIUrl":null,"url":null,"abstract":"As large language models continue to scale up, the imperative for fault\ntolerance in distributed deep learning systems intensifies, becoming a focal\narea of AI infrastructure research. Checkpoint has emerged as the predominant\nfault tolerance strategy, with extensive studies dedicated to optimizing its\nefficiency. However, the advent of the sparse Mixture-of-Experts (MoE) model\npresents new challenges for traditional checkpoint techniques due to the\nsubstantial increase in model size, despite comparable computational demands to\ndense models. Breaking new ground in the realm of efficient fault tolerance for\nMoE model training, we introduce a novel Partial Experts Checkpoint (PEC)\nmechanism alongside a corresponding PEC fault-tolerant system. Our approach\nstrategically checkpoints a selected subset of experts, thereby significantly\nreducing the checkpoint size for MoE models to a level comparable with that of\ndense models. The empirical analysis on our 8-expert GPT-MoE model demonstrates\nthat the proposed PEC approach facilitates a substantial 54.2% decrease in the\nsize of non-redundant checkpoint (no data-parallel duplication), without\ncompromising the final model quality. Moreover, our PEC fault-tolerant system\nachieves a 76.9% reduction in checkpoint workload per data-parallel distributed\nrank, thereby correspondingly diminishing the checkpointing time and\nfacilitating complete overlap with the training process.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"18 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Partial Experts Checkpoint: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training\",\"authors\":\"Weilin Cai, Le Qin, Jiayi Huang\",\"doi\":\"arxiv-2408.04307\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As large language models continue to scale up, the imperative for fault\\ntolerance in distributed deep learning systems intensifies, becoming a focal\\narea of AI infrastructure research. Checkpoint has emerged as the predominant\\nfault tolerance strategy, with extensive studies dedicated to optimizing its\\nefficiency. However, the advent of the sparse Mixture-of-Experts (MoE) model\\npresents new challenges for traditional checkpoint techniques due to the\\nsubstantial increase in model size, despite comparable computational demands to\\ndense models. Breaking new ground in the realm of efficient fault tolerance for\\nMoE model training, we introduce a novel Partial Experts Checkpoint (PEC)\\nmechanism alongside a corresponding PEC fault-tolerant system. Our approach\\nstrategically checkpoints a selected subset of experts, thereby significantly\\nreducing the checkpoint size for MoE models to a level comparable with that of\\ndense models. The empirical analysis on our 8-expert GPT-MoE model demonstrates\\nthat the proposed PEC approach facilitates a substantial 54.2% decrease in the\\nsize of non-redundant checkpoint (no data-parallel duplication), without\\ncompromising the final model quality. 
Moreover, our PEC fault-tolerant system\\nachieves a 76.9% reduction in checkpoint workload per data-parallel distributed\\nrank, thereby correspondingly diminishing the checkpointing time and\\nfacilitating complete overlap with the training process.\",\"PeriodicalId\":501422,\"journal\":{\"name\":\"arXiv - CS - Distributed, Parallel, and Cluster Computing\",\"volume\":\"18 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Distributed, Parallel, and Cluster Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.04307\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.04307","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Partial Experts Checkpoint: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training
As large language models continue to scale up, the imperative for fault
tolerance in distributed deep learning systems intensifies, becoming a focal
area of AI infrastructure research. Checkpointing has emerged as the predominant
fault tolerance strategy, with extensive studies dedicated to optimizing its
efficiency. However, the advent of the sparse Mixture-of-Experts (MoE) model
presents new challenges for traditional checkpoint techniques due to the
substantial increase in model size, despite comparable computational demands to
dense models. Breaking new ground in the realm of efficient fault tolerance for
MoE model training, we introduce a novel Partial Experts Checkpoint (PEC)
mechanism alongside a corresponding PEC fault-tolerant system. Our approach
strategically checkpoints a selected subset of experts, thereby significantly
reducing the checkpoint size for MoE models to a level comparable with that of
dense models. Empirical analysis of our 8-expert GPT-MoE model demonstrates
that the proposed PEC approach yields a substantial 54.2% reduction in the
size of the non-redundant checkpoint (i.e., excluding data-parallel
duplication), without compromising final model quality. Moreover, our PEC
fault-tolerant system achieves a 76.9% reduction in checkpoint workload per
data-parallel rank, thereby proportionally reducing checkpointing time and
enabling complete overlap with the training process.
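
To make the partial-experts idea concrete, the following is a minimal sketch of how a training step might persist only a rotating subset of expert parameters and split that work across data-parallel ranks. This is our own illustration under stated assumptions, not the authors' implementation: the round-robin selection policy, the function names, and the use of torch.save as the persistence backend are hypothetical.

# Illustrative sketch of a Partial Experts Checkpoint (PEC) step.
# Assumptions (not from the paper): round-robin expert selection and
# torch.save as the persistence backend.
import torch
import torch.nn as nn

def build_moe_experts(num_experts: int = 8, d_model: int = 16) -> nn.ModuleList:
    # A toy MoE layer: a list of expert FFNs. The dense, non-expert parts of
    # the model would still be checkpointed in full as usual.
    return nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))

def partial_experts_checkpoint(experts: nn.ModuleList, step: int,
                               experts_per_checkpoint: int, dp_rank: int,
                               dp_world_size: int, prefix: str = "pec") -> None:
    # Save only a subset of experts at this checkpoint, rotating the subset
    # so every expert is covered over a window of consecutive checkpoints.
    num_experts = len(experts)
    start = (step * experts_per_checkpoint) % num_experts
    selected = [(start + i) % num_experts for i in range(experts_per_checkpoint)]
    # Shard the selected experts across data-parallel ranks so each rank
    # persists only its share, shrinking per-rank checkpoint workload and
    # letting the writes overlap with ongoing training.
    my_share = selected[dp_rank::dp_world_size]
    shard = {f"expert.{idx}": experts[idx].state_dict() for idx in my_share}
    torch.save(shard, f"{prefix}_step{step}_rank{dp_rank}.pt")

if __name__ == "__main__":
    experts = build_moe_experts(num_experts=8)
    # Saving 2 of 8 experts per checkpoint shrinks the expert portion of each
    # snapshot by 75% in this toy setting; the abstract reports a 54.2%
    # reduction in total non-redundant checkpoint size for the 8-expert
    # GPT-MoE model.
    partial_experts_checkpoint(experts, step=0, experts_per_checkpoint=2,
                               dp_rank=0, dp_world_size=2)

On failure, recovery would combine the regular dense-model checkpoint with the most recently saved shard of each expert; per the abstract, this does not compromise final model quality in the authors' experiments.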