Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation

Weiqi Feng, Yangrui Chen, Shaoyu Wang, Yanghua Peng, Haibin Lin, Minlan Yu
{"title":"Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation","authors":"Weiqi Feng, Yangrui Chen, Shaoyu Wang, Yanghua Peng, Haibin Lin, Minlan Yu","doi":"arxiv-2408.03505","DOIUrl":null,"url":null,"abstract":"Multimodal large language models (MLLMs) have extended the success of large\nlanguage models (LLMs) to multiple data types, such as image, text and audio,\nachieving significant performance in various domains, including multimodal\ntranslation, visual question answering and content generation. Nonetheless,\nexisting systems are inefficient to train MLLMs due to substantial GPU bubbles\ncaused by the heterogeneous modality models and complex data dependencies in 3D\nparallelism. This paper proposes Optimus, a distributed MLLM training system\nthat reduces end-to-end MLLM training time. Optimus is based on our principled\nanalysis that scheduling the encoder computation within the LLM bubbles can\nreduce bubbles in MLLM training. To make scheduling encoder computation\npossible for all GPUs, Optimus searches the separate parallel plans for encoder\nand LLM, and adopts a bubble scheduling algorithm to enable exploiting LLM\nbubbles without breaking the original data dependencies in the MLLM model\narchitecture. We further decompose encoder layer computation into a series of\nkernels, and analyze the common bubble pattern of 3D parallelism to carefully\noptimize the sub-millisecond bubble scheduling, minimizing the overall training\ntime. Our experiments in a production cluster show that Optimus accelerates\nMLLM training by 20.5%-21.3% with ViT-22B and GPT-175B model over 3072 GPUs\ncompared to baselines.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"13 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.03505","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Multimodal large language models (MLLMs) have extended the success of large language models (LLMs) to multiple data types, such as image, text, and audio, achieving strong performance in various domains, including multimodal translation, visual question answering, and content generation. Nonetheless, existing systems are inefficient at training MLLMs due to substantial GPU bubbles caused by the heterogeneous modality models and complex data dependencies in 3D parallelism. This paper proposes Optimus, a distributed MLLM training system that reduces end-to-end MLLM training time. Optimus is based on our principled analysis that scheduling the encoder computation within the LLM bubbles can reduce bubbles in MLLM training. To make scheduling encoder computation possible on all GPUs, Optimus searches for separate parallel plans for the encoder and the LLM, and adopts a bubble scheduling algorithm that exploits LLM bubbles without breaking the original data dependencies in the MLLM architecture. We further decompose encoder layer computation into a series of kernels and analyze the common bubble pattern of 3D parallelism to carefully optimize sub-millisecond bubble scheduling, minimizing the overall training time. Our experiments in a production cluster show that Optimus accelerates MLLM training by 20.5%-21.3% with the ViT-22B and GPT-175B models on 3072 GPUs compared to baselines.
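The core idea the abstract describes, filling pipeline bubbles on LLM GPUs with encoder kernels while preserving execution order, can be illustrated with a minimal greedy scheduler. The sketch below is not the paper's implementation; the bubble windows, kernel names, and durations are assumed purely for illustration.

```python
# Minimal sketch of bubble scheduling (illustrative only, not the paper's code).
# Assumption: the bubble windows and kernel durations below are made-up numbers;
# in practice they would come from profiling the 3D-parallel pipeline schedule.
from dataclasses import dataclass


@dataclass
class Bubble:
    start: float  # idle window start on this GPU (ms)
    end: float    # idle window end (ms)


@dataclass
class Kernel:
    name: str
    duration: float  # estimated kernel runtime (ms)


def schedule_into_bubbles(bubbles: list[Bubble],
                          kernels: list[Kernel]) -> list[tuple[str, float]]:
    """Greedily place encoder kernels into bubbles, keeping kernel order.

    Returns (kernel name, start time) pairs; a start time of -1.0 means the
    kernel did not fit in any remaining bubble and runs outside them.
    """
    placements: list[tuple[str, float]] = []
    b = 0
    cursor = bubbles[0].start if bubbles else 0.0
    for k in kernels:
        # Advance to the next bubble until the kernel fits before the bubble ends.
        while b < len(bubbles) and cursor + k.duration > bubbles[b].end:
            b += 1
            if b < len(bubbles):
                cursor = bubbles[b].start
        if b == len(bubbles):
            placements.append((k.name, -1.0))   # no bubble left for this kernel
        else:
            placements.append((k.name, cursor))
            cursor += k.duration
    return placements


if __name__ == "__main__":
    # Hypothetical idle gaps from pipeline warm-up/cool-down and a few encoder kernels.
    bubbles = [Bubble(10.0, 12.5), Bubble(30.0, 31.0)]
    kernels = [Kernel("vit_attn", 0.8), Kernel("vit_mlp_fc1", 0.9),
               Kernel("vit_mlp_fc2", 0.9), Kernel("vit_layernorm", 0.1)]
    for name, start in schedule_into_bubbles(bubbles, kernels):
        print(f"{name}: start={start}")
```

In the actual system, the bubble windows would be derived from the 3D-parallel pipeline schedule, and placement must additionally respect the cross-model dependency that encoder outputs are ready before the LLM stage that consumes them; the sketch only captures the greedy packing step.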