Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation

Weiqi Feng, Yangrui Chen, Shaoyu Wang, Yanghua Peng, Haibin Lin, Minlan Yu
{"title":"Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation","authors":"Weiqi Feng, Yangrui Chen, Shaoyu Wang, Yanghua Peng, Haibin Lin, Minlan Yu","doi":"arxiv-2408.03505","DOIUrl":null,"url":null,"abstract":"Multimodal large language models (MLLMs) have extended the success of large\nlanguage models (LLMs) to multiple data types, such as image, text and audio,\nachieving significant performance in various domains, including multimodal\ntranslation, visual question answering and content generation. Nonetheless,\nexisting systems are inefficient to train MLLMs due to substantial GPU bubbles\ncaused by the heterogeneous modality models and complex data dependencies in 3D\nparallelism. This paper proposes Optimus, a distributed MLLM training system\nthat reduces end-to-end MLLM training time. Optimus is based on our principled\nanalysis that scheduling the encoder computation within the LLM bubbles can\nreduce bubbles in MLLM training. To make scheduling encoder computation\npossible for all GPUs, Optimus searches the separate parallel plans for encoder\nand LLM, and adopts a bubble scheduling algorithm to enable exploiting LLM\nbubbles without breaking the original data dependencies in the MLLM model\narchitecture. We further decompose encoder layer computation into a series of\nkernels, and analyze the common bubble pattern of 3D parallelism to carefully\noptimize the sub-millisecond bubble scheduling, minimizing the overall training\ntime. Our experiments in a production cluster show that Optimus accelerates\nMLLM training by 20.5%-21.3% with ViT-22B and GPT-175B model over 3072 GPUs\ncompared to baselines.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"13 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.03505","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Multimodal large language models (MLLMs) have extended the success of large language models (LLMs) to multiple data types, such as image, text, and audio, achieving strong performance in various domains, including multimodal translation, visual question answering, and content generation. Nonetheless, existing systems are inefficient at training MLLMs due to substantial GPU bubbles caused by the heterogeneous modality models and complex data dependencies in 3D parallelism. This paper proposes Optimus, a distributed MLLM training system that reduces end-to-end MLLM training time. Optimus is based on our principled analysis that scheduling the encoder computation within the LLM bubbles can reduce bubbles in MLLM training. To make scheduling encoder computation possible on all GPUs, Optimus searches for separate parallel plans for the encoder and the LLM, and adopts a bubble scheduling algorithm that exploits LLM bubbles without breaking the original data dependencies in the MLLM architecture. We further decompose encoder layer computation into a series of kernels and analyze the common bubble pattern of 3D parallelism to carefully optimize sub-millisecond bubble scheduling, minimizing the overall training time. Our experiments in a production cluster show that Optimus accelerates MLLM training by 20.5%-21.3% with the ViT-22B and GPT-175B models on 3072 GPUs compared to baselines.
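The core idea the abstract describes, filling pipeline bubbles on LLM GPUs with encoder kernels while preserving execution order, can be illustrated with a minimal greedy scheduler. The sketch below is not the paper's implementation; the bubble windows, kernel names, and durations are assumed purely for illustration.

```python
# Minimal sketch of bubble scheduling (illustrative only, not the paper's code).
# Assumption: the bubble windows and kernel durations below are made-up numbers;
# in practice they would come from profiling the 3D-parallel pipeline schedule.
from dataclasses import dataclass


@dataclass
class Bubble:
    start: float  # idle window start on this GPU (ms)
    end: float    # idle window end (ms)


@dataclass
class Kernel:
    name: str
    duration: float  # estimated kernel runtime (ms)


def schedule_into_bubbles(bubbles: list[Bubble],
                          kernels: list[Kernel]) -> list[tuple[str, float]]:
    """Greedily place encoder kernels into bubbles, keeping kernel order.

    Returns (kernel name, start time) pairs; a start time of -1.0 means the
    kernel did not fit in any remaining bubble and runs outside them.
    """
    placements: list[tuple[str, float]] = []
    b = 0
    cursor = bubbles[0].start if bubbles else 0.0
    for k in kernels:
        # Advance to the next bubble until the kernel fits before the bubble ends.
        while b < len(bubbles) and cursor + k.duration > bubbles[b].end:
            b += 1
            if b < len(bubbles):
                cursor = bubbles[b].start
        if b == len(bubbles):
            placements.append((k.name, -1.0))   # no bubble left for this kernel
        else:
            placements.append((k.name, cursor))
            cursor += k.duration
    return placements


if __name__ == "__main__":
    # Hypothetical idle gaps from pipeline warm-up/cool-down and a few encoder kernels.
    bubbles = [Bubble(10.0, 12.5), Bubble(30.0, 31.0)]
    kernels = [Kernel("vit_attn", 0.8), Kernel("vit_mlp_fc1", 0.9),
               Kernel("vit_mlp_fc2", 0.9), Kernel("vit_layernorm", 0.1)]
    for name, start in schedule_into_bubbles(bubbles, kernels):
        print(f"{name}: start={start}")
```

In the actual system, the bubble windows would be derived from the 3D-parallel pipeline schedule, and placement must additionally respect the cross-model dependency that encoder outputs are ready before the LLM stage that consumes them; the sketch only captures the greedy packing step.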