Uni-MoE: Scaling Unified Multimodal LLMs With Mixture of Experts

Yunxin Li, Shenyuan Jiang, Baotian Hu, Longyue Wang, Wanqi Zhong, Wenhan Luo, Lin Ma, Min Zhang
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 5, pp. 3424-3439
DOI: 10.1109/TPAMI.2025.3532688
Published: 2025-02-13

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) underscore the significance of scalable models and data for boosting performance, yet such scaling often incurs substantial computational costs. Although the Mixture of Experts (MoE) architecture has been employed to scale large language or visual-language models efficiently, these efforts typically involve few experts and limited modalities. To address this, our work presents a pioneering attempt to develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities. Specifically, it features modality-specific encoders with connectors for a unified multimodal representation. We also implement a sparse MoE architecture within the LLM to enable efficient training and inference through modality-level data parallelism and expert-level model parallelism. To enhance multi-expert collaboration and generalization, we present a progressive training strategy: 1) cross-modality alignment using various connectors with different cross-modality data; 2) training modality-specific experts with cross-modality instruction data to activate experts' preferences; and 3) tuning the whole Uni-MoE framework with Low-Rank Adaptation (LoRA) on mixed multimodal instruction data. We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets. The extensive experimental results demonstrate Uni-MoE's principal advantage: significantly reducing performance bias when handling mixed multimodal datasets, alongside improved multi-expert collaboration and generalization.
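The sparse MoE idea the abstract relies on can be illustrated with a minimal sketch: a gate scores every expert for each token, only the top-k experts are activated, and their outputs are mixed by renormalized gate weights. This is a toy illustration, not the paper's implementation; the expert functions and gating vectors below are invented for demonstration.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def sparse_moe(token, experts, gate_weights, k=2):
    """Route one token vector through the top-k experts by gate score."""
    # Gate logits: dot product of the token with each expert's gating vector.
    logits = [sum(t * w for t, w in zip(token, gw)) for gw in gate_weights]
    probs = softmax(logits)
    # Keep only the k highest-scoring experts (sparse activation).
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # Output: renormalized, gate-weighted mix of the chosen experts' outputs.
    out = [0.0] * len(token)
    for i in top:
        y = experts[i](token)
        out = [o + (probs[i] / norm) * yj for o, yj in zip(out, y)]
    return out, top

# Toy setup: four "experts", each a simple elementwise transform.
experts = [
    lambda x: [v * 2 for v in x],
    lambda x: [v + 1 for v in x],
    lambda x: [-v for v in x],
    lambda x: [v * 0.5 for v in x],
]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
out, chosen = sparse_moe([2.0, 1.0], experts, gate_weights, k=2)
print(chosen)  # → [0, 1]: only two of four experts run for this token
```

Because only k of the experts execute per token, parameter count can grow with the number of experts while per-token compute stays roughly constant, which is the efficiency argument the abstract makes.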
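Step 3 of the training strategy tunes the framework with LoRA, whose core mechanism is a frozen weight matrix W adapted by a trainable low-rank product B·A, so the layer computes y = W x + B (A x). The sketch below shows that mechanism in isolation, with toy shapes chosen for demonstration (rank r = 1); it is not the authors' training code.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)             # frozen pretrained path
    delta = matvec(B, matvec(A, x)) # low-rank adapter path
    return [b + scale * d for b, d in zip(base, delta)]

# Toy shapes: d_out = 2, d_in = 3, rank r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen pretrained weight (2x3)
A = [[1.0, 1.0, 1.0]]                   # trainable, r x d_in (1x3)
B = [[0.5], [0.5]]                      # trainable, d_out x r (2x1)
y = lora_forward(W, A, B, [1.0, 2.0, 3.0])
print(y)  # → [4.0, 5.0]
```

The adapter adds only r·(d_in + d_out) trainable parameters per layer instead of d_in·d_out, which is why LoRA makes whole-framework tuning tractable in the final stage.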