LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, Benyou Wang
{"title":"LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture","authors":"Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, Benyou Wang","doi":"arxiv-2409.02889","DOIUrl":null,"url":null,"abstract":"Expanding the long-context capabilities of Multi-modal Large Language\nModels~(MLLMs) is crucial for video understanding, high-resolution image\nunderstanding, and multi-modal agents. This involves a series of systematic\noptimizations, including model architecture, data construction and training\nstrategy, particularly addressing challenges such as \\textit{degraded\nperformance with more images} and \\textit{high computational costs}. In this\npaper, we adapt the model architecture to a hybrid of Mamba and Transformer\nblocks, approach data construction with both temporal and spatial dependencies\namong multiple images and employ a progressive training strategy. The released\nmodel \\textbf{LongLLaVA}~(\\textbf{Long}-Context \\textbf{L}arge\n\\textbf{L}anguage \\textbf{a}nd \\textbf{V}ision \\textbf{A}ssistant) is the first\nhybrid MLLM, which achieved a better balance between efficiency and\neffectiveness. LongLLaVA not only achieves competitive results across various\nbenchmarks, but also maintains high throughput and low memory consumption.\nEspecially, it could process nearly a thousand images on a single A100 80GB\nGPU, showing promising application prospects for a wide range of tasks.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"8 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.02889","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. Doing so requires a series of systematic optimizations spanning model architecture, data construction, and training strategy, and must address challenges such as degraded performance with more images and high computational costs. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, construct data with both temporal and spatial dependencies among multiple images, and employ a progressive training strategy. The released model, LongLLaVA (Long-Context Large Language and Vision Assistant), is the first hybrid MLLM and achieves a better balance between efficiency and effectiveness. LongLLaVA not only delivers competitive results across various benchmarks but also maintains high throughput and low memory consumption. Notably, it can process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
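To make the "hybrid of Mamba and Transformer blocks" concrete, below is a minimal PyTorch sketch of an interleaved layer stack. It is not the authors' implementation: the SimpleMambaBlock is a stand-in (a gated causal depthwise convolution rather than a real selective state-space scan), and the choice of inserting one Transformer block every few Mamba-style blocks is an assumption for illustration only.

```python
# Minimal sketch (not the paper's code): a hybrid decoder stack that
# interleaves Mamba-style SSM blocks with occasional Transformer blocks.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleMambaBlock(nn.Module):
    """Stand-in for a Mamba (selective SSM) block: causal depthwise conv + gating.
    A real implementation would use the selective state-space scan."""

    def __init__(self, d_model: int, d_conv: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        # Depthwise conv with left padding so the block stays causal.
        self.conv = nn.Conv1d(d_model, d_model, d_conv,
                              padding=d_conv - 1, groups=d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        residual = x
        h, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        h = self.conv(h.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)
        h = F.silu(h) * torch.sigmoid(gate)
        return residual + self.out_proj(h)


class HybridLayerStack(nn.Module):
    """Interleaves Mamba-style blocks with a Transformer block every
    `transformer_every` layers (the ratio here is an assumed example)."""

    def __init__(self, d_model=512, n_heads=8, n_layers=8, transformer_every=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            if (i + 1) % transformer_every == 0:
                # In a decoder LLM this attention would use a causal mask;
                # omitted here to keep the sketch short.
                self.layers.append(nn.TransformerEncoderLayer(
                    d_model, n_heads, dim_feedforward=4 * d_model,
                    batch_first=True, norm_first=True))
            else:
                self.layers.append(SimpleMambaBlock(d_model))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


if __name__ == "__main__":
    tokens = torch.randn(1, 1024, 512)        # e.g. many image tokens concatenated
    print(HybridLayerStack()(tokens).shape)   # torch.Size([1, 1024, 512])
```

The design intuition behind such a mix is that SSM-style blocks scale linearly with sequence length, which helps when thousands of image tokens are concatenated, while the interleaved attention layers preserve the in-context retrieval ability that pure SSM stacks tend to lack.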