Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, Benyou Wang
{"title":"LongLLaVA:通过混合架构将多模态 LLM 高效扩展到 1000 张图像","authors":"Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, Benyou Wang","doi":"arxiv-2409.02889","DOIUrl":null,"url":null,"abstract":"Expanding the long-context capabilities of Multi-modal Large Language\nModels~(MLLMs) is crucial for video understanding, high-resolution image\nunderstanding, and multi-modal agents. This involves a series of systematic\noptimizations, including model architecture, data construction and training\nstrategy, particularly addressing challenges such as \\textit{degraded\nperformance with more images} and \\textit{high computational costs}. In this\npaper, we adapt the model architecture to a hybrid of Mamba and Transformer\nblocks, approach data construction with both temporal and spatial dependencies\namong multiple images and employ a progressive training strategy. The released\nmodel \\textbf{LongLLaVA}~(\\textbf{Long}-Context \\textbf{L}arge\n\\textbf{L}anguage \\textbf{a}nd \\textbf{V}ision \\textbf{A}ssistant) is the first\nhybrid MLLM, which achieved a better balance between efficiency and\neffectiveness. LongLLaVA not only achieves competitive results across various\nbenchmarks, but also maintains high throughput and low memory consumption.\nEspecially, it could process nearly a thousand images on a single A100 80GB\nGPU, showing promising application prospects for a wide range of tasks.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"8 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture\",\"authors\":\"Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, Benyou Wang\",\"doi\":\"arxiv-2409.02889\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Expanding the long-context capabilities of Multi-modal Large Language\\nModels~(MLLMs) is crucial for video understanding, high-resolution image\\nunderstanding, and multi-modal agents. This involves a series of systematic\\noptimizations, including model architecture, data construction and training\\nstrategy, particularly addressing challenges such as \\\\textit{degraded\\nperformance with more images} and \\\\textit{high computational costs}. In this\\npaper, we adapt the model architecture to a hybrid of Mamba and Transformer\\nblocks, approach data construction with both temporal and spatial dependencies\\namong multiple images and employ a progressive training strategy. The released\\nmodel \\\\textbf{LongLLaVA}~(\\\\textbf{Long}-Context \\\\textbf{L}arge\\n\\\\textbf{L}anguage \\\\textbf{a}nd \\\\textbf{V}ision \\\\textbf{A}ssistant) is the first\\nhybrid MLLM, which achieved a better balance between efficiency and\\neffectiveness. 
LongLLaVA not only achieves competitive results across various\\nbenchmarks, but also maintains high throughput and low memory consumption.\\nEspecially, it could process nearly a thousand images on a single A100 80GB\\nGPU, showing promising application prospects for a wide range of tasks.\",\"PeriodicalId\":501480,\"journal\":{\"name\":\"arXiv - CS - Multimedia\",\"volume\":\"8 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Multimedia\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.02889\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.02889","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This requires a series of systematic optimizations spanning model architecture, data construction, and training strategy, particularly to address challenges such as degraded performance with more images and high computational costs. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images, and employ a progressive training strategy. The released model, LongLLaVA (Long-Context Large Language and Vision Assistant), is the first hybrid MLLM and achieves a better balance between efficiency and effectiveness. LongLLaVA not only delivers competitive results across various benchmarks but also maintains high throughput and low memory consumption. Notably, it can process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
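
The abstract does not specify how the Mamba and Transformer blocks are combined. As a rough illustration of the general interleaving pattern such hybrids use (cheap linear-cost token mixers for most layers, full attention only occasionally), here is a minimal PyTorch sketch. The `SimpleSSMBlock`, layer counts, dimensions, and interleaving ratio are illustrative assumptions only, not the released LongLLaVA architecture.

```python
# Hypothetical sketch of a Mamba/Transformer hybrid stack (NOT the authors'
# released model). All module names and hyperparameters are assumptions.
import torch
import torch.nn as nn


class SimpleSSMBlock(nn.Module):
    """Stand-in for a Mamba-style block: a gated, depthwise causal-conv
    token mixer whose cost grows linearly with sequence length."""

    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.in_proj = nn.Linear(dim, 2 * dim)
        # Depthwise causal conv approximates a recurrent/SSM token mixer.
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim,
                              padding=kernel_size - 1)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        residual = x
        x = self.norm(x)
        gate, x = self.in_proj(x).chunk(2, dim=-1)
        # Conv over time, then trim the right-side padding to keep causality.
        x = self.conv(x.transpose(1, 2))[..., : residual.size(1)].transpose(1, 2)
        x = torch.nn.functional.silu(gate) * x
        return residual + self.out_proj(x)


class HybridStack(nn.Module):
    """Interleaves linear-cost SSM-style blocks with occasional Transformer
    blocks, the general pattern hybrids use to cut attention memory cost."""

    def __init__(self, dim: int = 512, n_layers: int = 8, attn_every: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            if (i + 1) % attn_every == 0 else SimpleSSMBlock(dim)
            for i in range(n_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


if __name__ == "__main__":
    # e.g. visual tokens from many images, flattened into one long sequence
    tokens = torch.randn(1, 1024, 512)
    print(HybridStack()(tokens).shape)  # torch.Size([1, 1024, 512])
```

The design intuition, under these assumptions, is that most layers avoid the quadratic attention cost over very long multi-image token sequences, while the periodic attention layers retain global mixing across images.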