NVLM: Open Frontier-Class Multimodal LLMs

Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
{"title":"NVLM: Open Frontier-Class Multimodal LLMs","authors":"Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuoling Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping","doi":"arxiv-2409.11402","DOIUrl":null,"url":null,"abstract":"We introduce NVLM 1.0, a family of frontier-class multimodal large language\nmodels (LLMs) that achieve state-of-the-art results on vision-language tasks,\nrivaling the leading proprietary models (e.g., GPT-4o) and open-access models\n(e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved\ntext-only performance over its LLM backbone after multimodal training. In terms\nof model design, we perform a comprehensive comparison between decoder-only\nmultimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g.,\nFlamingo). Based on the strengths and weaknesses of both approaches, we propose\na novel architecture that enhances both training efficiency and multimodal\nreasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for\ntile-based dynamic high-resolution images, which significantly boosts\nperformance on multimodal reasoning and OCR-related tasks. Regarding training\ndata, we meticulously curate and provide detailed information on our multimodal\npretraining and supervised fine-tuning datasets. Our findings indicate that\ndataset quality and task diversity are more important than scale, even during\nthe pretraining phase, across all architectures. Notably, we develop\nproduction-grade multimodality for the NVLM-1.0 models, enabling them to excel\nin vision-language tasks while maintaining and even improving text-only\nperformance compared to their LLM backbones. To achieve this, we craft and\nintegrate a high-quality text-only dataset into multimodal training, alongside\na substantial amount of multimodal math and reasoning data, leading to enhanced\nmath and coding capabilities across modalities. To advance research in the\nfield, we are releasing the model weights and will open-source the code for the\ncommunity: https://nvlm-project.github.io/.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"5 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11402","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we are releasing the model weights and will open-source the code for the community: https://nvlm-project.github.io/.
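To make the tile-tagging idea in the abstract concrete, the sketch below shows one way 1-D text tags could be interleaved with per-tile visual tokens for a tile-based dynamic high-resolution image. This is a minimal illustration rather than the released NVLM implementation: the helper names (`encode_tile`, `tokenize`) and the exact tag strings are assumptions made here for readability.

```python
# A minimal sketch of 1-D tile tagging (illustrative, not NVLM's released code).
# A dynamic high-resolution image is split into a global thumbnail plus N tiles;
# the flattened visual tokens of each tile are preceded by a plain-text tag such
# as <tile_1>, so the LLM can tell which tile a run of image tokens came from.
# `encode_tile` (vision encoder + projector) and `tokenize` (text tokenizer)
# are hypothetical stand-ins.

from typing import Any, Callable, List, Sequence


def build_tile_tagged_sequence(
    thumbnail: Any,
    tiles: Sequence[Any],
    encode_tile: Callable[[Any], List[int]],
    tokenize: Callable[[str], List[int]],
) -> List[int]:
    """Interleave 1-D text tile tags with per-tile visual tokens."""
    sequence: List[int] = []

    # Global low-resolution view first, with its own tag.
    sequence += tokenize("<tile_global_thumbnail>")
    sequence += encode_tile(thumbnail)

    # Each high-resolution tile, tagged by its 1-D index.
    for i, tile in enumerate(tiles, start=1):
        sequence += tokenize(f"<tile_{i}>")
        sequence += encode_tile(tile)

    return sequence
```

The tags act as lightweight positional anchors in the text stream, which is one plausible reason such a design helps OCR-related and multimodal reasoning tasks where the model must keep track of where in the original image a tile came from.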