NVLM:开放的前沿类多模态 LLMs

Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuoling Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
{"title":"NVLM:开放的前沿类多模态 LLMs","authors":"Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuoling Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping","doi":"arxiv-2409.11402","DOIUrl":null,"url":null,"abstract":"We introduce NVLM 1.0, a family of frontier-class multimodal large language\nmodels (LLMs) that achieve state-of-the-art results on vision-language tasks,\nrivaling the leading proprietary models (e.g., GPT-4o) and open-access models\n(e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved\ntext-only performance over its LLM backbone after multimodal training. In terms\nof model design, we perform a comprehensive comparison between decoder-only\nmultimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g.,\nFlamingo). Based on the strengths and weaknesses of both approaches, we propose\na novel architecture that enhances both training efficiency and multimodal\nreasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for\ntile-based dynamic high-resolution images, which significantly boosts\nperformance on multimodal reasoning and OCR-related tasks. Regarding training\ndata, we meticulously curate and provide detailed information on our multimodal\npretraining and supervised fine-tuning datasets. Our findings indicate that\ndataset quality and task diversity are more important than scale, even during\nthe pretraining phase, across all architectures. Notably, we develop\nproduction-grade multimodality for the NVLM-1.0 models, enabling them to excel\nin vision-language tasks while maintaining and even improving text-only\nperformance compared to their LLM backbones. To achieve this, we craft and\nintegrate a high-quality text-only dataset into multimodal training, alongside\na substantial amount of multimodal math and reasoning data, leading to enhanced\nmath and coding capabilities across modalities. To advance research in the\nfield, we are releasing the model weights and will open-source the code for the\ncommunity: https://nvlm-project.github.io/.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"5 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"NVLM: Open Frontier-Class Multimodal LLMs\",\"authors\":\"Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuoling Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping\",\"doi\":\"arxiv-2409.11402\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We introduce NVLM 1.0, a family of frontier-class multimodal large language\\nmodels (LLMs) that achieve state-of-the-art results on vision-language tasks,\\nrivaling the leading proprietary models (e.g., GPT-4o) and open-access models\\n(e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved\\ntext-only performance over its LLM backbone after multimodal training. In terms\\nof model design, we perform a comprehensive comparison between decoder-only\\nmultimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g.,\\nFlamingo). Based on the strengths and weaknesses of both approaches, we propose\\na novel architecture that enhances both training efficiency and multimodal\\nreasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for\\ntile-based dynamic high-resolution images, which significantly boosts\\nperformance on multimodal reasoning and OCR-related tasks. Regarding training\\ndata, we meticulously curate and provide detailed information on our multimodal\\npretraining and supervised fine-tuning datasets. Our findings indicate that\\ndataset quality and task diversity are more important than scale, even during\\nthe pretraining phase, across all architectures. Notably, we develop\\nproduction-grade multimodality for the NVLM-1.0 models, enabling them to excel\\nin vision-language tasks while maintaining and even improving text-only\\nperformance compared to their LLM backbones. To achieve this, we craft and\\nintegrate a high-quality text-only dataset into multimodal training, alongside\\na substantial amount of multimodal math and reasoning data, leading to enhanced\\nmath and coding capabilities across modalities. To advance research in the\\nfield, we are releasing the model weights and will open-source the code for the\\ncommunity: https://nvlm-project.github.io/.\",\"PeriodicalId\":501480,\"journal\":{\"name\":\"arXiv - CS - Multimedia\",\"volume\":\"5 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Multimedia\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.11402\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11402","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

我们介绍了前沿级多模态大型语言模型(LLM)系列 NVLM 1.0,它在视觉语言任务上取得了最先进的结果,可与领先的专有模型(如 GPT-4o)和开放存取模型(如 Llama 3-V 405B 和 InternVL 2)相媲美。值得注意的是,经过多模态训练后,NVLM 1.0 的纯文本性能比其 LLM 骨干模型有所提高。在模型设计方面,我们对纯解码器多模态 LLM(如 LLaVA)和基于交叉注意力的模型(如 Flamingo)进行了全面比较。基于这两种方法的优缺点,我们提出了一种新型架构,既提高了训练效率,又增强了多模态推理能力。此外,我们还引入了基于瓦片的一维动态高分辨率图像标记设计,从而显著提高了多模态推理和 OCR 相关任务的性能。在训练数据方面,我们精心策划并提供了多模态训练数据集和监督微调数据集的详细信息。我们的研究结果表明,在所有架构中,数据集的质量和任务多样性比规模更重要,即使在预训练阶段也是如此。值得注意的是,我们为 NVLM-1.0 模型开发了生产级多模态模型,使其能够胜任视觉-语言任务,同时保持甚至提高了纯文本性能(与 LLM 骨干模型相比)。为了实现这一目标,我们制作了一个高质量的纯文本数据集,并将其与大量多模态数学和推理数据整合到多模态训练中,从而增强了跨模态的数学和编码能力。为了推动该领域的研究,我们将发布模型权重,并为社区开源代码:https://nvlm-project.github.io/。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
NVLM: Open Frontier-Class Multimodal LLMs
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we are releasing the model weights and will open-source the code for the community: https://nvlm-project.github.io/.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Vista3D: Unravel the 3D Darkside of a Single Image MoRAG -- Multi-Fusion Retrieval Augmented Generation for Human Motion Efficient Low-Resolution Face Recognition via Bridge Distillation Enhancing Few-Shot Classification without Forgetting through Multi-Level Contrastive Constraints NVLM: Open Frontier-Class Multimodal LLMs
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1