Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuoling Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
{"title":"NVLM:开放的前沿类多模态 LLMs","authors":"Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuoling Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping","doi":"arxiv-2409.11402","DOIUrl":null,"url":null,"abstract":"We introduce NVLM 1.0, a family of frontier-class multimodal large language\nmodels (LLMs) that achieve state-of-the-art results on vision-language tasks,\nrivaling the leading proprietary models (e.g., GPT-4o) and open-access models\n(e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved\ntext-only performance over its LLM backbone after multimodal training. In terms\nof model design, we perform a comprehensive comparison between decoder-only\nmultimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g.,\nFlamingo). Based on the strengths and weaknesses of both approaches, we propose\na novel architecture that enhances both training efficiency and multimodal\nreasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for\ntile-based dynamic high-resolution images, which significantly boosts\nperformance on multimodal reasoning and OCR-related tasks. Regarding training\ndata, we meticulously curate and provide detailed information on our multimodal\npretraining and supervised fine-tuning datasets. Our findings indicate that\ndataset quality and task diversity are more important than scale, even during\nthe pretraining phase, across all architectures. Notably, we develop\nproduction-grade multimodality for the NVLM-1.0 models, enabling them to excel\nin vision-language tasks while maintaining and even improving text-only\nperformance compared to their LLM backbones. To achieve this, we craft and\nintegrate a high-quality text-only dataset into multimodal training, alongside\na substantial amount of multimodal math and reasoning data, leading to enhanced\nmath and coding capabilities across modalities. To advance research in the\nfield, we are releasing the model weights and will open-source the code for the\ncommunity: https://nvlm-project.github.io/.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"5 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"NVLM: Open Frontier-Class Multimodal LLMs\",\"authors\":\"Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuoling Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping\",\"doi\":\"arxiv-2409.11402\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We introduce NVLM 1.0, a family of frontier-class multimodal large language\\nmodels (LLMs) that achieve state-of-the-art results on vision-language tasks,\\nrivaling the leading proprietary models (e.g., GPT-4o) and open-access models\\n(e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved\\ntext-only performance over its LLM backbone after multimodal training. In terms\\nof model design, we perform a comprehensive comparison between decoder-only\\nmultimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g.,\\nFlamingo). Based on the strengths and weaknesses of both approaches, we propose\\na novel architecture that enhances both training efficiency and multimodal\\nreasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for\\ntile-based dynamic high-resolution images, which significantly boosts\\nperformance on multimodal reasoning and OCR-related tasks. Regarding training\\ndata, we meticulously curate and provide detailed information on our multimodal\\npretraining and supervised fine-tuning datasets. Our findings indicate that\\ndataset quality and task diversity are more important than scale, even during\\nthe pretraining phase, across all architectures. Notably, we develop\\nproduction-grade multimodality for the NVLM-1.0 models, enabling them to excel\\nin vision-language tasks while maintaining and even improving text-only\\nperformance compared to their LLM backbones. To achieve this, we craft and\\nintegrate a high-quality text-only dataset into multimodal training, alongside\\na substantial amount of multimodal math and reasoning data, leading to enhanced\\nmath and coding capabilities across modalities. To advance research in the\\nfield, we are releasing the model weights and will open-source the code for the\\ncommunity: https://nvlm-project.github.io/.\",\"PeriodicalId\":501480,\"journal\":{\"name\":\"arXiv - CS - Multimedia\",\"volume\":\"5 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Multimedia\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.11402\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11402","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
We introduce NVLM 1.0, a family of frontier-class multimodal large language
models (LLMs) that achieve state-of-the-art results on vision-language tasks,
rivaling the leading proprietary models (e.g., GPT-4o) and open-access models
(e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved
text-only performance over its LLM backbone after multimodal training. In terms
of model design, we perform a comprehensive comparison between decoder-only
multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g.,
Flamingo). Based on the strengths and weaknesses of both approaches, we propose
a novel architecture that enhances both training efficiency and multimodal
reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for
tile-based dynamic high-resolution images, which significantly boosts
performance on multimodal reasoning and OCR-related tasks. Regarding training
data, we meticulously curate and provide detailed information on our multimodal
pretraining and supervised fine-tuning datasets. Our findings indicate that
dataset quality and task diversity are more important than scale, even during
the pretraining phase, across all architectures. Notably, we develop
production-grade multimodality for the NVLM-1.0 models, enabling them to excel
in vision-language tasks while maintaining and even improving text-only
performance compared to their LLM backbones. To achieve this, we craft and
integrate a high-quality text-only dataset into multimodal training, alongside
a substantial amount of multimodal math and reasoning data, leading to enhanced
math and coding capabilities across modalities. To advance research in the
field, we are releasing the model weights and will open-source the code for the
community: https://nvlm-project.github.io/.