Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin
{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","authors":"Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin","doi":"arxiv-2409.12191","DOIUrl":null,"url":null,"abstract":"We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL\nmodels that redefines the conventional predetermined-resolution approach in\nvisual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism,\nwhich enables the model to dynamically process images of varying resolutions\ninto different numbers of visual tokens. This approach allows the model to\ngenerate more efficient and accurate visual representations, closely aligning\nwith human perceptual processes. The model also integrates Multimodal Rotary\nPosition Embedding (M-RoPE), facilitating the effective fusion of positional\ninformation across text, images, and videos. We employ a unified paradigm for\nprocessing both images and videos, enhancing the model's visual perception\ncapabilities. To explore the potential of large multimodal models, Qwen2-VL\ninvestigates the scaling laws for large vision-language models (LVLMs). By\nscaling both the model size-with versions at 2B, 8B, and 72B parameters-and the\namount of training data, the Qwen2-VL Series achieves highly competitive\nperformance. Notably, the Qwen2-VL-72B model achieves results comparable to\nleading models such as GPT-4o and Claude3.5-Sonnet across various multimodal\nbenchmarks, outperforming other generalist models. Code is available at\n\\url{https://github.com/QwenLM/Qwen2-VL}.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"26 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.12191","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL
models that redefines the conventional predetermined-resolution approach in
visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism,
which enables the model to dynamically process images of varying resolutions
into different numbers of visual tokens. This approach allows the model to
generate more efficient and accurate visual representations, closely aligning
with human perceptual processes. The model also integrates Multimodal Rotary
Position Embedding (M-RoPE), facilitating the effective fusion of positional
information across text, images, and videos. We employ a unified paradigm for
processing both images and videos, enhancing the model's visual perception
capabilities. To explore the potential of large multimodal models, Qwen2-VL
investigates the scaling laws for large vision-language models (LVLMs). By
scaling both the model size-with versions at 2B, 8B, and 72B parameters-and the
amount of training data, the Qwen2-VL Series achieves highly competitive
performance. Notably, the Qwen2-VL-72B model achieves results comparable to
leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal
benchmarks, outperforming other generalist models. Code is available at
\url{https://github.com/QwenLM/Qwen2-VL}.