Q-Bench$^{+}$: A Benchmark for Multi-Modal Foundation Models on Low-Level Vision From Single Images to Pairs

Zicheng Zhang, Haoning Wu, Erli Zhang, Guangtao Zhai, Weisi Lin
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 10404-10418
DOI: 10.1109/TPAMI.2024.3445770
Publication date: 2024-08-21
URL: https://ieeexplore.ieee.org/document/10643329/

Abstract

The rapid development of Multi-modality Large Language Models (MLLMs) has driven a paradigm shift in computer vision, moving towards versatile foundational models. However, evaluating MLLMs on low-level visual perception and understanding remains a largely unexplored domain. To this end, we design benchmark settings to emulate human language responses related to low-level vision: low-level visual perception (A1), via visual question answering on low-level attributes (e.g., clarity, lighting); and low-level visual description (A2), which evaluates MLLMs on low-level text descriptions. Furthermore, given that pairwise comparison better avoids ambiguity of responses and has been adopted in many human experiments, we extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs. Specifically, for perception (A1), we construct the LLVisionQA$^{+}$ dataset, comprising 2,990 single images and 1,999 image pairs, each accompanied by an open-ended question about its low-level features; for description (A2), we propose the LLDescribe$^{+}$ dataset, evaluating MLLMs on low-level descriptions for 499 single images and 450 pairs. Additionally, we evaluate MLLMs on assessment (A3) ability, i.e., score prediction, by employing a softmax-based approach that enables all MLLMs to generate quantifiable quality ratings, tested against human opinions on 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than on single-image evaluations (as humans do). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs.
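The softmax-based scoring idea for the assessment task (A3) can be sketched as follows. This is a minimal illustration under an assumption common in this line of work: the model's output logits for opposing rating tokens (here "good" vs. "poor") at the answer position are converted into a scalar quality score via a two-way softmax. The function name and logit values are hypothetical, not the paper's exact implementation.

```python
import math

def softmax_quality_score(logit_good: float, logit_poor: float) -> float:
    """Map the logits of the tokens 'good' and 'poor' to a quality
    score in [0, 1], taken as the softmax probability of 'good'."""
    m = max(logit_good, logit_poor)  # subtract the max for numerical stability
    e_good = math.exp(logit_good - m)
    e_poor = math.exp(logit_poor - m)
    return e_good / (e_good + e_poor)

# Equal logits yield a neutral 0.5; a higher 'good' logit pushes
# the score toward 1, giving a continuous rating instead of a
# binary good/poor answer.
print(softmax_quality_score(2.0, 2.0))  # 0.5
print(softmax_quality_score(3.0, 1.0))
```

Because the score is a probability rather than a sampled token, it is quantifiable and directly comparable against human mean opinion scores.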
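Testing predicted ratings "against human opinions" on IQA datasets is conventionally done with rank correlation between model scores and mean opinion scores (MOS). A stdlib-only sketch of the Spearman rank correlation coefficient (SRCC), with hypothetical score arrays; real evaluations would use a library routine with full tie handling:

```python
import math

def _ranks(values):
    """Assign ranks (1-based), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def srcc(pred, mos):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rp, rm = _ranks(pred), _ranks(mos)
    n = len(pred)
    mp, mm = sum(rp) / n, sum(rm) / n
    cov = sum((a - mp) * (b - mm) for a, b in zip(rp, rm))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_m = sum((b - mm) ** 2 for b in rm)
    return cov / math.sqrt(var_p * var_m)

# Perfectly monotone predictions give SRCC = 1.0.
print(srcc([0.2, 0.5, 0.9], [30.0, 55.0, 80.0]))  # 1.0
```

An SRCC near 1 means the model orders images by quality the way human raters do, even if the raw score scales differ.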
The datasets will be released at https://github.com/Q-Future/Q-Bench.