Q-Bench+: A Benchmark for Multi-Modal Foundation Models on Low-Level Vision From Single Images to Pairs
Zicheng Zhang; Haoning Wu; Erli Zhang; Guangtao Zhai; Weisi Lin
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 10404-10418, 2024
DOI: 10.1109/TPAMI.2024.3445770
https://ieeexplore.ieee.org/document/10643329/
Abstract
The rapid development of Multi-modality Large Language Models (MLLMs) has ushered in a paradigm shift in computer vision, moving towards versatile foundational models. However, evaluating MLLMs on low-level visual perception and understanding remains largely unexplored. To this end, we design benchmark settings that emulate human language responses related to low-level vision: low-level visual perception (A1), via visual question answering on low-level attributes (e.g., clarity, lighting); and low-level visual description (A2), which evaluates MLLMs on low-level text descriptions. Furthermore, given that pairwise comparison better avoids ambiguous responses and is widely adopted in human studies, we extend the low-level perception-related question answering and description evaluations of MLLMs from single images to image pairs. Specifically, for perception (A1), we construct the LLVisionQA+ dataset, comprising 2,990 single images and 1,999 image pairs, each accompanied by an open-ended question about its low-level features; for description (A2), we propose the LLDescribe+ dataset, which evaluates MLLMs on low-level descriptions for 499 single images and 450 image pairs. Additionally, we evaluate MLLMs on assessment (A3) ability, i.e., score prediction, by employing a softmax-based approach that enables all MLLMs to generate quantifiable quality ratings, tested against human opinions on 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than on single-image evaluations (as humans do). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs.
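The abstract only names the softmax-based scoring idea for the assessment (A3) task, so the following is a minimal sketch of that style of approach rather than the paper's exact implementation. It assumes the binary "good"/"poor" formulation popularized by the original Q-Bench work: the model's next-token logits for the two words are compared via a softmax, and the probability of "good" serves as a continuous quality rating. The function name, token ids, and dummy logits below are illustrative placeholders, not values from the paper.

```python
import torch

def softmax_quality_score(next_token_logits: torch.Tensor,
                          good_token_id: int,
                          poor_token_id: int) -> float:
    """Map the logits of the 'good' and 'poor' tokens to a scalar quality score in [0, 1]."""
    # Select only the two logits of interest from the full vocabulary distribution.
    selected = next_token_logits[[good_token_id, poor_token_id]]
    # Softmax over the pair; the probability assigned to "good" is the quality rating.
    probs = torch.softmax(selected, dim=-1)
    return probs[0].item()

# Usage with dummy next-token logits over a toy vocabulary of size 8
# (in practice, these would come from an MLLM prompted to rate image quality,
# with token ids obtained from its tokenizer).
logits = torch.tensor([0.1, 2.3, -0.5, 1.7, 0.0, 0.4, -1.2, 0.9])
score = softmax_quality_score(logits, good_token_id=1, poor_token_id=3)
print(f"predicted quality: {score:.3f}")
```

Because the score is derived from logits rather than sampled text, it is deterministic and quantifiable, which is what allows ratings from very different MLLMs to be correlated against human opinion scores on the IQA datasets.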