Q-Bench+: A Benchmark for Multi-Modal Foundation Models on Low-Level Vision From Single Images to Pairs

Zicheng Zhang;Haoning Wu;Erli Zhang;Guangtao Zhai;Weisi Lin
{"title":"Q-BENCH:从单一图像到成对图像的低级视觉多模式基础模型基准。","authors":"Zicheng Zhang;Haoning Wu;Erli Zhang;Guangtao Zhai;Weisi Lin","doi":"10.1109/TPAMI.2024.3445770","DOIUrl":null,"url":null,"abstract":"The rapid development of Multi-modality Large Language Models (MLLMs) has navigated a paradigm shift in computer vision, moving towards versatile foundational models. However, evaluating MLLMs in \n<i>low-level visual perception and understanding</i>\n remains a yet-to-explore domain. To this end, we design benchmark settings to \n<i>emulate human language responses</i>\n related to low-level vision: the low-level visual \n<i>perception</i>\n (\n<u>A1</u>\n) \n<i>via</i>\n visual question answering related to low-level attributes (\n<i>e.g. clarity, lighting</i>\n); and the low-level visual \n<i>description</i>\n (\n<u>A2</u>\n), on evaluating MLLMs for low-level text descriptions. Furthermore, given that pairwise comparison can better avoid ambiguity of responses and has been adopted by many human experiments, we further extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to \n<i>image pairs</i>\n. Specifically, for \n<i>perception</i>\n (A1), we carry out the LLVisionQA\n<inline-formula><tex-math>$^{+}$</tex-math></inline-formula>\n dataset, comprising 2,990 single images and 1,999 image pairs each accompanied by an open-ended question about its low-level features; for \n<bold/>\n<i>description</i>\n<bold/>\n (A2), we propose the LLDescribe\n<inline-formula><tex-math>$^{+}$</tex-math></inline-formula>\n dataset, evaluating MLLMs for low-level descriptions on 499 single images and 450 pairs. Additionally, we evaluate MLLMs on \n<bold/>\n<i>assessment</i>\n<bold/>\n (A3) ability, \n<i>i.e.</i>\n predicting score, by employing a softmax-based approach to enable all MLLMs to generate \n<i>quantifiable</i>\n quality ratings, tested against human opinions in 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than single image evaluations (\n<i>like humans</i>\n). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10404-10418"},"PeriodicalIF":0.0000,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Q-Bench$^+$+: A Benchmark for Multi-Modal Foundation Models on Low-Level Vision From Single Images to Pairs\",\"authors\":\"Zicheng Zhang;Haoning Wu;Erli Zhang;Guangtao Zhai;Weisi Lin\",\"doi\":\"10.1109/TPAMI.2024.3445770\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The rapid development of Multi-modality Large Language Models (MLLMs) has navigated a paradigm shift in computer vision, moving towards versatile foundational models. However, evaluating MLLMs in \\n<i>low-level visual perception and understanding</i>\\n remains a yet-to-explore domain. To this end, we design benchmark settings to \\n<i>emulate human language responses</i>\\n related to low-level vision: the low-level visual \\n<i>perception</i>\\n (\\n<u>A1</u>\\n) \\n<i>via</i>\\n visual question answering related to low-level attributes (\\n<i>e.g. 
clarity, lighting</i>\\n); and the low-level visual \\n<i>description</i>\\n (\\n<u>A2</u>\\n), on evaluating MLLMs for low-level text descriptions. Furthermore, given that pairwise comparison can better avoid ambiguity of responses and has been adopted by many human experiments, we further extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to \\n<i>image pairs</i>\\n. Specifically, for \\n<i>perception</i>\\n (A1), we carry out the LLVisionQA\\n<inline-formula><tex-math>$^{+}$</tex-math></inline-formula>\\n dataset, comprising 2,990 single images and 1,999 image pairs each accompanied by an open-ended question about its low-level features; for \\n<bold/>\\n<i>description</i>\\n<bold/>\\n (A2), we propose the LLDescribe\\n<inline-formula><tex-math>$^{+}$</tex-math></inline-formula>\\n dataset, evaluating MLLMs for low-level descriptions on 499 single images and 450 pairs. Additionally, we evaluate MLLMs on \\n<bold/>\\n<i>assessment</i>\\n<bold/>\\n (A3) ability, \\n<i>i.e.</i>\\n predicting score, by employing a softmax-based approach to enable all MLLMs to generate \\n<i>quantifiable</i>\\n quality ratings, tested against human opinions in 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than single image evaluations (\\n<i>like humans</i>\\n). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs.\",\"PeriodicalId\":94034,\"journal\":{\"name\":\"IEEE transactions on pattern analysis and machine intelligence\",\"volume\":\"46 12\",\"pages\":\"10404-10418\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on pattern analysis and machine intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10643329/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10643329/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

The rapid development of Multi-modality Large Language Models (MLLMs) has driven a paradigm shift in computer vision towards versatile foundational models. However, evaluating MLLMs on low-level visual perception and understanding remains a largely unexplored domain. To this end, we design benchmark settings that emulate human language responses related to low-level vision: low-level visual perception (A1), via visual question answering on low-level attributes (e.g., clarity, lighting); and low-level visual description (A2), which evaluates MLLMs on low-level text descriptions. Furthermore, given that pairwise comparison better avoids ambiguous responses and has been adopted in many human studies, we extend the low-level perception-related question answering and description evaluations of MLLMs from single images to image pairs. Specifically, for perception (A1), we construct the LLVisionQA+ dataset, comprising 2,990 single images and 1,999 image pairs, each accompanied by an open-ended question about its low-level features; for description (A2), we propose the LLDescribe+ dataset, evaluating MLLMs on low-level descriptions of 499 single images and 450 image pairs. Additionally, we evaluate MLLMs on assessment (A3) ability, i.e., score prediction, by employing a softmax-based approach that enables all MLLMs to generate quantifiable quality ratings, tested against human opinions on 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than on single-image evaluations (like humans). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs. The datasets will be released at https://github.com/Q-Future/Q-Bench.
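For the assessment task (A3), the abstract only names a softmax-based approach for turning MLLM outputs into quantifiable quality ratings. The sketch below illustrates one common way such a score can be derived, under the assumption (not stated in the abstract) that the model is prompted to rate image quality and the logits of answer tokens such as "good" and "poor" are read out; the exact prompt and token set used by Q-Bench+ may differ.

```python
# Minimal sketch of a softmax-based quality scoring scheme for A3.
# Assumption: the MLLM is asked something like "Rate the quality of this image."
# and we read the logits it assigns to the answer tokens "good" and "poor";
# the probability mass on "good" serves as the quantifiable quality rating.
import math

def softmax_quality_score(logit_good: float, logit_poor: float) -> float:
    """Map two answer-token logits to a quality rating in [0, 1]."""
    # Two-way softmax: p(good) = exp(z_good) / (exp(z_good) + exp(z_poor)).
    return math.exp(logit_good) / (math.exp(logit_good) + math.exp(logit_poor))

# Hypothetical logits from one forward pass of an MLLM; the resulting scores
# can then be correlated (e.g., SRCC/PLCC) with human opinions on IQA datasets.
print(softmax_quality_score(logit_good=2.1, logit_poor=0.4))  # ~0.85
```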