Q-Bench+: A Benchmark for Multi-Modal Foundation Models on Low-Level Vision From Single Images to Pairs
Zicheng Zhang; Haoning Wu; Erli Zhang; Guangtao Zhai; Weisi Lin
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 10404-10418, 2024
DOI: 10.1109/TPAMI.2024.3445770
https://ieeexplore.ieee.org/document/10643329/
Abstract
The rapid development of Multi-modality Large Language Models (MLLMs) has ushered in a paradigm shift in computer vision, moving towards versatile foundational models. However, evaluating MLLMs on low-level visual perception and understanding remains largely unexplored. To this end, we design benchmark settings that emulate human language responses related to low-level vision: low-level visual perception (A1), via visual question answering on low-level attributes (e.g., clarity, lighting); and low-level visual description (A2), which evaluates MLLMs on low-level text descriptions. Furthermore, given that pairwise comparison better avoids ambiguous responses and is widely adopted in human studies, we extend the low-level perception-related question answering and description evaluations of MLLMs from single images to image pairs. Specifically, for perception (A1), we construct the LLVisionQA+ dataset, comprising 2,990 single images and 1,999 image pairs, each accompanied by an open-ended question about its low-level features; for description (A2), we propose the LLDescribe+ dataset, which evaluates MLLMs on low-level descriptions for 499 single images and 450 image pairs. Additionally, we evaluate MLLMs on assessment (A3) ability, i.e., score prediction, by employing a softmax-based approach that enables all MLLMs to generate quantifiable quality ratings, tested against human opinions on 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than on single-image evaluations (as humans do). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs.
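The abstract only names the softmax-based scoring idea for the assessment (A3) task, so the following is a minimal sketch of that style of approach rather than the paper's exact implementation. It assumes the binary "good"/"poor" formulation popularized by the original Q-Bench work: the model's next-token logits for the two words are compared via a softmax, and the probability of "good" serves as a continuous quality rating. The function name, token ids, and dummy logits below are illustrative placeholders, not values from the paper.

```python
import torch

def softmax_quality_score(next_token_logits: torch.Tensor,
                          good_token_id: int,
                          poor_token_id: int) -> float:
    """Map the logits of the 'good' and 'poor' tokens to a scalar quality score in [0, 1]."""
    # Select only the two logits of interest from the full vocabulary distribution.
    selected = next_token_logits[[good_token_id, poor_token_id]]
    # Softmax over the pair; the probability assigned to "good" is the quality rating.
    probs = torch.softmax(selected, dim=-1)
    return probs[0].item()

# Usage with dummy next-token logits over a toy vocabulary of size 8
# (in practice, these would come from an MLLM prompted to rate image quality,
# with token ids obtained from its tokenizer).
logits = torch.tensor([0.1, 2.3, -0.5, 1.7, 0.0, 0.4, -1.2, 0.9])
score = softmax_quality_score(logits, good_token_id=1, poor_token_id=3)
print(f"predicted quality: {score:.3f}")
```

Because the score is derived from logits rather than sampled text, it is deterministic and quantifiable, which is what allows ratings from very different MLLMs to be correlated against human opinion scores on the IQA datasets.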