Zicheng Zhang;Haoning Wu;Erli Zhang;Guangtao Zhai;Weisi Lin
{"title":"Q-BENCH:从单一图像到成对图像的低级视觉多模式基础模型基准。","authors":"Zicheng Zhang;Haoning Wu;Erli Zhang;Guangtao Zhai;Weisi Lin","doi":"10.1109/TPAMI.2024.3445770","DOIUrl":null,"url":null,"abstract":"The rapid development of Multi-modality Large Language Models (MLLMs) has navigated a paradigm shift in computer vision, moving towards versatile foundational models. However, evaluating MLLMs in \n<i>low-level visual perception and understanding</i>\n remains a yet-to-explore domain. To this end, we design benchmark settings to \n<i>emulate human language responses</i>\n related to low-level vision: the low-level visual \n<i>perception</i>\n (\n<u>A1</u>\n) \n<i>via</i>\n visual question answering related to low-level attributes (\n<i>e.g. clarity, lighting</i>\n); and the low-level visual \n<i>description</i>\n (\n<u>A2</u>\n), on evaluating MLLMs for low-level text descriptions. Furthermore, given that pairwise comparison can better avoid ambiguity of responses and has been adopted by many human experiments, we further extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to \n<i>image pairs</i>\n. Specifically, for \n<i>perception</i>\n (A1), we carry out the LLVisionQA\n<inline-formula><tex-math>$^{+}$</tex-math></inline-formula>\n dataset, comprising 2,990 single images and 1,999 image pairs each accompanied by an open-ended question about its low-level features; for \n<bold/>\n<i>description</i>\n<bold/>\n (A2), we propose the LLDescribe\n<inline-formula><tex-math>$^{+}$</tex-math></inline-formula>\n dataset, evaluating MLLMs for low-level descriptions on 499 single images and 450 pairs. Additionally, we evaluate MLLMs on \n<bold/>\n<i>assessment</i>\n<bold/>\n (A3) ability, \n<i>i.e.</i>\n predicting score, by employing a softmax-based approach to enable all MLLMs to generate \n<i>quantifiable</i>\n quality ratings, tested against human opinions in 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than single image evaluations (\n<i>like humans</i>\n). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10404-10418"},"PeriodicalIF":0.0000,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Q-Bench$^+$+: A Benchmark for Multi-Modal Foundation Models on Low-Level Vision From Single Images to Pairs\",\"authors\":\"Zicheng Zhang;Haoning Wu;Erli Zhang;Guangtao Zhai;Weisi Lin\",\"doi\":\"10.1109/TPAMI.2024.3445770\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The rapid development of Multi-modality Large Language Models (MLLMs) has navigated a paradigm shift in computer vision, moving towards versatile foundational models. However, evaluating MLLMs in \\n<i>low-level visual perception and understanding</i>\\n remains a yet-to-explore domain. To this end, we design benchmark settings to \\n<i>emulate human language responses</i>\\n related to low-level vision: the low-level visual \\n<i>perception</i>\\n (\\n<u>A1</u>\\n) \\n<i>via</i>\\n visual question answering related to low-level attributes (\\n<i>e.g. 
clarity, lighting</i>\\n); and the low-level visual \\n<i>description</i>\\n (\\n<u>A2</u>\\n), on evaluating MLLMs for low-level text descriptions. Furthermore, given that pairwise comparison can better avoid ambiguity of responses and has been adopted by many human experiments, we further extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to \\n<i>image pairs</i>\\n. Specifically, for \\n<i>perception</i>\\n (A1), we carry out the LLVisionQA\\n<inline-formula><tex-math>$^{+}$</tex-math></inline-formula>\\n dataset, comprising 2,990 single images and 1,999 image pairs each accompanied by an open-ended question about its low-level features; for \\n<bold/>\\n<i>description</i>\\n<bold/>\\n (A2), we propose the LLDescribe\\n<inline-formula><tex-math>$^{+}$</tex-math></inline-formula>\\n dataset, evaluating MLLMs for low-level descriptions on 499 single images and 450 pairs. Additionally, we evaluate MLLMs on \\n<bold/>\\n<i>assessment</i>\\n<bold/>\\n (A3) ability, \\n<i>i.e.</i>\\n predicting score, by employing a softmax-based approach to enable all MLLMs to generate \\n<i>quantifiable</i>\\n quality ratings, tested against human opinions in 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than single image evaluations (\\n<i>like humans</i>\\n). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs.\",\"PeriodicalId\":94034,\"journal\":{\"name\":\"IEEE transactions on pattern analysis and machine intelligence\",\"volume\":\"46 12\",\"pages\":\"10404-10418\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on pattern analysis and machine intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10643329/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10643329/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Q-Bench$^+$: A Benchmark for Multi-Modal Foundation Models on Low-Level Vision From Single Images to Pairs
The rapid development of Multi-modality Large Language Models (MLLMs) has driven a paradigm shift in computer vision, moving towards versatile foundational models. However, evaluating MLLMs on low-level visual perception and understanding remains a largely unexplored domain. To this end, we design benchmark settings to emulate human language responses related to low-level vision: low-level visual perception (A1), via visual question answering on low-level attributes (e.g., clarity, lighting); and low-level visual description (A2), which evaluates MLLMs on low-level text descriptions. Furthermore, given that pairwise comparison better avoids ambiguity in responses and has been adopted in many human studies, we extend the low-level perception-related question answering and description evaluations of MLLMs from single images to image pairs. Specifically, for perception (A1), we construct the LLVisionQA$^+$ dataset, comprising 2,990 single images and 1,999 image pairs, each accompanied by an open-ended question about its low-level features; for description (A2), we propose the LLDescribe$^+$ dataset, evaluating MLLMs on low-level descriptions for 499 single images and 450 image pairs. Additionally, we evaluate MLLMs on assessment (A3) ability, i.e., score prediction, by employing a softmax-based approach that enables all MLLMs to generate quantifiable quality ratings, tested against human opinions on 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than on single-image evaluations (as humans do). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs.
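As a rough illustration of the softmax-based scoring mentioned for the assessment (A3) task, the sketch below maps a model's logits for two opposing quality tokens (e.g., "good" vs. "poor") to a quantifiable rating in [0, 1]. The prompt wording, the `get_next_token_logits` helper, and the token ids are illustrative assumptions, not the paper's exact protocol.

```python
import math

def softmax_quality_score(logit_good: float, logit_poor: float) -> float:
    """Return the softmax probability of the 'good' token as a quality score.

    Equivalent to exp(l_good) / (exp(l_good) + exp(l_poor)), i.e. a sigmoid
    over the logit difference, yielding a continuous score in [0, 1].
    """
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logit_good, logit_poor)
    e_good = math.exp(logit_good - m)
    e_poor = math.exp(logit_poor - m)
    return e_good / (e_good + e_poor)

# Hypothetical usage (get_next_token_logits, good_token_id, and poor_token_id
# stand in for whatever interface exposes the MLLM's next-token logits after a
# quality prompt such as "The quality of the image is"):
#   logits = get_next_token_logits(image, prompt)
#   score = softmax_quality_score(logits[good_token_id], logits[poor_token_id])
```

Such a score can then be compared against human mean opinion scores on the IQA datasets; the exact prompt and token choices would follow the paper rather than this sketch.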