Sina Malakouti, Aysan Aghazadeh, Ashmit Khandelwal, Adriana Kovashka
Benchmarking VLMs' Reasoning About Persuasive Atypical Images
arXiv - CS - Multimedia, 2024-09-16. DOI: arxiv-2409.10719
Abstract
Vision language models (VLMs) have shown strong zero-shot generalization
across various tasks, especially when integrated with large language models
(LLMs). However, their ability to comprehend rhetorical and persuasive visual
media, such as advertisements, remains understudied. Ads often employ atypical
imagery, using surprising object juxtapositions to convey shared properties.
For example, Fig. 1 (e) shows a beer with a feather-like texture. Deducing that
this atypical representation signifies the beer's lightness requires advanced
reasoning. We introduce three novel tasks, Multi-label Atypicality
Classification, Atypicality Statement Retrieval, and Atypical Object
Recognition, to benchmark VLMs' understanding of atypicality in persuasive
images. We evaluate how well VLMs use atypicality to infer an ad's message and
test their reasoning abilities by employing semantically challenging negatives.
Finally, we pioneer atypicality-aware verbalization by extracting comprehensive
image descriptions sensitive to atypical elements. Our findings reveal that:
(1) VLMs lack advanced reasoning capabilities compared to LLMs; (2) simple,
effective strategies can extract atypicality-aware information, leading to
comprehensive image verbalization; (3) atypicality aids persuasive
advertisement understanding. Code and data will be made available.
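
To make the retrieval setting concrete, the following is a minimal sketch of how a statement-retrieval task with semantically challenging negatives might be scored. It is not the authors' evaluation code: the function names, the candidate statements, the numeric scores, and the top-k counting rule are all hypothetical and chosen only for illustration, loosely following the beer-and-feather example above.

```python
from typing import Dict, List, Set


def rank_candidates(scores: Dict[str, float]) -> List[str]:
    """Order candidate statements from highest to lowest score."""
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(scores, key=lambda s: (-scores[s], s))


def top_k_correct(scores: Dict[str, float], positives: Set[str], k: int) -> int:
    """Count how many of the k highest-scored statements are correct."""
    return sum(1 for s in rank_candidates(scores)[:k] if s in positives)


# Hypothetical scores a model might assign to candidate statements.
# Both the statements and the numbers are invented for this sketch.
example_scores = {
    "The beer is as light as a feather.": 0.81,  # correct statement
    "The beer is made from feathers.": 0.64,     # hard negative: same objects, wrong relation
    "The feather is as heavy as a beer.": 0.40,  # hard negative: swapped roles
}
example_positives = {"The beer is as light as a feather."}

hits = top_k_correct(example_scores, example_positives, k=1)
print(f"correct statements in top-1: {hits}")
```

The hard negatives here reuse the objects in the image and change only the relation between them, so a model that matches object names without reasoning about the atypical relation would be expected to score them close to the correct statement.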