Benchmarking VLMs' Reasoning About Persuasive Atypical Images

Sina Malakouti, Aysan Aghazadeh, Ashmit Khandelwal, Adriana Kovashka
arXiv - CS - Multimedia, 2024-09-16. DOI: arxiv-2409.10719 (https://doi.org/arxiv-2409.10719)

Abstract

Vision language models (VLMs) have shown strong zero-shot generalization across various tasks, especially when integrated with large language models (LLMs). However, their ability to comprehend rhetorical and persuasive visual media, such as advertisements, remains understudied. Ads often employ atypical imagery, using surprising object juxtapositions to convey shared properties. For example, Fig. 1 (e) shows a beer with a feather-like texture. This requires advanced reasoning to deduce that this atypical representation signifies the beer's lightness. We introduce three novel tasks, Multi-label Atypicality Classification, Atypicality Statement Retrieval, and Atypical Object Recognition, to benchmark VLMs' understanding of atypicality in persuasive images. We evaluate how well VLMs use atypicality to infer an ad's message and test their reasoning abilities by employing semantically challenging negatives. Finally, we pioneer atypicality-aware verbalization by extracting comprehensive image descriptions sensitive to atypical elements. Our findings reveal that: (1) VLMs lack advanced reasoning capabilities compared to LLMs; (2) simple, effective strategies can extract atypicality-aware information, leading to comprehensive image verbalization; (3) atypicality aids persuasive advertisement understanding. Code and data will be made available.
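As a rough illustration of what scoring a statement-retrieval task against semantically challenging negatives could look like, the sketch below computes top-1 accuracy from per-image candidate scores. It is not the authors' evaluation code: the function name `top1_retrieval_accuracy`, the candidate statements, and the similarity scores are all hypothetical, and the abstract does not specify the actual metric or the format of the negatives.

```python
from typing import Dict, List


def top1_retrieval_accuracy(
    scores: List[Dict[str, float]], positives: List[str]
) -> float:
    """Fraction of images whose highest-scored candidate is the correct statement.

    scores[i] maps each candidate statement for image i to a model score
    (hypothetical values here); positives[i] is the correct statement.
    """
    if len(scores) != len(positives):
        raise ValueError("scores and positives must have the same length")
    if not scores:
        return 0.0
    hits = 0
    for candidate_scores, positive in zip(scores, positives):
        # Pick the candidate statement with the highest score for this image.
        best = max(candidate_scores, key=candidate_scores.get)
        hits += int(best == positive)
    return hits / len(scores)


# Made-up example: each image has one correct statement and hard negatives
# that reuse the same objects but change the relation between them.
example_scores = [
    {
        "The beer is as light as a feather.": 0.81,
        "The feather is as heavy as a beer.": 0.64,
        "The beer is as cold as ice.": 0.40,
    },
    {
        "The car is as fast as a cheetah.": 0.52,
        "The cheetah is as slow as a car.": 0.58,
    },
]
example_positives = [
    "The beer is as light as a feather.",
    "The car is as fast as a cheetah.",
]

accuracy = top1_retrieval_accuracy(example_scores, example_positives)
print(accuracy)
```

In this toy setup the second image is scored incorrectly because the hard negative, which swaps the roles of the two objects, outranks the correct statement. This is the kind of failure that semantically challenging negatives are meant to expose.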