Towards bias-aware visual question answering: Rectifying and mitigating comprehension biases

IF 7.5; Zone 1 (Computer Science); Q1 (Computer Science, Artificial Intelligence); Expert Systems with Applications; Pub Date: 2025-03-10; Epub Date: 2024-11-23; DOI: 10.1016/j.eswa.2024.125817

Chongqing Chen, Dezhi Han, Zihan Guo, Chin-Chen Chang

Volume 264, Article 125817. https://www.sciencedirect.com/science/article/pii/S0957417424026848


Abstract

Transformers have become essential for capturing intra- and inter-dependencies in visual question answering (VQA). Yet challenges remain in overcoming inherent comprehension biases and in improving the relational dependency modeling and reasoning capabilities crucial for VQA tasks. This paper presents RMCB, a novel VQA model designed to mitigate these biases by integrating contextual information from both visual and linguistic sources and addressing potential comprehension limitations at each end. RMCB introduces enhanced relational modeling for language tokens by leveraging textual context, addressing comprehension biases arising from the isolated pairwise modeling of token relationships. For the visual component, RMCB systematically incorporates both absolute and relative spatial relational information as contextual cues for image tokens, refining dependency modeling and strengthening inferential reasoning to alleviate biases caused by limited contextual understanding. Evaluated on the benchmark datasets VQA-v2 and CLEVR, the model achieves state-of-the-art results with accuracies of 71.78% and 99.27%, respectively. These results underscore RMCB’s capability to effectively address comprehension biases while advancing the relational reasoning needed for VQA.
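The abstract does not specify how the absolute and relative spatial information is encoded. As a rough illustration only, the sketch below uses a log-ratio box encoding that is common in the object-detection literature; it is not taken from RMCB. The function names, the `(x1, y1, x2, y2)` box format, and the `eps` clamp are all assumptions made for this example.

```python
import math

# Illustrative only: a generic box-geometry encoding, NOT the RMCB formulation.

def absolute_features(box):
    """Absolute cue for one image region: (center_x, center_y, width, height).

    `box` is assumed to be (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return (x1 + w / 2, y1 + h / 2, w, h)


def relative_features(box_i, box_j, eps=1e-3):
    """Relative cue describing box_j as seen from box_i.

    Center offsets are normalized by the size of box_i and log-scaled;
    `eps` (an assumed constant) avoids log(0) when centers coincide.
    """
    cxi, cyi, wi, hi = absolute_features(box_i)
    cxj, cyj, wj, hj = absolute_features(box_j)
    return (
        math.log(max(abs(cxj - cxi), eps) / wi),
        math.log(max(abs(cyj - cyi), eps) / hi),
        math.log(wj / wi),
        math.log(hj / hi),
    )


def pairwise_relative(boxes):
    """N x N table of relative features for every ordered pair of regions."""
    return [[relative_features(bi, bj) for bj in boxes] for bi in boxes]
```

In a transformer-style model, features like these would typically be projected and supplied to attention as additional context for each image token; how RMCB does this is described in the full paper, not in the abstract.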


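Source journal

Expert Systems with Applications (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 13.80

Self-citation rate: 10.60%

Articles published: 2045

Review time: 8.7 months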

About the journal: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.