Handling language prior and compositional reasoning issues in Visual Question Answering system

IF 5.5 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Neurocomputing | Pub Date: 2025-03-14 | DOI: 10.1016/j.neucom.2025.129906

Souvik Chowdhury, Badal Soni

Citations: 0

Abstract

Visual Question Answering (VQA) models often suffer from language bias, favoring common but incorrect answers, and struggle with compositional reasoning in complex queries. This paper proposes a unified approach using a multimodal large language model enhanced with adaptive prompts designed for specific tasks. Our method directly addresses these issues by reducing language bias and improving compositional reasoning. Extensive evaluations on benchmark datasets, including VQA v2.0, VQA-CP, TDIUC, GQA, Visual7W, TextVQA, and ST-VQA, show that our approach outperforms state-of-the-art models, achieving accuracy improvements of 8% to 9%. These results demonstrate the effectiveness of our method in enhancing VQA accuracy, making it a significant advancement toward more reliable and robust applications in real-world scenarios.

Source journal

Neurocomputing (Engineering and Technology - Computer Science: Artificial Intelligence)

CiteScore: 13.10

Self-citation rate: 10.00%

Articles published: 1382

Review time: 70 days

Journal description: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.