RK-VQA: Rational knowledge-aware fusion-in-decoder for knowledge-based visual question answering
Weipeng Chen, Xu Huang, Zifeng Liu, Jin Liu, Lan Yo
Information Fusion, Volume 118, Article 102969. DOI: 10.1016/j.inffus.2025.102969. Published: 2025-02-05.
https://www.sciencedirect.com/science/article/pii/S1566253525000429
Citations: 0
Abstract
Knowledge-based Visual Question Answering (KB-VQA) expands traditional VQA by drawing on world knowledge from external sources when the image alone is insufficient to infer a correct answer. Existing methods suffer from low recall rates, limiting their ability to gather the information needed for accurate answers. While increasing the number of retrieved knowledge entries can improve recall, it often introduces irrelevant information that impairs model performance. To overcome these challenges, we propose RK-VQA, which comprises two components. First, a zero-shot weighted hybrid knowledge retrieval method integrates local and global visual features with textual features from image–question pairs, enhancing the quality of knowledge retrieval and improving recall rates. Second, a rational knowledge-aware Fusion-in-Decoder architecture improves answer generation by focusing on rational knowledge and reducing the influence of irrelevant information. Specifically, we develop a rational module to extract rational features, which are then used to prioritize pertinent information via a novel rational knowledge-aware attention mechanism. We evaluate RK-VQA on OK-VQA, the largest knowledge-based VQA dataset. RK-VQA achieves an accuracy of 64.11%, surpassing the previous best result by 2.03%.
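The abstract describes the retrieval step as a weighted hybrid of local visual, global visual, and textual features. The paper's exact formulation is not given here, so the following is only a minimal illustrative sketch: it assumes the hybrid relevance score is a weighted sum of cosine similarities between query-side embeddings and knowledge-entry embeddings. The mixing weights, the max-over-regions aggregation, and all embeddings are hypothetical placeholders, not the authors' configuration.

```python
# Sketch of a weighted hybrid retrieval score in the spirit of RK-VQA's
# zero-shot retriever. All numbers, weights, and feature extractors here
# are placeholders for illustration only.
import numpy as np


def cosine_sim(query: np.ndarray, entries: np.ndarray) -> np.ndarray:
    """Cosine similarity between a query vector (d,) and entry matrix (n, d)."""
    q = query / (np.linalg.norm(query) + 1e-8)
    e = entries / (np.linalg.norm(entries, axis=1, keepdims=True) + 1e-8)
    return e @ q


def hybrid_retrieval_scores(
    global_visual: np.ndarray,   # (d,)   image-level feature
    local_visual: np.ndarray,    # (r, d) region-level features
    question_text: np.ndarray,   # (d,)   question embedding
    knowledge_embs: np.ndarray,  # (n, d) knowledge-entry embeddings
    weights=(0.4, 0.3, 0.3),     # hypothetical mixing weights
) -> np.ndarray:
    """Return one relevance score per knowledge entry."""
    w_global, w_local, w_text = weights
    s_global = cosine_sim(global_visual, knowledge_embs)
    # Aggregate region-level similarities by taking the best-matching region.
    s_local = np.max(
        np.stack([cosine_sim(region, knowledge_embs) for region in local_visual]),
        axis=0,
    )
    s_text = cosine_sim(question_text, knowledge_embs)
    return w_global * s_global + w_local * s_local + w_text * s_text


# Usage: rank a toy knowledge base of 5 entries for one image-question pair.
rng = np.random.default_rng(0)
d = 8
scores = hybrid_retrieval_scores(
    rng.normal(size=d),          # global visual feature
    rng.normal(size=(3, d)),     # 3 region features
    rng.normal(size=d),          # question embedding
    rng.normal(size=(5, d)),     # 5 knowledge entries
)
top_k = np.argsort(scores)[::-1][:3]  # indices of the 3 highest-scoring entries
```

The key design point the abstract emphasizes is that weighting several complementary similarity signals raises recall without having to enlarge the retrieved set, which is what would otherwise flood the reader (the Fusion-in-Decoder) with irrelevant entries.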
About the Journal
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines that drive its progress. It is the leading outlet for research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating application to real-world problems, are welcome.