In knowledge-based visual question answering, most current research focuses on integrating external knowledge into VQA systems. However, the extraction of visual features within knowledge-based VQA remains relatively unexplored. This is surprising, since even for the same image, answering different questions requires attention to different visual regions. In this paper, we propose a novel question-guided multigranular visual augmentation method for knowledge-based VQA tasks. Our method uses the input question to identify and focus on question-related regions within the image, which improves prediction quality. Specifically, our method first learns semantic embeddings of the question at both the word level and the phrase level. To preserve rich visual information for question answering, it then uses the question as a guide to extract question-related visual features. This is implemented with multiple convolution operations whose kernels are dynamically derived from the question representations. By capturing visual information from diverse perspectives, our method extracts information at the word, phrase, and common levels more comprehensively. Additionally, relevant knowledge is retrieved from a knowledge graph through entity linking and random walk techniques to answer the question. A series of experiments on public knowledge-based VQA datasets demonstrates the effectiveness of our model, and the experimental results show that our method achieves state-of-the-art performance.
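To make the question-guided convolution concrete, the sketch below shows one way the kernels could be dynamically derived from a question representation: a linear layer predicts a per-example 1x1 kernel from the question embedding, which is then applied to the image feature map via a grouped convolution. This is a minimal illustration under assumed module names and tensor shapes, not the paper's actual implementation; in the proposed method, separate word-level and phrase-level question embeddings would each drive such an operation to produce multigranular visual features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuestionGuidedConv(nn.Module):
    """Illustrative sketch: a 1x1 convolution whose kernel is predicted
    from the question embedding, so the same image features are filtered
    differently for different questions. Names and dimensions are
    assumptions for illustration only."""

    def __init__(self, q_dim: int, v_dim: int, out_dim: int):
        super().__init__()
        self.v_dim = v_dim
        self.out_dim = out_dim
        # Predict an (out_dim x v_dim) 1x1 kernel from the question vector.
        self.kernel_gen = nn.Linear(q_dim, out_dim * v_dim)

    def forward(self, v_feat: torch.Tensor, q_emb: torch.Tensor) -> torch.Tensor:
        # v_feat: (B, v_dim, H, W) image feature map
        # q_emb:  (B, q_dim) question embedding (e.g. word- or phrase-level)
        B, _, H, W = v_feat.shape
        kernels = self.kernel_gen(q_emb)                      # (B, out_dim * v_dim)
        # Apply a different kernel to each example via a grouped convolution.
        v_flat = v_feat.reshape(1, B * self.v_dim, H, W)
        k_flat = kernels.reshape(B * self.out_dim, self.v_dim, 1, 1)
        out = F.conv2d(v_flat, k_flat, groups=B)              # (1, B * out_dim, H, W)
        return F.relu(out.reshape(B, self.out_dim, H, W))


# Example usage with toy shapes (purely hypothetical values):
# word_conv = QuestionGuidedConv(q_dim=512, v_dim=2048, out_dim=512)
# v = torch.randn(4, 2048, 7, 7)   # image features
# q = torch.randn(4, 512)          # word-level question embedding
# word_view = word_conv(v, q)      # question-conditioned visual features
```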
