Graph-enhanced visual representations and question-guided dual attention for visual question answering

IF 5.5 · JCR Q1 (Computer Science, Artificial Intelligence) · CAS Tier 2 (Computer Science) · Neurocomputing · Pub Date: 2024-11-07 · DOI: 10.1016/j.neucom.2024.128850
Abdulganiyu Abdu Yusuf, Chong Feng, Xianling Mao, Yunusa Haruna, Xinyan Li, Ramadhani Ally Duma
{"title":"Graph-enhanced visual representations and question-guided dual attention for visual question answering","authors":"Abdulganiyu Abdu Yusuf ,&nbsp;Chong Feng ,&nbsp;Xianling Mao ,&nbsp;Yunusa Haruna ,&nbsp;Xinyan Li ,&nbsp;Ramadhani Ally Duma","doi":"10.1016/j.neucom.2024.128850","DOIUrl":null,"url":null,"abstract":"<div><div>Visual Question Answering (VQA) has witnessed significant advancements recently, due to the application of deep learning in the field of vision-language research. Most current VQA models focus on merging visual and text features, but it is essential for these models to also consider the relationships between different parts of an image and use question information to highlight important features. This study proposes a method to enhance neighboring image region features and learn question-aware visual representations. First, we construct a region graph to represent spatial relationships between objects in the image. Then, graph convolutional network (GCN) is used to propagate information across neighboring regions, enriching each region’s feature representation by integrating contextual information. To capture long-range dependencies, the graph is enhanced with random walk with restart (RWR), enabling multi-hop reasoning across distant regions. Furthermore, a question-aware dual attention mechanism is introduced to further refine region features at both region and feature levels, ensuring that the model emphasizes key regions that are critical for answering the question. The enhanced region representations are then combined with the encoded question to predict an answer. Through extensive experiments on VQA benchmarks, the study demonstrates state-of-the-art performance by leveraging regional dependencies and question guidance. The integration of GCNs and random walks in the graph helps capture contextual information to focus visual attention selectively, resulting in significant improvements over existing methods on VQA 1.0 and VQA 2.0 benchmark datasets.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128850"},"PeriodicalIF":5.5000,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231224016217","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Visual Question Answering (VQA) has seen significant advances recently, driven by the application of deep learning to vision-language research. Most current VQA models focus on merging visual and text features, but it is also essential for these models to consider the relationships between different parts of an image and to use question information to highlight important features. This study proposes a method that enhances neighboring image-region features and learns question-aware visual representations. First, we construct a region graph to represent spatial relationships between objects in the image. Then, a graph convolutional network (GCN) propagates information across neighboring regions, enriching each region's feature representation with contextual information. To capture long-range dependencies, the graph is augmented with random walk with restart (RWR), enabling multi-hop reasoning across distant regions. Furthermore, a question-aware dual attention mechanism refines region features at both the region and feature levels, ensuring that the model emphasizes the regions most critical for answering the question. The enhanced region representations are then combined with the encoded question to predict an answer. Extensive experiments on VQA benchmarks demonstrate state-of-the-art performance from leveraging regional dependencies and question guidance. Integrating GCNs and random walks over the region graph helps capture contextual information and focus visual attention selectively, yielding significant improvements over existing methods on the VQA 1.0 and VQA 2.0 benchmark datasets.
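The abstract describes a four-stage pipeline: build a spatial region graph, diffuse multi-hop context with random walk with restart (whose closed-form affinity matrix is R = c(I - (1 - c)P)^{-1} for row-normalized transition matrix P and restart probability c), propagate features with a GCN, and apply question-guided attention at both the region and feature levels. The paper's code is not included on this page, so the following is a minimal PyTorch sketch of such a pipeline under stated assumptions: the names (rwr_enhanced_adjacency, GraphEnhancedDualAttention), the restart probability, the single-layer GCN, and the sigmoid feature gate are illustrative choices, not the authors' implementation.

```python
# Illustrative sketch only: names, dimensions, and the exact attention
# formulation are assumptions for exposition, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def rwr_enhanced_adjacency(A: torch.Tensor, restart_p: float = 0.15) -> torch.Tensor:
    """Closed-form random walk with restart (RWR) over a region graph.

    A: (N, N) spatial adjacency between image regions. Returns a dense
    (N, N) affinity matrix whose (i, j) entry reflects multi-hop
    reachability, capturing long-range dependencies between regions.
    """
    deg = A.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    P = A / deg                               # row-normalized transition matrix
    I = torch.eye(A.size(0), device=A.device)
    # R = c * (I - (1 - c) P)^{-1}: steady-state RWR affinities.
    return restart_p * torch.linalg.inv(I - (1.0 - restart_p) * P)


class GraphEnhancedDualAttention(nn.Module):
    """One GCN layer over the RWR-enhanced graph, then question-guided
    attention at the region level (softmax over regions) and the feature
    level (sigmoid gate over channels)."""

    def __init__(self, d_region: int, d_question: int, d_hidden: int = 512):
        super().__init__()
        self.gcn = nn.Linear(d_region, d_hidden)
        self.region_att = nn.Linear(d_hidden + d_question, 1)
        self.feature_gate = nn.Linear(d_question, d_hidden)

    def forward(self, regions: torch.Tensor, A: torch.Tensor, q: torch.Tensor):
        # regions: (N, d_region) region features; A: (N, N); q: (d_question,)
        S = rwr_enhanced_adjacency(A)                 # multi-hop affinities
        H = F.relu(self.gcn(S @ regions))             # context-enriched regions
        q_tiled = q.unsqueeze(0).expand(H.size(0), -1)
        # Region-level attention conditioned on the question.
        alpha = torch.softmax(self.region_att(torch.cat([H, q_tiled], dim=-1)), dim=0)
        # Feature-level gating conditioned on the question.
        gate = torch.sigmoid(self.feature_gate(q))
        return (alpha * H * gate).sum(dim=0)          # fused visual vector


# Toy usage: 5 regions with 2048-d features and a 512-d question encoding.
regions = torch.randn(5, 2048)
A = (torch.rand(5, 5) > 0.5).float() + torch.eye(5)   # adjacency with self-loops
q = torch.randn(512)
v = GraphEnhancedDualAttention(2048, 512)(regions, A, q)  # shape: (512,)
```

The closed-form inverse is practical in this setting because region graphs are small (tens of regions per image); for larger graphs, iterating r ← (1 - c)Pr + ce to convergence avoids the O(N³) inversion.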
Source Journal

Neurocomputing (Engineering & Technology: Computer Science, Artificial Intelligence)
CiteScore: 13.10
Self-citation rate: 10.00%
Articles published per year: 1382
Review time: 70 days
Aims and scope: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing, covering neurocomputing theory, practice, and applications.