Multi-modal Contextual Graph Neural Network for Text Visual Question Answering

Yaoyuan Liang, Xin Wang, Xuguang Duan, Wenwu Zhu
{"title":"Multi-modal Contextual Graph Neural Network for Text Visual Question Answering","authors":"Yaoyuan Liang, Xin Wang, Xuguang Duan, Wenwu Zhu","doi":"10.1109/ICPR48806.2021.9412891","DOIUrl":null,"url":null,"abstract":"Text visual question answering (TextVQA) targets at answering the question related to texts appearing in the given images, posing more challenges than VQA by requiring a deeper recognition and understanding of various shapes of human-readable scene texts as well as their meanings in different contexts. Existing works on TextVQA suffer from two weaknesses: i) scene texts and non-textual objects are processed separately and independently without considering their mutual interactions during the question understanding and answering process, ii) scene texts are encoded only through word embeddings without taking the corresponding visual appearance features as well as their potential relationships with other non-textual objects in the images into account. To overcome the weakness of existing works, we propose a novel multi-modal contextual graph neural network (MCG) model for TextVQA. The proposed MCG model can capture the relationships between visual features of scene texts and non-textual objects in the given images as well as utilize richer sources of multi-modal features to improve the model performance. In particular, we encode the scene texts into richer features containing textual, visual and positional features, then model the visual relations between scene texts and non-textual objects through a contextual graph neural network. Our extensive experiments on real-world dataset demonstrate the advantages of the proposed MCG model over baseline approaches.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"79 1","pages":"3491-3498"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 25th International Conference on Pattern Recognition (ICPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPR48806.2021.9412891","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Text visual question answering (TextVQA) targets at answering the question related to texts appearing in the given images, posing more challenges than VQA by requiring a deeper recognition and understanding of various shapes of human-readable scene texts as well as their meanings in different contexts. Existing works on TextVQA suffer from two weaknesses: i) scene texts and non-textual objects are processed separately and independently without considering their mutual interactions during the question understanding and answering process, ii) scene texts are encoded only through word embeddings without taking the corresponding visual appearance features as well as their potential relationships with other non-textual objects in the images into account. To overcome the weakness of existing works, we propose a novel multi-modal contextual graph neural network (MCG) model for TextVQA. The proposed MCG model can capture the relationships between visual features of scene texts and non-textual objects in the given images as well as utilize richer sources of multi-modal features to improve the model performance. In particular, we encode the scene texts into richer features containing textual, visual and positional features, then model the visual relations between scene texts and non-textual objects through a contextual graph neural network. Our extensive experiments on real-world dataset demonstrate the advantages of the proposed MCG model over baseline approaches.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
文本视觉问答的多模态上下文图神经网络
文本视觉问答(TextVQA)的目标是回答与给定图像中出现的文本相关的问题,这比VQA更具挑战性,因为它需要更深入地识别和理解人类可读场景文本的各种形状及其在不同上下文中的含义。现有的TextVQA工作存在两个弱点:1)场景文本和非文本对象在理解和回答问题的过程中被分开、独立地处理,而没有考虑它们之间的相互作用;2)场景文本仅通过词嵌入进行编码,而没有考虑相应的视觉外观特征以及它们与图像中其他非文本对象的潜在关系。为了克服现有工作的不足,我们提出了一种新的TextVQA多模态上下文图神经网络(MCG)模型。所提出的MCG模型能够捕捉给定图像中场景文本和非文本对象的视觉特征之间的关系,并利用更丰富的多模态特征来源来提高模型的性能。特别是,我们将场景文本编码为包含文本、视觉和位置特征的更丰富的特征,然后通过上下文图神经网络对场景文本和非文本对象之间的视觉关系进行建模。我们在真实数据集上的大量实验证明了所提出的MCG模型相对于基线方法的优势。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Trajectory representation learning for Multi-Task NMRDP planning Semantic Segmentation Refinement Using Entropy and Boundary-guided Monte Carlo Sampling and Directed Regional Search A Randomized Algorithm for Sparse Recovery An Empirical Bayes Approach to Topic Modeling To Honor our Heroes: Analysis of the Obituaries of Australians Killed in Action in WWI and WWII
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1