End-to-end Visual Grounding Based on Query Text Guidance and Multi-stage Reasoning

Chao Wang, Wei Luo, Jia-Rui Zhu, Ying-Chun Xia, Jin He, Li-Chuan Gu
{"title":"End-to-end Visual Grounding Based on Query Text Guidance and Multi-stage Reasoning","authors":"Chao Wang Chao Wang, Wei Luo Chao Wang, Jia-Rui Zhu Wei Luo, Ying-Chun Xia Jia-Rui Zhu, Jin He Ying-Chun Xia, Li-Chuan Gu Jin He","doi":"10.53106/199115992024023501006","DOIUrl":null,"url":null,"abstract":"\n Visual grounding locates target objects or areas in the image based on natural language expression. Most current methods extract visual features and text embeddings independently, and then carry out complex fusion reasoning to locate target objects mentioned in the query text. However, such independently extracted visual features often contain many features that are irrelevant to the query text or misleading, thus affecting the subsequent multimodal fusion module, and deteriorating target localization. This study introduces a combined network model based on the transformer architecture, which realizes more accurate visual grounding by using query text to guide visual feature generation and multi-stage fusion reasoning. Specifically, the visual feature generation module reduces the interferences of irrelevant features and generates visual features related to query text through the guidance of query text features. The multi-stage fused reasoning module uses the relevant visual features obtained by the visual feature generation module and the query text embeddings for multi-stage interactive reasoning, further infers the correlation between the target image and the query text, so as to achieve the accurate localization of the object described by the query text. The effectiveness of the proposed model is experimentally verified on five public datasets and the model outperforms state-of-the-art methods. It achieves an improvement of 1.04%, 2.23%, 1.00% and +2.51% over the previous state-of-the-art methods in terms of the top-1 accuracy on TestA and TestB of the RefCOCO and RefCOCO+ datasets, respectively.\n \n","PeriodicalId":345067,"journal":{"name":"電腦學刊","volume":"49 9","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"電腦學刊","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.53106/199115992024023501006","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Visual grounding locates target objects or regions in an image based on a natural language expression. Most current methods extract visual features and text embeddings independently and then carry out complex fusion reasoning to locate the target object mentioned in the query text. However, such independently extracted visual features often contain many features that are irrelevant to the query text or are misleading, which degrades the subsequent multimodal fusion module and deteriorates target localization. This study introduces a combined network model based on the transformer architecture that achieves more accurate visual grounding by using the query text to guide visual feature generation and multi-stage fusion reasoning. Specifically, the visual feature generation module, guided by the query text features, reduces the interference of irrelevant features and generates visual features related to the query text. The multi-stage fusion reasoning module takes the relevant visual features produced by the visual feature generation module together with the query text embeddings, performs multi-stage interactive reasoning, and further infers the correlation between the image and the query text so as to accurately localize the object described by the query text. The effectiveness of the proposed model is experimentally verified on five public datasets, where it outperforms state-of-the-art methods: it improves top-1 accuracy over the previous state of the art by 1.04%, 2.23%, 1.00%, and 2.51% on TestA and TestB of the RefCOCO and RefCOCO+ datasets, respectively.
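
The abstract gives no implementation details, but the two modules it describes map naturally onto standard transformer building blocks. Below is a minimal, hypothetical PyTorch sketch, not the authors' code: it assumes the query-text guidance is realized as cross-attention from visual tokens to text embeddings, and that multi-stage fusion reasoning is a stack of self-attention layers over a learnable regression token plus the guided visual and text tokens. All module names, dimensions, and the box-regression head are illustrative assumptions.

# Hypothetical sketch only; module designs are assumptions, not the paper's released code.
import torch
import torch.nn as nn


class TextGuidedVisualGeneration(nn.Module):
    """Suppress text-irrelevant visual features: visual tokens attend to query-text embeddings."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, Nv, D); text_tokens: (B, Nt, D)
        guided, _ = self.cross_attn(query=visual_tokens,
                                    key=text_tokens,
                                    value=text_tokens)
        # Residual connection keeps the original visual cues while
        # cross-attention re-weights them toward text-relevant content.
        return self.norm(visual_tokens + guided)


class MultiStageFusionReasoning(nn.Module):
    """Stacked fusion stages: a learnable regression token interacts with visual
    and text tokens through self-attention, then is regressed to a bounding box."""

    def __init__(self, dim: int = 256, heads: int = 8, stages: int = 3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.stages = nn.TransformerEncoder(layer, num_layers=stages)
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                      nn.Linear(dim, 4))  # (cx, cy, w, h)

    def forward(self, visual_tokens, text_tokens):
        b = visual_tokens.size(0)
        reg = self.reg_token.expand(b, -1, -1)
        fused = self.stages(torch.cat([reg, visual_tokens, text_tokens], dim=1))
        return self.box_head(fused[:, 0]).sigmoid()  # normalized box coordinates


if __name__ == "__main__":
    vis = torch.randn(2, 196, 256)   # e.g. 14x14 visual tokens from a backbone
    txt = torch.randn(2, 20, 256)    # query-text embeddings
    guided = TextGuidedVisualGeneration()(vis, txt)
    box = MultiStageFusionReasoning()(guided, txt)
    print(box.shape)  # torch.Size([2, 4])

In this sketch the guidance step filters the visual features before fusion, so the later reasoning stages operate only on text-relevant content; the actual attention design, number of stages, and prediction head in the paper may differ.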