You Only Look & Listen Once: Towards Fast and Accurate Visual Grounding

Qing Du, Yucheng Luo
Published in: 2022 IEEE 42nd International Conference on Distributed Computing Systems Workshops (ICDCSW)
Publication date: 2022-07-01
DOI: 10.1109/ICDCSW56584.2022.00035 (https://doi.org/10.1109/ICDCSW56584.2022.00035)

Abstract

Visual Grounding (VG) aims to locate the most relevant region in an image based on a flexible natural language query rather than a pre-defined label, which makes it a useful technique in practice. Most VG methods operate in two stages: in the first stage, an object detector generates a set of object proposals from the input image, and the second stage is formulated as a cross-modal matching problem. The first stage may produce hundreds of proposals that must each be compared in the second stage, which is infeasible for real-time VG applications, and the performance of the second stage may be affected by the first stage. In this paper, we propose a more elegant one-stage, detection-based method that joins the region proposal and matching stages into a single detection network. The detection is conditioned on the input query through a stack of novel Relation-to-Attention modules that transform the image-to-query relationship into a relation map, which is used to predict the bounding box directly, without generating large numbers of useless region proposals. During inference, our approach is about 20× to 30× faster than previous methods and, remarkably, achieves comparable performance on several benchmark datasets.
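
To make the idea of a relation map concrete, the following is a minimal sketch and not the authors' Relation-to-Attention module, whose internals the abstract does not specify. It assumes the image is represented as a grid of feature vectors and the query as a single embedding, scores each grid cell by cosine similarity to the query, and normalizes the scores with a softmax. The function names (`relation_map`, `attend`, `peak_cell`), the `temperature` parameter, and the toy data are all hypothetical; a real one-stage detector would regress a bounding box from the attended features rather than simply picking the peak cell.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)


def relation_map(grid_features, query_embedding, temperature=0.1):
    """Return an H x W map of weights (summing to 1) relating each cell to the query.

    Assumption: cosine similarity plus softmax stands in for the paper's
    image-to-query relationship, which is not described in the abstract.
    """
    h, w = len(grid_features), len(grid_features[0])
    flat = [
        cosine(grid_features[i][j], query_embedding) / temperature
        for i in range(h)
        for j in range(w)
    ]
    weights = softmax(flat)
    return [weights[i * w:(i + 1) * w] for i in range(h)]


def attend(grid_features, rel):
    """Reweight every cell's feature vector by its relation weight."""
    return [
        [[rel[i][j] * x for x in cell] for j, cell in enumerate(row)]
        for i, row in enumerate(grid_features)
    ]


def peak_cell(rel):
    """Toy stand-in for box prediction: the (row, col) with the highest weight."""
    cells = ((i, j) for i in range(len(rel)) for j in range(len(rel[0])))
    return max(cells, key=lambda ij: rel[ij[0]][ij[1]])


# Hypothetical 2 x 2 grid of 2-dimensional features and a query embedding.
features = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.7, 0.7]],
]
query = [0.0, 1.0]

rel = relation_map(features, query)
attended = attend(features, rel)
best = peak_cell(rel)
```

The point of the sketch is the cost profile: the query is compared against the feature grid once, in a single pass, instead of being matched against each of the hundreds of proposals a two-stage pipeline would produce.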