You Only Look & Listen Once: Towards Fast and Accurate Visual Grounding

2022 IEEE 42nd International Conference on Distributed Computing Systems Workshops (ICDCSW)
Pub Date: 2022-07-01
DOI: 10.1109/ICDCSW56584.2022.00035

Qing Du, Yucheng Luo



Abstract

Visual Grounding (VG) aims to locate the most relevant region in an image based on a flexible natural language query rather than a pre-defined label, which makes it a useful technique in practice. Most VG methods operate in two stages: in the first stage, an object detector generates a set of object proposals from the input image; in the second stage, the task is formulated as a cross-modal matching problem. The first stage may produce hundreds of proposals that must each be compared in the second stage, which is infeasible for real-time VG applications, and the performance of the second stage may be affected by the first stage. In this paper, we propose a more elegant one-stage, detection-based method that joins the region proposal and matching stages into a single detection network. The detection is conditioned on the input query through a stack of novel Relation-to-Attention modules that transform the image-to-query relationship into a relation map, which is used to predict the bounding box directly without generating large numbers of useless region proposals. During inference, our approach is about 20x to 30x faster than previous methods and, remarkably, achieves comparable performance on several benchmark datasets.
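The abstract does not describe the internals of the Relation-to-Attention modules, so the following is only a minimal sketch of the general idea of conditioning an image feature map on a query in a single pass. It assumes a scaled dot-product relation score and a spatial softmax; the function names (`relation_map`, `softmax2d`, `attend`, `peak_location`) and the toy data are hypothetical and are not taken from the paper.

```python
import math

def relation_map(features, query):
    """Scaled dot product between every spatial feature vector and the query.

    features: H x W grid of C-dimensional lists; query: C-dimensional list.
    The scoring function is an assumption, not the paper's definition.
    """
    scale = math.sqrt(len(query))
    return [[sum(x * q for x, q in zip(cell, query)) / scale for cell in row]
            for row in features]

def softmax2d(scores):
    """Normalise a 2D score grid into attention weights that sum to 1."""
    peak = max(s for row in scores for s in row)  # subtract max for stability
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(e for row in exps for e in row)
    return [[e / total for e in row] for row in exps]

def attend(features, attention):
    """Reweight each feature vector by its attention weight."""
    return [[[a * x for x in cell] for cell, a in zip(frow, arow)]
            for frow, arow in zip(features, attention)]

def peak_location(attention):
    """Return the (row, col) of the most query-relevant grid cell."""
    best = max((a, (r, c))
               for r, row in enumerate(attention)
               for c, a in enumerate(row))
    return best[1]

# Toy 2 x 2 feature map with 2 channels; only cell (1, 0) aligns with the query.
features = [[[0.0, 1.0], [0.0, 1.0]],
            [[3.0, 0.0], [0.0, 1.0]]]
query = [1.0, 0.0]

att = softmax2d(relation_map(features, query))
attended = attend(features, att)
print(peak_location(att))  # prints (1, 0)
```

In a one-stage design of this kind, the attended features would be passed to a detection head that regresses a single bounding box, so the cost does not grow with a number of region proposals. The sketch stops at the attention peak because the box-prediction head is not specified in the abstract.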

