A Masked Reference Token Supervision-Based Iterative Visual-Language Framework for Robust Visual Grounding

Impact Factor: 11.1; Zone 1 (Engineering & Technology); JCR Q1 (Engineering, Electrical & Electronic); IEEE Transactions on Circuits and Systems for Video Technology; Pub Date: 2024-08-30; DOI: 10.1109/TCSVT.2024.3452418

Chunlei Wang;Wenquan Feng;Shuchang Lyu;Guangliang Cheng;Xiangtai Li;Binghao Liu;Qi Zhao

Volume 35, issue 1, pages 75-90. IEEE Xplore: https://ieeexplore.ieee.org/document/10659810/

Citations: 0

Abstract

Visual Grounding (VG) has become a prominent task in recent years and has advanced significantly with the development of detection and vision transformers. However, existing VG methods struggle with inaccurate or irrelevant textual descriptions and tend to generate false-alarm objects. Moreover, existing methods fail to capture fine-grained features, localize accurately, and build a comprehensive understanding of context from the whole image and the textual description. To address these issues, we propose an Iterative Robust Visual Grounding (IR-VG) framework with a Multi-stage False-alarm Sensitive Decoder (MFSD) that prevents the generation of false-alarm objects when the model is presented with inaccurate expressions. The framework introduces Masked Reference based Centerpoint Supervision (MRCS) and Iterative Multi-level Vision-language Fusion (IMVF) to improve localization accuracy and visual-language alignment. To further investigate the factors that affect VG robustness, we release a robust VG benchmark with 24,000 instances and provide a detailed classification of false alarms by part of speech. Extensive experiments on existing state-of-the-art (SOTA) VG methods and foundation models show that existing models struggle with robust VG. Even foundation models pre-trained on large amounts of data have difficulty understanding inaccurate language descriptions. IR-VG handles false-alarm issues in robust VG well and achieves new SOTA results on the newly proposed robust VG datasets. Ablation studies and visualization experiments demonstrate the effectiveness of the proposed components. The proposed framework is also shown to be effective on five regular VG datasets. Code and models will be made publicly available at https://github.com/cv516Buaa/IR-VG.


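Source journal

IEEE Transactions on Circuits and Systems for Video Technology (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 13.80

Self-citation rate: 27.40%

Publication volume: 660

Review time: 5 months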

Journal description: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.
