Weakly Supervised Referring Video Object Segmentation With Object-Centric Pseudo-Guidance

IEEE Transactions on Multimedia (IF 9.7; CAS Zone 1, Computer Science; JCR Q1, Computer Science, Information Systems). Pub Date: 2024-12-23. DOI: 10.1109/TMM.2024.3521741

Weikang Wang;Yuting Su;Jing Liu;Wei Sun;Guangtao Zhai

IEEE Transactions on Multimedia, vol. 27, pp. 1320-1333. https://ieeexplore.ieee.org/document/10812857/

Abstract

Referring video object segmentation (RVOS) is an emerging task in multimodal video comprehension, but the expensive process of annotating object masks restricts the scalability and diversity of RVOS datasets. To reduce the dependency on expensive mask annotations and take advantage of large-scale partially annotated data, we explore a novel extension of the RVOS task, namely weakly supervised referring video object segmentation (WRVOS), which employs multiple weak supervision sources, including object points and bounding boxes. Correspondingly, we propose a unified WRVOS framework. Specifically, an object-centric pseudo mask generation method is introduced to provide effective shape priors for the pseudo guidance of spatial object location. Then, a pseudo-guided optimization strategy optimizes the object outlines in terms of spatial location and projection density through multi-stage online learning. Furthermore, a multimodal cross-frame level set evolution method iteratively refines the object boundaries, considering both temporal consistency and cross-modal interactions. Extensive experiments are conducted on four publicly available RVOS datasets: A2D Sentences, J-HMDB Sentences, Ref-DAVIS, and Ref-YoutubeVOS. Performance comparisons show that the proposed method achieves state-of-the-art performance in both point-supervised and box-supervised settings.
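
To make the two weak supervision sources concrete, the following is a minimal sketch of how a bounding box or a single object point could be turned into a coarse pseudo mask, and how a predicted mask could be compared against a box through its axis projections. It is a generic illustration of box and point supervision and is not the paper's method: the abstract does not specify the pseudo mask generator or define "projection density", so every function name, the Gaussian prior, and the Dice-style loss below are assumptions made for illustration only.

```python
import math


def box_pseudo_mask(height, width, box):
    """Hypothetical helper: fill a rectangle as a coarse pseudo mask.

    box = (x0, y0, x1, y1) in pixel coordinates, end-exclusive.
    """
    x0, y0, x1, y1 = box
    return [[1.0 if (x0 <= x < x1 and y0 <= y < y1) else 0.0
             for x in range(width)] for y in range(height)]


def point_pseudo_mask(height, width, point, sigma=2.0):
    """Hypothetical helper: Gaussian prior centred on an annotated point (x, y)."""
    px, py = point
    return [[math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * sigma ** 2))
             for x in range(width)] for y in range(height)]


def axis_projections(mask):
    """Max-project a 2D mask onto the vertical and horizontal axes."""
    rows = [max(row) for row in mask]          # one value per image row
    cols = [max(col) for col in zip(*mask)]    # one value per image column
    return rows, cols


def dice_loss(pred, target, eps=1e-6):
    """1 - Dice coefficient between two equally long 1D profiles."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(p * p for p in pred) + sum(t * t for t in target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def projection_loss(pred_mask, box_mask):
    """Assumed box-supervision term: projections of the prediction should
    match the projections of the box on both axes."""
    pred_rows, pred_cols = axis_projections(pred_mask)
    box_rows, box_cols = axis_projections(box_mask)
    return dice_loss(pred_rows, box_rows) + dice_loss(pred_cols, box_cols)


if __name__ == "__main__":
    box = box_pseudo_mask(6, 8, (2, 1, 5, 4))
    # A plus-shaped prediction that touches all four sides of the box.
    plus = [[0.0] * 8 for _ in range(6)]
    for x, y in [(3, 1), (2, 2), (3, 2), (4, 2), (3, 3)]:
        plus[y][x] = 1.0
    print(projection_loss(box, box), projection_loss(plus, box))
```

In this sketch the plus-shaped prediction and the full rectangle have identical projections, so a projection term alone cannot tell them apart. This is one way to see why box supervision is usually combined with an additional shape or boundary cue, which is the role the abstract assigns to the pseudo masks and the level set refinement.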

Source journal

IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)

CiteScore: 11.70
Self-citation rate: 11.00%
Articles published: 576
Review time: 5.5 months

About the journal: The IEEE Transactions on Multimedia covers diverse aspects of multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.