Weakly Supervised Referring Video Object Segmentation With Object-Centric Pseudo-Guidance

IEEE Transactions on Multimedia (IF 9.7; CAS Zone 1, Computer Science; JCR Q1, Computer Science, Information Systems) Pub Date: 2024-12-23 DOI: 10.1109/TMM.2024.3521741
Weikang Wang;Yuting Su;Jing Liu;Wei Sun;Guangtao Zhai
Volume 27, pages 1320-1333. Citations: 0. Full text: https://ieeexplore.ieee.org/document/10812857/

Abstract

Referring video object segmentation (RVOS) is an emerging task in multimodal video comprehension, but the expensive annotation of object masks restricts the scalability and diversity of RVOS datasets. To relax the dependency on expensive mask annotations and take advantage of large-scale partially annotated data, in this paper, we explore a novel extended RVOS task, namely weakly supervised referring video object segmentation (WRVOS), which employs multiple weak supervision sources, including object points and bounding boxes. Correspondingly, we propose a unified WRVOS framework. Specifically, an object-centric pseudo mask generation method is introduced to provide effective shape priors for the pseudo guidance of spatial object location. Then, a pseudo-guided optimization strategy is proposed to optimize the object outlines in terms of spatial location and projection density through multi-stage online learning. Furthermore, a multimodal cross-frame level set evolution method is proposed to iteratively refine the object boundaries, considering both temporal consistency and cross-modal interactions. Extensive experiments are conducted on four publicly available RVOS datasets: A2D Sentences, J-HMDB Sentences, Ref-DAVIS, and Ref-YoutubeVOS. Performance comparisons show that the proposed method achieves state-of-the-art performance in both point-supervised and box-supervised settings.
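The abstract does not give the loss functions, so the following is only a rough illustration of how a bounding box alone can constrain a predicted mask. It uses a projection-consistency term that is common in box-supervised segmentation: the max-projections of the predicted mask onto the x and y axes should match those of the box. This is a generic technique, not the authors' formulation, and all function names here are hypothetical.

```python
import numpy as np


def box_to_mask(box, height, width):
    """Rasterize a box (x0, y0, x1, y1), end-exclusive, into a binary mask."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.float32)
    mask[y0:y1, x0:x1] = 1.0
    return mask


def dice_loss(pred, target, eps=1e-6):
    """Dice loss between two 1-D projections with values in [0, 1]."""
    inter = float((pred * target).sum())
    denom = float((pred ** 2).sum() + (target ** 2).sum())
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def projection_loss(prob_mask, box):
    """Compare the axis projections of a soft mask with those of a box.

    Assumption: prob_mask is an (H, W) array of foreground probabilities.
    """
    height, width = prob_mask.shape
    box_mask = box_to_mask(box, height, width)
    # Max over rows projects onto the x axis; max over columns onto the y axis.
    loss_x = dice_loss(prob_mask.max(axis=0), box_mask.max(axis=0))
    loss_y = dice_loss(prob_mask.max(axis=1), box_mask.max(axis=1))
    return loss_x + loss_y


if __name__ == "__main__":
    box = (2, 3, 8, 9)
    # A cross that touches all four sides of the box but does not fill it.
    cross = np.zeros((12, 12), dtype=np.float32)
    cross[5, 2:8] = 1.0
    cross[3:9, 4] = 1.0
    print(projection_loss(cross, box))
    print(projection_loss(np.zeros((12, 12), dtype=np.float32), box))
```

In this sketch the cross-shaped mask scores as well as a filled box, because the projection term constrains only the extent of the mask and not its shape. That gap is one plausible reason a box-supervised method would benefit from an additional shape prior such as the pseudo masks described in the abstract.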
Source journal
IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)
CiteScore: 11.70
Self-citation rate: 11.00%
Articles published: 576
Review time: 5.5 months
Journal description: The IEEE Transactions on Multimedia covers diverse aspects of multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. Its scope aligns with the Fields of Interest of the sponsors.