MWVOS: Mask-Free Weakly Supervised Video Object Segmentation via promptable foundation model

IF 7.6; Zone 1 (Computer Science); JCR Q1 (Computer Science, Artificial Intelligence); Pattern Recognition; Pub Date: 2025-03-01; Epub Date: 2024-10-31; DOI: 10.1016/j.patcog.2024.111100

Zhenghao Zhang, Shengfan Zhang, Zuozhuo Dai, Zilong Dong, Siyu Zhu

Pattern Recognition, volume 159, Article 111100. Journal Article. https://www.sciencedirect.com/science/article/pii/S0031320324008513

Citations: 0

Abstract

Current state-of-the-art techniques for video object segmentation require extensive training on video datasets with mask annotations, which limits their zero-shot transfer to new image distributions and tasks. However, recent advances in foundation models, particularly for image segmentation, have shown robust generalization and introduced a prompt-driven paradigm for a variety of downstream segmentation tasks on new data distributions. This study explores the potential of vision foundation models under diverse prompt strategies and proposes a mask-free approach for unsupervised video object segmentation. To further improve the efficacy of prompt learning in diverse and complex video scenes, we introduce a spatial–temporal decoupled deformable attention mechanism to establish an effective correlation between intra- and inter-frame features. Extensive experiments on the DAVIS2017-unsupervised, YoutubeVIS19&21, and OIVS datasets show that the proposed approach, without mask supervision, outperforms existing mask-supervised methods and generalizes to weakly annotated video datasets.

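Source journal

Pattern Recognition (Engineering & Technology; Engineering, Electrical & Electronic)

CiteScore: 14.40

Self-citation rate: 16.20%

Articles published: 683

Review time: 5.6 months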

About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.