Low-Shot Video Object Segmentation

Kun Yan, Fangyun Wei, Shuyu Dai, Minghui Wu, Ping Wang, Chang Xu
{"title":"Low-Shot Video Object Segmentation","authors":"Kun Yan;Fangyun Wei;Shuyu Dai;Minghui Wu;Ping Wang;Chang Xu","doi":"10.1109/TPAMI.2025.3552779","DOIUrl":null,"url":null,"abstract":"Prior research in video object segmentation (VOS) predominantly relies on videos with dense annotations. However, obtaining pixel-level annotations is both costly and time-intensive. In this work, we highlight the potential of effectively training a VOS model using remarkably sparse video annotations—specifically, as few as one or two labeled frames per training video, yet maintaining near equivalent performance levels. We introduce this innovative training methodology as low-shot video object segmentation, abbreviated as low-shot VOS. Central to this method is the generation of reliable pseudo labels for unlabeled frames during the training phase, which are then used in tandem with labeled frames to optimize the model. Notably, our strategy is extremely simple and can be incorporated into the vast majority of current VOS models. For the first time, we propose a universal method for training VOS models on one-shot and two-shot VOS datasets. In the two-shot configuration, utilizing just 7.3% and 2.9% of labeled data from the YouTube-VOS and DAVIS benchmarks respectively, our model delivers results on par with those trained on completely labeled datasets. It is also worth noting that in the one-shot setting, a minor performance decrement is observed in comparison to models trained on fully annotated datasets.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 7","pages":"5538-5555"},"PeriodicalIF":18.6000,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10933555/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Prior research in video object segmentation (VOS) predominantly relies on videos with dense annotations. However, obtaining pixel-level annotations is both costly and time-intensive. In this work, we highlight the potential of effectively training a VOS model using remarkably sparse video annotations—specifically, as few as one or two labeled frames per training video, yet maintaining near equivalent performance levels. We introduce this innovative training methodology as low-shot video object segmentation, abbreviated as low-shot VOS. Central to this method is the generation of reliable pseudo labels for unlabeled frames during the training phase, which are then used in tandem with labeled frames to optimize the model. Notably, our strategy is extremely simple and can be incorporated into the vast majority of current VOS models. For the first time, we propose a universal method for training VOS models on one-shot and two-shot VOS datasets. In the two-shot configuration, utilizing just 7.3% and 2.9% of labeled data from the YouTube-VOS and DAVIS benchmarks respectively, our model delivers results on par with those trained on completely labeled datasets. It is also worth noting that in the one-shot setting, a minor performance decrement is observed in comparison to models trained on fully annotated datasets.
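The abstract's central idea is to pseudo-label the unlabeled frames and train on them together with the one or two annotated frames. Below is a minimal, hedged sketch of that general recipe, not the authors' exact method: the tiny per-frame segmenter, the random tensors standing in for video frames, and the confidence threshold are all illustrative placeholders.

```python
# Sketch of pseudo-label-based low-shot training (illustrative, not the paper's
# exact pipeline): fit a segmenter on the few labeled frames, produce
# confidence-filtered pseudo masks for unlabeled frames, then optimize on both.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySegmenter(nn.Module):
    """Stand-in per-frame segmentation model (real VOS models also use memory/matching)."""
    def __init__(self, channels=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 3, padding=1),  # binary object logits
        )

    def forward(self, x):
        return self.net(x)

def train_step(model, optimizer, frames, masks):
    """One supervised update on (frames, masks)."""
    optimizer.zero_grad()
    loss = F.binary_cross_entropy_with_logits(model(frames), masks)
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def pseudo_label(model, frames, threshold=0.9):
    """Hard pseudo masks plus a per-pixel reliability weight (confident pixels only)."""
    probs = torch.sigmoid(model(frames))
    pseudo = (probs > 0.5).float()
    weight = ((probs > threshold) | (probs < 1 - threshold)).float()
    return pseudo, weight

# Toy usage with random tensors standing in for a training clip.
model = TinySegmenter()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

labeled_frames = torch.rand(2, 3, 64, 64)    # e.g. the one or two annotated frames
labeled_masks = (torch.rand(2, 1, 64, 64) > 0.5).float()
unlabeled_frames = torch.rand(8, 3, 64, 64)  # remaining frames of the clip

# Phase 1: fit on the sparse labeled frames only.
for _ in range(10):
    train_step(model, optimizer, labeled_frames, labeled_masks)

# Phase 2: pseudo-label the unlabeled frames, then train on both sources together.
pseudo_masks, weights = pseudo_label(model, unlabeled_frames)
for _ in range(10):
    optimizer.zero_grad()
    sup_loss = F.binary_cross_entropy_with_logits(model(labeled_frames), labeled_masks)
    unsup_loss = (F.binary_cross_entropy_with_logits(
        model(unlabeled_frames), pseudo_masks, reduction="none") * weights).mean()
    (sup_loss + unsup_loss).backward()
    optimizer.step()
```

The confidence weighting is one common way to keep pseudo labels "reliable" in the abstract's sense; the paper's actual mechanism for generating and filtering pseudo labels may differ.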