Low-Shot Video Object Segmentation
Kun Yan; Fangyun Wei; Shuyu Dai; Minghui Wu; Ping Wang; Chang Xu
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 7, pp. 5538-5555, published 19 March 2025
DOI: 10.1109/TPAMI.2025.3552779
https://ieeexplore.ieee.org/document/10933555/
Abstract
Prior research in video object segmentation (VOS) predominantly relies on videos with dense annotations. However, obtaining pixel-level annotations is both costly and time-intensive. In this work, we highlight the potential of effectively training a VOS model with remarkably sparse video annotations—specifically, as few as one or two labeled frames per training video, while maintaining near-equivalent performance. We introduce this training methodology as low-shot video object segmentation, abbreviated as low-shot VOS. Central to this method is the generation of reliable pseudo labels for unlabeled frames during training; these pseudo labels are then used in tandem with the labeled frames to optimize the model. Notably, our strategy is extremely simple and can be incorporated into the vast majority of current VOS models. For the first time, we propose a universal method for training VOS models on one-shot and two-shot VOS datasets. In the two-shot configuration, using just 7.3% and 2.9% of the labeled data from the YouTube-VOS and DAVIS benchmarks respectively, our model delivers results on par with those trained on the fully labeled datasets. In the one-shot setting, only a minor performance decrement is observed relative to models trained on fully annotated datasets.
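The abstract's central idea is to generate pseudo labels for the unlabeled frames of each training video and optimize the model on them alongside the one or two ground-truth frames. The sketch below illustrates that general recipe in PyTorch; the toy segmentation head, the confidence threshold, and the way the two losses are combined are illustrative assumptions, not the paper's actual pipeline.

```python
# A minimal, hypothetical sketch of pseudo-label training for low-shot VOS:
# supervised loss on the sparsely annotated frames, plus a loss on unlabeled
# frames whose targets are the model's own high-confidence predictions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToySegHead(nn.Module):
    """Stand-in for a real VOS network: per-pixel logits for K classes."""

    def __init__(self, in_ch=3, num_classes=2):
        super().__init__()
        self.net = nn.Conv2d(in_ch, num_classes, kernel_size=3, padding=1)

    def forward(self, x):
        return self.net(x)  # (B, K, H, W)


def low_shot_loss(model, frames, labels, labeled_mask, conf_thresh=0.9):
    """Supervised loss on labeled frames + pseudo-label loss on the rest.

    frames:       (B, 3, H, W) frames of one training clip
    labels:       (B, H, W)    ground-truth masks (only valid where labeled)
    labeled_mask: (B,) bool    True for the one or two annotated frames
    """
    logits = model(frames)

    # Standard cross-entropy on the sparsely annotated frames.
    sup_loss = torch.tensor(0.0)
    if labeled_mask.any():
        sup_loss = F.cross_entropy(logits[labeled_mask], labels[labeled_mask])

    # Pseudo labels for unlabeled frames: keep only confident pixels.
    unsup_loss = torch.tensor(0.0)
    if (~labeled_mask).any():
        with torch.no_grad():
            probs = torch.softmax(logits[~labeled_mask], dim=1)
            conf, pseudo = probs.max(dim=1)       # (Bu, H, W)
            pseudo[conf < conf_thresh] = 255      # drop low-confidence pixels
        if (pseudo != 255).any():
            unsup_loss = F.cross_entropy(
                logits[~labeled_mask], pseudo, ignore_index=255
            )

    return sup_loss + unsup_loss


if __name__ == "__main__":
    model = ToySegHead()
    frames = torch.randn(4, 3, 64, 64)                 # four frames of one clip
    labels = torch.randint(0, 2, (4, 64, 64))
    labeled = torch.tensor([True, False, False, False])  # one-shot setting
    loss = low_shot_loss(model, frames, labels, labeled)
    loss.backward()
    print(f"combined loss: {loss.item():.4f}")
```

In this sketch the pseudo labels come from the same network being trained; the paper's claimed model-agnosticism suggests the same loss wrapper could be attached to most existing VOS architectures, which is the property the abstract emphasizes.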