Exploiting Unlabeled Videos for Video-Text Retrieval via Pseudo-Supervised Learning

Yu Lu, Ruijie Quan, Linchao Zhu, Yi Yang
{"title":"Exploiting Unlabeled Videos for Video-Text Retrieval via Pseudo-Supervised Learning","authors":"Yu Lu;Ruijie Quan;Linchao Zhu;Yi Yang","doi":"10.1109/TIP.2024.3514352","DOIUrl":null,"url":null,"abstract":"Large-scale pre-trained vision-language models (e.g., CLIP) have shown incredible generalization performance in downstream tasks such as video-text retrieval (VTR). Traditional approaches have leveraged CLIP’s robust multi-modal alignment ability for VTR by directly fine-tuning vision and text encoders with clean video-text data. Yet, these techniques rely on carefully annotated video-text pairs, which are expensive and require significant manual effort. In this context, we introduce a new approach, Pseudo-Supervised Selective Contrastive Learning (PS-SCL). PS-SCL minimizes the dependency on manually-labeled text annotations by generating pseudo-supervisions from unlabeled video data for training. We first exploit CLIP’s visual recognition capabilities to generate pseudo-texts automatically. These pseudo-texts contain diverse visual concepts from the video and serve as weak textual guidance. Moreover, we introduce Selective Contrastive Learning (SeLeCT), which prioritizes and selects highly correlated video-text pairs from pseudo-supervised video-text pairs. By doing so, SeLeCT enables more effective multi-modal learning under weak pairing supervision. Experimental results demonstrate that our method outperforms CLIP zero-shot performance by a large margin on multiple video-text retrieval benchmarks, e.g., 8.2% R@1 for video-to-text on MSRVTT, 12.2% R@1 for video-to-text on DiDeMo, and 10.9% R@1 for video-to-text on ActivityNet, respectively.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6748-6760"},"PeriodicalIF":0.0000,"publicationDate":"2024-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10804082/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Large-scale pre-trained vision-language models (e.g., CLIP) have shown remarkable generalization performance on downstream tasks such as video-text retrieval (VTR). Traditional approaches leverage CLIP's robust multi-modal alignment for VTR by directly fine-tuning the vision and text encoders on clean video-text data. Yet these techniques rely on carefully annotated video-text pairs, which are expensive to obtain and require significant manual effort. In this context, we introduce a new approach, Pseudo-Supervised Selective Contrastive Learning (PS-SCL). PS-SCL minimizes the dependency on manually labeled text annotations by generating pseudo-supervision from unlabeled video data for training. We first exploit CLIP's visual recognition capabilities to generate pseudo-texts automatically. These pseudo-texts capture diverse visual concepts from the video and serve as weak textual guidance. Moreover, we introduce Selective Contrastive Learning (SeLeCT), which prioritizes and selects highly correlated video-text pairs from the pseudo-supervised pairs, enabling more effective multi-modal learning under weak pairing supervision. Experimental results demonstrate that our method outperforms CLIP's zero-shot performance by a large margin on multiple video-text retrieval benchmarks: +8.2% video-to-text R@1 on MSRVTT, +12.2% on DiDeMo, and +10.9% on ActivityNet.
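
The abstract describes the two components only at a high level. As a minimal sketch of one plausible reading, the PyTorch snippet below shows (a) building a pseudo-text by matching pooled CLIP frame embeddings against a concept vocabulary and (b) restricting an InfoNCE loss to reliably paired examples. The concept vocabulary, the similarity threshold, and every function name here are illustrative assumptions made for exposition, not the paper's actual implementation.

    import torch
    import torch.nn.functional as F

    def generate_pseudo_text(frame_feats, concept_feats, concepts, top_k=5):
        # frame_feats: (T, D) CLIP image embeddings for T sampled frames.
        # concept_feats: (C, D) CLIP text embeddings of a concept vocabulary
        # (the vocabulary itself is an assumption of this sketch).
        video_feat = F.normalize(frame_feats.mean(dim=0), dim=-1)   # (D,)
        sims = F.normalize(concept_feats, dim=-1) @ video_feat      # (C,)
        top = sims.topk(top_k).indices
        # Join the best-matching concepts into a weak pseudo-caption.
        return " ".join(concepts[i] for i in top)

    def selective_contrastive_loss(video_feats, text_feats,
                                   tau=0.07, threshold=0.25):
        # Keep only pseudo pairs whose video-text similarity clears a
        # threshold (a hypothetical stand-in for SeLeCT's selection rule),
        # then apply a standard InfoNCE loss over the kept pairs.
        v = F.normalize(video_feats, dim=-1)                        # (B, D)
        t = F.normalize(text_feats, dim=-1)                         # (B, D)
        sims = v @ t.T                                              # (B, B)
        keep = sims.diagonal() > threshold                          # reliable pairs
        if keep.sum() == 0:
            return sims.new_zeros(())                               # nothing reliable this batch
        logits = sims[keep] / tau                                   # (K, B)
        targets = keep.nonzero(as_tuple=True)[0]                    # matching text columns
        return F.cross_entropy(logits, targets)

In this reading, the pseudo-caption supplies the weak textual guidance the abstract mentions, and the similarity threshold plays the role of prioritizing highly correlated pairs; the actual PS-SCL selection criterion may well differ.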