{"title":"Exploiting Unlabeled Videos for Video-Text Retrieval via Pseudo-Supervised Learning","authors":"Yu Lu;Ruijie Quan;Linchao Zhu;Yi Yang","doi":"10.1109/TIP.2024.3514352","DOIUrl":null,"url":null,"abstract":"Large-scale pre-trained vision-language models (e.g., CLIP) have shown incredible generalization performance in downstream tasks such as video-text retrieval (VTR). Traditional approaches have leveraged CLIP’s robust multi-modal alignment ability for VTR by directly fine-tuning vision and text encoders with clean video-text data. Yet, these techniques rely on carefully annotated video-text pairs, which are expensive and require significant manual effort. In this context, we introduce a new approach, Pseudo-Supervised Selective Contrastive Learning (PS-SCL). PS-SCL minimizes the dependency on manually-labeled text annotations by generating pseudo-supervisions from unlabeled video data for training. We first exploit CLIP’s visual recognition capabilities to generate pseudo-texts automatically. These pseudo-texts contain diverse visual concepts from the video and serve as weak textual guidance. Moreover, we introduce Selective Contrastive Learning (SeLeCT), which prioritizes and selects highly correlated video-text pairs from pseudo-supervised video-text pairs. By doing so, SeLeCT enables more effective multi-modal learning under weak pairing supervision. Experimental results demonstrate that our method outperforms CLIP zero-shot performance by a large margin on multiple video-text retrieval benchmarks, e.g., 8.2% R@1 for video-to-text on MSRVTT, 12.2% R@1 for video-to-text on DiDeMo, and 10.9% R@1 for video-to-text on ActivityNet, respectively.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6748-6760"},"PeriodicalIF":0.0000,"publicationDate":"2024-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10804082/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Large-scale pre-trained vision-language models (e.g., CLIP) have shown impressive generalization performance on downstream tasks such as video-text retrieval (VTR). Traditional approaches leverage CLIP’s strong multi-modal alignment ability for VTR by directly fine-tuning the vision and text encoders with clean video-text data. However, these techniques rely on carefully annotated video-text pairs, which are expensive to obtain and require significant manual effort. In this context, we introduce a new approach, Pseudo-Supervised Selective Contrastive Learning (PS-SCL). PS-SCL minimizes the dependency on manually labeled text annotations by generating pseudo-supervision from unlabeled video data for training. We first exploit CLIP’s visual recognition capabilities to generate pseudo-texts automatically; these pseudo-texts capture diverse visual concepts from the video and serve as weak textual guidance. Moreover, we introduce Selective Contrastive Learning (SeLeCT), which prioritizes and selects highly correlated video-text pairs from the pseudo-supervised pairs, enabling more effective multi-modal learning under weak pairing supervision. Experimental results demonstrate that our method outperforms CLIP’s zero-shot performance by a large margin on multiple video-text retrieval benchmarks, e.g., by 8.2% R@1 for video-to-text retrieval on MSRVTT, 12.2% on DiDeMo, and 10.9% on ActivityNet.
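
To make the two ideas summarized in the abstract concrete, below is a minimal sketch of (1) forming a pseudo-text for each unlabeled video by matching its frames against a concept vocabulary in CLIP's embedding space, and (2) a selective contrastive loss that keeps only the most strongly correlated video/pseudo-text pairs in each batch. This is an illustrative approximation, not the authors' implementation: the tensor shapes, the concept vocabulary, the top-k value, the mean-pooled pseudo-text embedding, and the keep_ratio selection rule are all assumptions made for the sketch, and the embeddings are assumed to be pre-extracted with a CLIP encoder.

```python
# Illustrative sketch of pseudo-text construction and selective contrastive
# learning, operating on pre-extracted CLIP embeddings. All specifics
# (shapes, vocabulary, top_k, keep_ratio) are assumptions for illustration.
import torch
import torch.nn.functional as F


def build_pseudo_text_indices(frame_emb, concept_emb, top_k=5):
    """Pick the top-k vocabulary concepts per video as a weak pseudo-text.

    frame_emb:   (B, T, D) CLIP image embeddings for T frames per video
    concept_emb: (C, D)    CLIP text embeddings of a concept vocabulary
    Returns indices of shape (B, top_k) into the concept vocabulary.
    """
    video_emb = F.normalize(frame_emb.mean(dim=1), dim=-1)   # (B, D) mean-pool frames
    concept_emb = F.normalize(concept_emb, dim=-1)           # (C, D)
    scores = video_emb @ concept_emb.t()                     # (B, C) cosine similarities
    return scores.topk(top_k, dim=-1).indices                # (B, top_k)


def selective_contrastive_loss(video_emb, pseudo_text_emb,
                               temperature=0.05, keep_ratio=0.5):
    """Symmetric InfoNCE over only the most correlated video/pseudo-text pairs.

    Pairs whose matched (diagonal) similarity falls in the bottom
    (1 - keep_ratio) of the batch are dropped, so noisy pseudo-supervision
    contributes less to the loss.
    """
    v = F.normalize(video_emb, dim=-1)          # (B, D)
    t = F.normalize(pseudo_text_emb, dim=-1)    # (B, D)
    logits = v @ t.t() / temperature            # (B, B)
    matched = logits.diag()                     # similarity of each paired example

    # Keep the top keep_ratio fraction of pairs by matched similarity.
    n_keep = max(1, int(keep_ratio * v.size(0)))
    keep = matched.topk(n_keep).indices

    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits[keep], targets[keep])
    loss_t2v = F.cross_entropy(logits.t()[keep], targets[keep])
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    B, T, C, D = 8, 4, 100, 512                 # toy sizes
    frame_emb = torch.randn(B, T, D)            # stand-in for CLIP frame features
    concept_emb = torch.randn(C, D)             # stand-in for concept-prompt features
    idx = build_pseudo_text_indices(frame_emb, concept_emb)
    # Stand-in pseudo-text embedding: mean of the selected concept embeddings.
    pseudo_text_emb = concept_emb[idx].mean(dim=1)
    loss = selective_contrastive_loss(frame_emb.mean(dim=1), pseudo_text_emb)
    print(idx.shape, loss.item())
```

In this toy version the selection rule is a simple top-fraction cut on matched similarity within the batch; the paper's SeLeCT module may rank and select pairs differently, so treat this only as a way to see where weakly paired, noisy supervision gets filtered before the contrastive objective is applied.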