Collaboratively Self-Supervised Video Representation Learning for Action Recognition

Impact factor: 8.0; Zone 1, Computer Science; JCR Q1, Computer Science, Theory & Methods; IEEE Transactions on Information Forensics and Security; Pub Date: 2025-01-20; DOI: 10.1109/TIFS.2025.3531772

Jie Zhang;Zhifan Wan;Lanqing Hu;Stephen Lin;Shuzhe Wu;Shiguang Shan

IEEE Transactions on Information Forensics and Security, vol. 20, pp. 1895-1907, 2025. https://ieeexplore.ieee.org/document/10847948/

Citations: 0

Abstract

Considering the close connection between action recognition and human pose estimation, we design a Collaboratively Self-supervised Video Representation (CSVR) learning framework specific to action recognition that jointly uses generative pose prediction and discriminative context matching as pretext tasks. Specifically, CSVR consists of three branches: a generative pose prediction branch, a discriminative context matching branch, and a video generating branch. The first branch encodes dynamic motion features by using a conditional GAN to predict the human poses of future frames. The second branch extracts static context features by contrasting positive and negative pairs of video features and I-frame features. The third branch generates both current and future video frames in order to collaboratively improve the dynamic motion features and the static context features. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple popular video datasets.
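The abstract does not give the loss used by the context matching branch. As a rough illustration only, the sketch below assumes an InfoNCE-style contrastive objective: the video feature and I-frame feature from the same video form the positive pair, and I-frame features from the other videos in the batch serve as negatives. The function names, the cosine similarity, and the temperature value are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(video_feats, iframe_feats, temperature=0.1):
    """Mean InfoNCE loss over a batch (assumed objective, not the paper's).

    video_feats[i] and iframe_feats[i] come from the same video (positive
    pair); every other iframe_feats[j] acts as a negative for video i.
    """
    losses = []
    for i, v in enumerate(video_feats):
        logits = [cosine(v, f) / temperature for f in iframe_feats]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the matching I-frame.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D features: two videos with clearly different contexts.
video = [[1.0, 0.0], [0.0, 1.0]]
matched = info_nce(video, [[0.9, 0.1], [0.1, 0.9]])   # I-frames aligned
shuffled = info_nce(video, [[0.1, 0.9], [0.9, 0.1]])  # I-frames swapped
```

In this toy setup the loss is small when each video feature is closest to its own I-frame feature and large when the I-frames are swapped, which is the behavior a context matching pretext task of this kind would rely on.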


Source journal

IEEE Transactions on Information Forensics and Security (Engineering & Technology; Engineering: Electrical and Electronic)

CiteScore: 14.40

Self-citation rate: 7.40%

Articles published: 234

Review time: 6.5 months

Journal description: The IEEE Transactions on Information Forensics and Security covers the sciences, technologies, and applications relating to information forensics, information security, biometrics, surveillance, and systems applications that incorporate these features.