How and What to Learn: Taxonomizing Self-Supervised Learning for 3D Action Recognition
Amor Ben Tanfous, Aimen Zerroug, Drew A. Linsley, Thomas Serre
2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/WACV51458.2022.00294
Citations: 9
Abstract
There are two competing standards for self-supervised learning in action recognition from 3D skeletons. Su et al., 2020 [31] used an auto-encoder architecture and an image reconstruction objective function to achieve state-of-the-art performance on the NTU60 C-View benchmark. Rao et al., 2020 [23] used contrastive learning in the latent space to achieve state-of-the-art performance on the NTU60 C-Sub benchmark. Here, we reconcile these disparate approaches by developing a taxonomy of self-supervised learning for action recognition. We observe that leading approaches generally use one of two types of objective functions: those that seek to reconstruct the input from a latent representation ("Attractive" learning) versus those that also try to maximize the representation's distinctiveness ("Contrastive" learning). Independently, leading approaches also differ in how they implement these objective functions: there are those that optimize representations in the decoder output space and those that optimize representations in the network's latent space (encoder output). We find that combining these approaches leads to larger gains in performance and tolerance to transformations than is achievable by any individual method, yielding state-of-the-art performance on three standard action recognition datasets. We include links to our code and data.
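To make the taxonomy's two axes concrete, the sketch below pairs an "Attractive" reconstruction loss, applied in the decoder output space, with a "Contrastive" InfoNCE-style loss, applied in the latent space, and sums them into one training objective. This is a minimal illustration under assumed toy dimensions, not the authors' implementation: the module names, layer sizes, sequence shape, and augmentation are all hypothetical placeholders.

```python
# Illustrative sketch (not the paper's code): the two objective types and the
# two spaces in which they can be applied, on toy skeleton sequences.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps a flattened skeleton sequence to a latent vector z (encoder output space)."""
    def __init__(self, in_dim=50 * 75, latent_dim=128):  # hypothetical: 50 frames x 75 joint coords
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, latent_dim))

    def forward(self, x):
        return self.net(x.flatten(1))

class Decoder(nn.Module):
    """Reconstructs the input sequence from z (decoder output space)."""
    def __init__(self, out_dim=50 * 75, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                 nn.Linear(512, out_dim))

    def forward(self, z):
        return self.net(z)

def attractive_loss(decoder, z, x):
    # "Attractive" objective in the decoder output space:
    # pull the reconstruction toward the original input.
    return F.mse_loss(decoder(z), x.flatten(1))

def contrastive_loss(z1, z2, temperature=0.1):
    # "Contrastive" (InfoNCE-style) objective in the latent space:
    # two views of the same sequence are positives; every other
    # sequence in the batch serves as a negative.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))       # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    enc, dec = Encoder(), Decoder()
    x = torch.randn(8, 50, 75)                # batch of 8 toy sequences
    x_aug = x + 0.01 * torch.randn_like(x)    # stand-in for a real skeleton augmentation
    z, z_aug = enc(x), enc(x_aug)
    # Combining both objective types, in the spirit of the paper's finding
    # that the combination outperforms either objective alone:
    loss = attractive_loss(dec, z, x) + contrastive_loss(z, z_aug)
    loss.backward()
    print(f"combined loss: {loss.item():.4f}")
```

In this framing, Su et al. [31] correspond to the attractive term alone (reconstruction in the decoder output space) and Rao et al. [23] to the contrastive term alone (InfoNCE in the latent space); the abstract's claim is that summing objectives from both cells of the taxonomy yields the larger gains.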