Generalized Pose Decoupled Network for Unsupervised 3D Skeleton Sequence-Based Action Representation Learning.

Cyborg and Bionic Systems (Washington, D.C.) | IF 10.5 | Q1 ENGINEERING, BIOMEDICAL | Pub Date: 2022-01-01 | DOI: 10.34133/cbsystems.0002
Mengyuan Liu, Fanyang Meng, Yongsheng Liang
{"title":"Generalized Pose Decoupled Network for Unsupervised 3D Skeleton Sequence-Based Action Representation Learning.","authors":"Mengyuan Liu,&nbsp;Fanyang Meng,&nbsp;Yongsheng Liang","doi":"10.34133/cbsystems.0002","DOIUrl":null,"url":null,"abstract":"<p><p>Human action representation is derived from the description of human shape and motion. The traditional unsupervised 3-dimensional (3D) human action representation learning method uses a recurrent neural network (RNN)-based autoencoder to reconstruct the input pose sequence and then takes the midlevel feature of the autoencoder as representation. Although RNN can implicitly learn a certain amount of motion information, the extracted representation mainly describes the human shape and is insufficient to describe motion information. Therefore, we first present a handcrafted motion feature called pose flow to guide the reconstruction of the autoencoder, whose midlevel feature is expected to describe motion information. The performance is limited as we observe that actions can be distinctive in either motion direction or motion norm. For example, we can distinguish \"sitting down\" and \"standing up\" from motion direction yet distinguish \"running\" and \"jogging\" from motion norm. In these cases, it is difficult to learn distinctive features from pose flow where direction and norm are mixed. To this end, we present an explicit pose decoupled flow network (PDF-E) to learn from direction and norm in a multi-task learning framework, where 1 encoder is used to generate representation and 2 decoders are used to generating direction and norm, respectively. Further, we use reconstructing the input pose sequence as an additional constraint and present a generalized PDF network (PDF-G) to learn both motion and shape information, which achieves state-of-the-art performances on large-scale and challenging 3D action recognition datasets including the NTU RGB+D 60 dataset and NTU RGB+D 120 dataset.</p>","PeriodicalId":72764,"journal":{"name":"Cyborg and bionic systems (Washington, D.C.)","volume":"2022 ","pages":"0002"},"PeriodicalIF":10.5000,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10076048/pdf/","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Cyborg and bionic systems (Washington, D.C.)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.34133/cbsystems.0002","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, BIOMEDICAL","Score":null,"Total":0}
Citations: 9

Abstract

Human action representation is derived from the description of human shape and motion. The traditional unsupervised 3-dimensional (3D) human action representation learning method uses a recurrent neural network (RNN)-based autoencoder to reconstruct the input pose sequence and then takes the midlevel feature of the autoencoder as the representation. Although an RNN can implicitly learn a certain amount of motion information, the extracted representation mainly describes human shape and is insufficient to describe motion. Therefore, we first present a handcrafted motion feature called pose flow to guide the reconstruction of the autoencoder, whose midlevel feature is expected to describe motion information. However, the performance of this approach is limited, as we observe that actions can be distinctive in either motion direction or motion norm. For example, we can distinguish "sitting down" from "standing up" by motion direction, yet distinguish "running" from "jogging" by motion norm. In these cases, it is difficult to learn distinctive features from pose flow, where direction and norm are mixed. To this end, we present an explicit pose decoupled flow network (PDF-E) to learn from direction and norm in a multi-task learning framework, where one encoder is used to generate the representation and two decoders are used to generate direction and norm, respectively. Further, we use reconstruction of the input pose sequence as an additional constraint and present a generalized PDF network (PDF-G) to learn both motion and shape information, which achieves state-of-the-art performance on large-scale and challenging 3D action recognition datasets, including the NTU RGB+D 60 and NTU RGB+D 120 datasets.
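To make the decoupling concrete, the following is a minimal sketch, in PyTorch, of the idea described above: pose flow is split into a per-joint unit direction and a scalar norm, and a single encoder feeds separate decoders for direction, norm, and (in the generalized variant) the input pose sequence. The module name PDFSketch, the GRU layers, the layer sizes, and the summed mean-squared losses are illustrative assumptions, not the authors' actual PDF-E/PDF-G implementation.

```python
import torch
import torch.nn as nn


def pose_flow(poses):
    """Decouple pose flow into direction and norm.

    poses: tensor of shape (..., T, J, 3) holding 3D joint coordinates.
    Pose flow is the frame-to-frame joint displacement; it is split into
    a per-joint unit direction and a per-joint scalar norm.
    """
    flow = poses[..., 1:, :, :] - poses[..., :-1, :, :]   # (..., T-1, J, 3)
    norm = flow.norm(dim=-1, keepdim=True)                # (..., T-1, J, 1)
    direction = flow / norm.clamp(min=1e-6)               # unit direction vectors
    return direction, norm


class PDFSketch(nn.Module):
    """One GRU encoder and three GRU decoders: direction, norm, and the
    input pose sequence (the pose decoder corresponds to the generalized,
    shape-aware variant described in the abstract). Hypothetical sizes."""

    def __init__(self, joints=25, hidden=512):
        super().__init__()
        d_pose = joints * 3
        self.encoder = nn.GRU(d_pose, hidden, batch_first=True)
        self.dec_direction = nn.GRU(hidden, d_pose, batch_first=True)
        self.dec_norm = nn.GRU(hidden, joints, batch_first=True)
        self.dec_pose = nn.GRU(hidden, d_pose, batch_first=True)

    def forward(self, poses):                    # poses: (B, T, J, 3)
        B, T, J, _ = poses.shape
        feat, _ = self.encoder(poses.reshape(B, T, J * 3))
        dir_hat, _ = self.dec_direction(feat)    # (B, T, J*3)
        norm_hat, _ = self.dec_norm(feat)        # (B, T, J)
        pose_hat, _ = self.dec_pose(feat)        # (B, T, J*3)
        # feat is the midlevel feature taken as the action representation.
        return (feat,
                dir_hat.reshape(B, T, J, 3),
                norm_hat,
                pose_hat.reshape(B, T, J, 3))


# Training sketch on a toy batch: decoder outputs are truncated to T-1
# frames so they align with the flow targets; the three reconstruction
# losses are simply summed (an assumption, not the paper's exact loss).
poses = torch.randn(4, 30, 25, 3)                # 4 clips, 30 frames, 25 joints
direction, norm = pose_flow(poses)
model = PDFSketch()
feat, dir_hat, norm_hat, pose_hat = model(poses)
loss = ((dir_hat[:, :-1] - direction).pow(2).mean()
        + (norm_hat[:, :-1] - norm.squeeze(-1)).pow(2).mean()
        + (pose_hat - poses).pow(2).mean())
loss.backward()
```

In this sketch, the encoder's per-frame output plays the role of the midlevel feature used as the action representation, and the three decoders are trained jointly in the multi-task manner the abstract describes. A real implementation would likely add linear projection heads after each decoder rather than predicting targets directly from GRU states.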

Source journal: Cyborg and Bionic Systems (Washington, D.C.)
CiteScore: 7.70
Self-citation rate: 0.00%
Articles published: 0
Review time: 21 weeks
Latest articles from this journal:
Multi-Section Magnetic Soft Robot with Multirobot Navigation System for Vasculature Intervention.
Advances in Biointegrated Wearable and Implantable Optoelectronic Devices for Cardiac Healthcare.
Sensors and Devices Guided by Artificial Intelligence for Personalized Pain Medicine.
Modeling Grid Cell Distortions with a Grid Cell Calibration Mechanism.
Federated Abnormal Heart Sound Detection with Weak to No Labels.