Dynamic Semantic-Based Spatial-Temporal Graph Convolution Network for Skeleton-Based Human Action Recognition

Jianyang Xie, Yanda Meng, Yitian Zhao, Anh Nguyen, Xiaoyun Yang, Yalin Zheng
Journal: IEEE Transactions on Image Processing, vol. 33, pp. 6691-6704
DOI: 10.1109/TIP.2024.3497837
Published: 2024-11-19
Article page: https://ieeexplore.ieee.org/document/10758404/
Code: https://github.com/davelailai/DS-STGCN

Abstract

Human action recognition is an essential topic in computer vision and image processing. Graph convolutional networks (GCNs) have attracted significant attention and achieved noteworthy performance in skeleton-based human action recognition. However, most previous graph-based methods refine the skeleton topology without considering the types of the different joints and edges or the order in which frames occur, which leaves them unable to represent intrinsic semantic information. To address this, we propose a dynamic semantic-based spatial-temporal graph convolution network (DS-STGCN) with two dynamic semantic modules, one spatial and one temporal. The spatial module implicitly encodes joint and edge types, while the temporal module implicitly encodes the occurrence order of frames. Extensive experiments on four datasets (NTU-RGB+D 60, NTU-RGB+D 120, Kinetics-400, and FineGYM) show that the two semantic modules bring consistent recognition improvements across various backbones, and that DS-STGCN surpasses state-of-the-art methods on these datasets. Notably, on the more challenging Kinetics-400 dataset, our model outperforms other state-of-the-art GCN-based methods by a large margin. The code has been released at https://github.com/davelailai/DS-STGCN.
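To make the idea of "semantic" cues concrete: the abstract says joint/edge types are encoded in the spatial module and frame order in the temporal module. The sketch below is purely illustrative and is not the authors' implementation (see the released code for that). It assumes a simple scheme in which one-hot joint-type embeddings and a normalized frame-index channel are appended to the joint features before a single row-normalized graph convolution; the function name, signature, and encoding choices are all hypothetical.

```python
import numpy as np

def semantic_gcn_layer(x, adj, joint_types, num_types, w):
    """Illustrative sketch: one graph-convolution step with semantic cues.

    x           : (T, V, C) features for V joints over T frames.
    adj         : (V, V) skeleton adjacency (with self-loops).
    joint_types : length-V integer labels (e.g. 0 = limb, 1 = torso).
    w           : (C + num_types + 1, D) learnable projection.
    """
    T, V, C = x.shape
    # Spatial semantics: one-hot joint-type encoding, shared across frames.
    type_onehot = np.eye(num_types)[joint_types]                 # (V, num_types)
    type_feat = np.broadcast_to(type_onehot, (T, V, num_types))
    # Temporal semantics: normalized frame-order encoding, shared across joints.
    frame_order = np.arange(T, dtype=float)[:, None, None] / max(T - 1, 1)
    frame_feat = np.broadcast_to(frame_order, (T, V, 1))
    # Concatenate raw features with both semantic encodings.
    h = np.concatenate([x, type_feat, frame_feat], axis=-1)      # (T, V, C+num_types+1)
    # Row-normalized adjacency, then aggregate neighbors and project.
    a_hat = adj / adj.sum(axis=1, keepdims=True)
    return np.einsum('uv,tvc,cd->tud', a_hat, h, w)              # (T, V, D)
```

In a real model these encodings would be learnable embeddings rather than fixed one-hot/linear codes, and the convolution would be stacked with nonlinearities; the point here is only how type and order information can enter the graph features.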