{"title":"Local and global self-attention enhanced graph convolutional network for skeleton-based action recognition","authors":"","doi":"10.1016/j.patcog.2024.111106","DOIUrl":null,"url":null,"abstract":"<div><div>The current successful paradigm for skeleton-based action recognition is the combination of Graph Convolutional Networks (GCNs) modeling spatial correlations, and Temporal Convolution Networks (TCNs), extracting motion features. Such GCN-TCN-based approaches usually rely on local graph convolution operations, which limits their ability to capture complicated correlations among distant joints, as well as represent long-range dependencies. Although the self-attention originated from Transformers shows great potential in correlation modeling of global joints, the Transformer-based methods are usually computationally expensive and ignore the physical connectivity structure of the human skeleton. To address these issues, we propose a novel Local-Global Self-Attention Enhanced Graph Convolutional Network (LG-SGNet) to simultaneously learn both local and global representations in the spatial–temporal dimension. Our approach consists of three components: The Local-Global Graph Convolutional Network (LG-GCN) module extracts local and global spatial feature representations by parallel channel-specific global and local spatial modeling. The Local-Global Temporal Convolutional Network (LG-TCN) module performs a joint-wise global temporal modeling using multi-head self-attention in parallel with local temporal modeling. This constitutes a new multi-branch temporal convolution structure that effectively captures both long-range dependencies and subtle temporal structures. Finally, the Dynamic Frame Weighting Module (DFWM) adjusts the weights of skeleton action sequence frames, allowing the model to adaptively focus on the features of representative frames for more efficient action recognition. Extensive experiments demonstrate that our LG-SGNet performs very competitively compared to the state-of-the-art methods. Our project website is available at <span><span>https://github.com/DingYyue/LG-SGNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5000,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320324008574","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
The current successful paradigm for skeleton-based action recognition combines Graph Convolutional Networks (GCNs), which model spatial correlations, with Temporal Convolutional Networks (TCNs), which extract motion features. Such GCN-TCN approaches usually rely on local graph convolution operations, which limits their ability to capture complex correlations among distant joints and to represent long-range dependencies. Although the self-attention mechanism originating in Transformers shows great potential for modeling correlations among global joints, Transformer-based methods are usually computationally expensive and ignore the physical connectivity structure of the human skeleton. To address these issues, we propose a novel Local-Global Self-Attention Enhanced Graph Convolutional Network (LG-SGNet) that simultaneously learns local and global representations in both the spatial and temporal dimensions. Our approach consists of three components. The Local-Global Graph Convolutional Network (LG-GCN) module extracts local and global spatial feature representations through parallel channel-specific global and local spatial modeling. The Local-Global Temporal Convolutional Network (LG-TCN) module performs joint-wise global temporal modeling using multi-head self-attention in parallel with local temporal modeling, forming a new multi-branch temporal convolution structure that effectively captures both long-range dependencies and subtle temporal structures. Finally, the Dynamic Frame Weighting Module (DFWM) adjusts the weights of frames in a skeleton action sequence, allowing the model to adaptively focus on the features of representative frames for more efficient action recognition. Extensive experiments demonstrate that LG-SGNet performs competitively against state-of-the-art methods. Our project website is available at https://github.com/DingYyue/LG-SGNet.
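To make the abstract's two temporal ideas concrete, here is a minimal PyTorch sketch. It is not the authors' released code (see their repository for that); the class names, layer sizes, and fusion wiring below are assumptions. It illustrates a block that runs a local temporal convolution in parallel with joint-wise multi-head self-attention over frames, and a frame-weighting module that rescales frames by a learned importance score.

```python
# Illustrative sketch only: names, sizes, and wiring are assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn


class LocalGlobalTemporalBlock(nn.Module):
    """Parallel local temporal convolution + joint-wise global self-attention.

    Input/output shape: (N, C, T, V) = batch, channels, frames, joints.
    """

    def __init__(self, channels: int, num_heads: int = 4, kernel_size: int = 9):
        super().__init__()
        pad = (kernel_size - 1) // 2
        # Local branch: convolution along the frame axis T, per joint.
        self.local = nn.Conv2d(channels, channels, (kernel_size, 1), padding=(pad, 0))
        # Global branch: multi-head self-attention along T, applied
        # independently to each joint's feature sequence ("joint-wise").
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, v = x.shape
        local = self.local(x)
        # Treat each (sample, joint) pair as one length-T sequence.
        seq = x.permute(0, 3, 2, 1).reshape(n * v, t, c)
        glob, _ = self.attn(seq, seq, seq)
        glob = glob.reshape(n, v, t, c).permute(0, 3, 2, 1)
        return self.norm(local + glob)  # fuse the two parallel branches


class DynamicFrameWeighting(nn.Module):
    """Learn a per-frame importance weight and rescale the sequence."""

    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Linear(channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        frame = x.mean(dim=3).permute(0, 2, 1)       # (N, T, C): pool over joints
        w = torch.softmax(self.score(frame), dim=1)  # (N, T, 1): frame weights
        return x * w.permute(0, 2, 1).unsqueeze(-1)  # broadcast over C and V


if __name__ == "__main__":
    feats = torch.randn(2, 64, 50, 25)  # e.g. NTU-style: 25 joints, 50 frames
    out = DynamicFrameWeighting(64)(LocalGlobalTemporalBlock(64)(feats))
    print(out.shape)  # torch.Size([2, 64, 50, 25])
```

Under these assumptions, the additive fusion of the convolutional and attention branches lets the block keep the locality bias of a TCN while still attending across the full frame range, which is the trade-off the abstract attributes to the LG-TCN design.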
Journal introduction:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.