Multi-Scale Adaptive Skeleton Transformer for action recognition
Xiaotian Wang, Kai Chen, Zhifu Zhao, Guangming Shi, Xuemei Xie, Xiang Jiang, Yifan Yang
Computer Vision and Image Understanding, Volume 250, Article 104229. Published 2024-11-19. DOI: 10.1016/j.cviu.2024.104229
https://www.sciencedirect.com/science/article/pii/S1077314224003102
Abstract
Transformers have demonstrated remarkable performance in various computer vision tasks, but their potential has not been fully explored in skeleton-based action recognition. On one hand, existing methods primarily use a fixed function or a pre-learned matrix to encode position information and overlook sample-specific position information. On the other hand, these approaches focus on single-scale spatial relationships and neglect discriminative fine-grained and coarse-grained spatial features. To address these issues, we propose a Multi-Scale Adaptive Skeleton Transformer (MSAST), which comprises an Adaptive Skeleton Position Encoding Module (ASPEM), a Multi-Scale Embedding Module (MSEM), and an Adaptive Relative Location Module (ARLM). ASPEM decouples spatial–temporal information in the position encoding procedure to capture the inherent dependencies of skeleton sequences. It is also designed to depend on the input tokens, so it can learn sample-specific position information. MSEM employs multi-scale pooling to generate multi-scale tokens that contain multi-grained features; the spatial transformer then captures multi-scale relations to distinguish subtle differences between actions. ARLM is further introduced to mine suitable location information for better recognition performance. Extensive experiments on three benchmark datasets show that the proposed model achieves Top-1 accuracy of 94.9%/97.5% on NTU-60 C-Sub/C-View, 88.7%/91.6% on NTU-120 X-Sub/X-Set, and 97.4% on NW-UCLA.
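The multi-scale pooling idea behind MSEM can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how joints are grouped, which pooling operator is used, or which scales are chosen. The sketch assumes average pooling over non-overlapping groups of consecutive joints, and the names `average_pool`, `multi_scale_tokens`, and the scales `(1, 2, 4)` are hypothetical.

```python
from typing import List, Sequence

Token = List[float]  # one feature vector per joint (or per pooled joint group)


def average_pool(tokens: Sequence[Token], window: int) -> List[Token]:
    """Average non-overlapping groups of `window` consecutive tokens.

    Assumption: joints are grouped by index order. A real model might
    instead group them by body part.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    pooled: List[Token] = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        channels = len(group[0])
        pooled.append(
            [sum(t[c] for t in group) / len(group) for c in range(channels)]
        )
    return pooled


def multi_scale_tokens(
    tokens: Sequence[Token], scales: Sequence[int] = (1, 2, 4)
) -> List[Token]:
    """Concatenate tokens pooled at each scale (fine to coarse)."""
    out: List[Token] = []
    for scale in scales:
        out.extend(average_pool(tokens, scale))
    return out


# Toy input: 8 joints with 2 feature channels each.
joints = [[float(j), float(j) * 2.0] for j in range(8)]
tokens = multi_scale_tokens(joints)
print(len(tokens))  # 8 fine + 4 medium + 2 coarse tokens
```

A spatial attention layer applied to the concatenated list can then relate individual joints (scale 1) to coarser joint groups (scales 2 and 4) in a single pass, which is one plausible reading of "multi-scale relations" in the abstract.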
About the journal:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems