Shaocan Liu;Xingtao Wang;Ruiqin Xiong;Xiaopeng Fan
Title: GCN-Based Multi-Modality Fusion Network for Action Recognition
Journal: IEEE Transactions on Multimedia, vol. 27, pp. 1242-1253
Publication date: 2024-12-25
DOI: 10.1109/TMM.2024.3521749
URL: https://ieeexplore.ieee.org/document/10814090/
Citations: 0
Abstract
Owing to its remarkable power to represent structured data, the Graph Convolutional Network (GCN) has been extensively adopted for skeleton-based action recognition in recent years. However, a GCN is designed to operate on the irregular graphs formed by skeletons, which makes it difficult to apply directly to other modalities represented on regular grids. Thus, although existing works have demonstrated the necessity of multi-modality fusion, few methods in the literature explore fusing skeleton data with other modalities within a GCN architecture. In this paper, we present a novel GCN-based framework, termed the GCN-based Multi-modality Fusion Network (GMFNet), to efficiently exploit the complementary information in RGB and skeleton data. GMFNet is constructed by connecting a main stream with a GCN-based multi-modality fusion module (GMFM), whose goal is to gradually combine the fine and coarse action-related information extracted from skeletons and RGB videos, respectively. Specifically, a cross-modality data mapping method is designed to transform an RGB video into a $\mathit{skeleton-like}$ (SL) sequence, which is then integrated with the skeleton sequence under a gradual fusion scheme in the GMFM. The fusion results are fed into the subsequent main stream to extract more discriminative features and produce the final prediction. In addition, a spatio-temporal joint attention mechanism is introduced for more accurate action recognition. Compared with multi-stream approaches, GMFNet can be trained within a single end-to-end pipeline, which reduces the training complexity. Experimental results show that the proposed GMFNet achieves impressive performance on two large-scale datasets, NTU RGB+D 60 and NTU RGB+D 120.
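To make the idea of fusing a skeleton-like sequence with a skeleton sequence inside a GCN more concrete, the following is a minimal sketch of one graph convolution applied to channel-wise concatenated features. It is not the authors' GMFM: the abstract does not specify the mapping or the gradual fusion scheme, so the concatenation in `fuse_by_concat`, the toy three-joint skeleton, the feature values, and all function names are assumptions made purely for illustration.

```python
import math

def normalized_adjacency(num_joints, bones):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected skeleton graph."""
    # Start from the identity so every joint has a self-loop.
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)]
         for i in range(num_joints)]
    for i, j in bones:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
            for i in range(num_joints)]

def matmul(x, y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]

def graph_conv(a_hat, features, weight):
    """One graph convolution layer: ReLU(A_hat @ X @ W)."""
    out = matmul(matmul(a_hat, features), weight)
    return [[max(0.0, v) for v in row] for row in out]

def fuse_by_concat(skeleton_feats, sl_feats):
    """Assumed fusion: concatenate the per-joint channels of both modalities."""
    return [s + r for s, r in zip(skeleton_feats, sl_feats)]

# Toy example: 3 joints in a chain (0-1-2), a single frame.
bones = [(0, 1), (1, 2)]
a_hat = normalized_adjacency(3, bones)

skeleton = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]  # 2-D joint coordinates
skeleton_like = [[0.5], [0.2], [0.9]]            # made-up per-joint RGB descriptor

fused = fuse_by_concat(skeleton, skeleton_like)  # 3 joints x 3 channels
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 3 input -> 2 output channels
output = graph_conv(a_hat, fused, weight)        # 3 joints x 2 channels
```

The point of the sketch is only that once RGB information is expressed per joint, it lives on the same graph as the skeleton and can pass through the same graph convolution, which is what allows a single end-to-end pipeline rather than separate streams.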
About the journal:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.