A discriminative multi-modal adaptation neural network model for video action recognition
Authors: Lei Gao, Kai Liu, Ling Guan
Journal: Neural Networks, volume 185, pages 107114 (Journal Article)
Publication date: 2025-01-03
DOI: 10.1016/j.neunet.2024.107114 (https://doi.org/10.1016/j.neunet.2024.107114)
Impact factor: 6.0; JCR: Q1 (Computer Science, Artificial Intelligence)
Citations: 0
Abstract
Research on video-based understanding and learning has attracted widespread interest and has been adopted in various real-world applications, such as e-healthcare, action recognition, and affective computing. Among them, video-based action recognition is one of the most representative examples. With the advancement of multi-sensory technology, action recognition using multi-modal data has recently drawn wide attention. However, the research community faces new challenges in effectively exploring and utilizing the discriminative and complementary information across different modalities. Although score-level fusion approaches are widely used for multi-modal action recognition, they simply add the scores derived separately from each modality without properly considering the cross-modality semantics among the input data sources, which leads to sub-optimal performance. To address this issue, this paper presents a two-stream heterogeneous network that extracts and jointly processes complementary features derived from the RGB and skeleton modalities. Then, a discriminative multi-modal adaptation neural network model (DMANNM) is proposed and applied to the heterogeneous network by integrating statistical machine learning (SML) principles with a convolutional neural network (CNN) architecture. In addition, to achieve high recognition accuracy with the resulting multi-modal structure, an effective nonlinear classification algorithm is presented. Leveraging the joint strength of SML and the CNN architecture, the proposed model forms an adaptive platform for handling datasets of different scales. To demonstrate the effectiveness and generic nature of the proposed model, we conducted experiments on four popular video-based action recognition datasets of different scales: NTU RGB+D, NTU RGB+D 120, Northwestern-UCLA (N-UCLA), and SYSU. The experimental results show that the proposed method outperforms the state-of-the-art methods it was compared against.
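For illustration, the score-level fusion baseline that the abstract criticizes can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' DMANNM or their code: the function names, the choice of softmax to turn logits into scores, and the logit values are all hypothetical. It shows that each stream is scored independently and the scores are only added at the end, so no cross-modality information is shared before the decision.

```python
import math


def softmax(logits):
    """Convert raw per-class logits into probabilities."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_level_fusion(rgb_logits, skeleton_logits):
    """Add per-class scores from two independently processed streams.

    Hypothetical helper: it assumes each stream already outputs one
    logit per action class, in the same class order.
    """
    rgb_scores = softmax(rgb_logits)
    skeleton_scores = softmax(skeleton_logits)
    # The streams interact only here, through a plain element-wise sum.
    fused = [r + s for r, s in zip(rgb_scores, skeleton_scores)]
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return fused, predicted


# Made-up logits for three action classes (not data from the paper).
rgb_logits = [2.0, 1.0, 0.1]
skeleton_logits = [0.5, 2.5, 0.3]
fused, predicted = score_level_fusion(rgb_logits, skeleton_logits)
print(predicted, [round(s, 3) for s in fused])
```

In this toy input the RGB stream favors class 0 and the skeleton stream favors class 1; the sum simply follows whichever stream is more confident, which is the limitation the abstract attributes to score-level fusion.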
About the journal:
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.