A discriminative multi-modal adaptation neural network model for video action recognition

IF 6.3 1区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Neural Networks Pub Date : 2025-05-01 Epub Date: 2025-01-03 DOI:10.1016/j.neunet.2024.107114

Lei Gao, Kai Liu, Ling Guan

{"title":"A discriminative multi-modal adaptation neural network model for video action recognition","authors":"Lei Gao, Kai Liu, Ling Guan","doi":"10.1016/j.neunet.2024.107114","DOIUrl":null,"url":null,"abstract":"<div><div>Research on video-based understanding and learning has attracted widespread interest and has been adopted in various real applications, such as e-healthcare, action recognition, affective computing, to name a few. Amongst them, video-based action recognition is one of the most representative examples. With the advancement of multi-sensory technology, action recognition using multi-modal data has recently drawn wide attention. However, the research community faces new challenges in effectively exploring and utilizing the discriminative and complementary information across different modalities. Although score level fusion approaches have been popularly employed for multi-modal action recognition, they simply add the scores derived separately from different modalities without proper consideration of cross-modality semantics amongst multiple input data sources, invariably causing sub-optimal performance. To address this issue, this paper presents a two-stream heterogeneous network to extract and jointly process complementary features derived from RGB and skeleton modalities, respectively. Then, a discriminative multi-modal adaptation neural network model (DMANNM) is proposed and applied to the heterogeneous network, by integrating statistical machine learning (SML) principles with convolutional neural network (CNN) architecture. In addition, to achieve high recognition accuracy by the generated multi-modal structure, an effective nonlinear classification algorithm is presented in this work. Leveraging the joint strength of SML and CNN architecture, the proposed model forms an adaptive platform for handling datasets of different scales. To demonstrate the effectiveness and the generic nature of the proposed model, we conducted experiments on four popular video-based action recognition datasets with different scales: NTU RGB+D, NTU RGB+D 120, Northwestern-UCLA (N-UCLA), and SYSU. The experimental results show the superiority of the proposed method over state-of-the-art compared.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"185 ","pages":"Article 107114"},"PeriodicalIF":6.3000,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0893608024010438","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/3 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Research on video-based understanding and learning has attracted widespread interest and has been adopted in various real applications, such as e-healthcare, action recognition, affective computing, to name a few. Amongst them, video-based action recognition is one of the most representative examples. With the advancement of multi-sensory technology, action recognition using multi-modal data has recently drawn wide attention. However, the research community faces new challenges in effectively exploring and utilizing the discriminative and complementary information across different modalities. Although score level fusion approaches have been popularly employed for multi-modal action recognition, they simply add the scores derived separately from different modalities without proper consideration of cross-modality semantics amongst multiple input data sources, invariably causing sub-optimal performance. To address this issue, this paper presents a two-stream heterogeneous network to extract and jointly process complementary features derived from RGB and skeleton modalities, respectively. Then, a discriminative multi-modal adaptation neural network model (DMANNM) is proposed and applied to the heterogeneous network, by integrating statistical machine learning (SML) principles with convolutional neural network (CNN) architecture. In addition, to achieve high recognition accuracy by the generated multi-modal structure, an effective nonlinear classification algorithm is presented in this work. Leveraging the joint strength of SML and CNN architecture, the proposed model forms an adaptive platform for handling datasets of different scales. To demonstrate the effectiveness and the generic nature of the proposed model, we conducted experiments on four popular video-based action recognition datasets with different scales: NTU RGB+D, NTU RGB+D 120, Northwestern-UCLA (N-UCLA), and SYSU. The experimental results show the superiority of the proposed method over state-of-the-art compared.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

视频动作识别的判别多模态自适应神经网络模型。

基于视频的理解和学习的研究已经引起了广泛的兴趣，并已被应用于各种实际应用中，如电子医疗、动作识别、情感计算等。其中，基于视频的动作识别是最具代表性的例子之一。随着多感官技术的发展，基于多模态数据的动作识别近年来受到广泛关注。然而，如何有效地探索和利用不同模式下的区别性和互补性信息，面临着新的挑战。尽管分数水平融合方法已被广泛用于多模态动作识别，但它们只是简单地将不同模态分别得出的分数相加，而没有适当考虑多个输入数据源之间的跨模态语义，总是导致性能次优。为了解决这一问题，本文提出了一种两流异构网络，分别从RGB和骨架模式中提取和联合处理互补特征。然后，将统计机器学习（SML）原理与卷积神经网络（CNN）结构相结合，提出了一种判别式多模态自适应神经网络模型（DMANNM），并将其应用于异构网络。此外，为了利用生成的多模态结构达到较高的识别精度，本文提出了一种有效的非线性分类算法。该模型利用SML和CNN架构的联合优势，形成了一个处理不同规模数据集的自适应平台。为了证明该模型的有效性和通用性，我们在四个流行的基于视频的动作识别数据集上进行了实验，这些数据集具有不同的尺度：NTU RGB+D， NTU RGB+D 120，西北加州大学洛杉矶分校（N-UCLA）和SYSU。实验结果表明，该方法与现有方法相比具有优越性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Neural Networks 工程技术-计算机：人工智能

CiteScore

13.90

自引率

7.70%

发文量

425

审稿时长

67 days

期刊介绍： Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.