MANet: motion-aware network for video action recognition
Authors: Xiaoyang Li, Wenzhu Yang, Kanglin Wang, Tiebiao Wang, Chen Zhang
Journal: Complex & Intelligent Systems (Q1, Computer Science, Artificial Intelligence; Impact Factor 5.0)
DOI: 10.1007/s40747-024-01774-9
Publication date: 2025-02-06
Citations: 0
Abstract
Video action recognition is a fundamental task in video understanding. Actions in videos vary in speed and scale, so features extracted at a single spatio-temporal scale struggle to cover this variety. To address this problem, we propose a Motion-Aware Network (MANet) with three key modules: (1) a Local Motion Encoding Module (LMEM) that captures local motion features, (2) a Spatio-Temporal Excitation Module (STEM) that extracts multi-granular motion information, and (3) a Multiple Temporal Aggregation Module (MTAM) that models multi-scale temporal information. Equipped with these modules, MANet captures multi-granularity spatio-temporal cues. We conducted extensive experiments on five mainstream datasets: Something-Something V1 and V2, Jester, Diving48, and UCF-101. MANet achieves competitive accuracy on Something-Something V1 (52.5%), Something-Something V2 (63.6%), Jester (95.9%), Diving48 (81.8%), and UCF-101 (86.2%). In addition, we visualize MANet's feature representations with Grad-CAM to validate its effectiveness.
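The abstract does not include code, but the multi-scale temporal modeling idea behind a module like MTAM can be illustrated with a small sketch. The function below is a hypothetical illustration, not the authors' implementation: it average-pools per-frame features over temporal windows of several sizes and concatenates the per-scale summaries, so that both fast actions (small windows) and slow actions (large windows) are represented.

```python
import numpy as np

def multi_scale_temporal_aggregation(features, scales=(1, 2, 4)):
    """Toy sketch of multi-scale temporal aggregation (not the MANet code).

    features: array of shape (T, C) with one C-dim feature vector per frame.
    scales:   temporal window sizes; each scale average-pools the frames
              over non-overlapping windows of that length.
    Returns a single (len(scales) * C,) vector combining all scales.
    """
    T, C = features.shape
    pooled = []
    for s in scales:
        # Zero-pad so the frame count is divisible by the window size s.
        pad = (-T) % s
        padded = np.concatenate([features, np.zeros((pad, C))]) if pad else features
        # Average within each length-s window, then across windows,
        # giving one C-dim temporal summary per scale.
        windows = padded.reshape(-1, s, C).mean(axis=1)
        pooled.append(windows.mean(axis=0))
    return np.concatenate(pooled)

frames = np.random.rand(8, 16)   # 8 frames, 16-dim features per frame
vec = multi_scale_temporal_aggregation(frames)
print(vec.shape)                 # (48,) = 3 scales x 16 channels
```

In a real network this pooling would operate on learned feature maps inside the backbone and the scales would typically be fused by learned layers rather than simple concatenation; the sketch only conveys why multiple temporal granularities help.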
Journal introduction:
Complex & Intelligent Systems aims to provide a forum for presenting and discussing novel approaches, tools, and techniques intended to foster cross-fertilization between the broad fields of complex systems, computational simulation, and intelligent analytics and visualization. The transdisciplinary research that the journal focuses on will expand the boundaries of our understanding by investigating the principles and processes that underlie many of the most profound problems facing society today.