一种用于动作识别的高效轻量级时空注意模块

Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition Pub Date : 2022-11-17 DOI:10.1145/3581807.3581810

Zhonghua Sun, Meng Dai, Ziwen Yi, Tianyi Wang, Jinchao Feng, Kebin Jia

{"title":"一种用于动作识别的高效轻量级时空注意模块","authors":"Zhonghua Sun, Meng Dai, Ziwen Yi, Tianyi Wang, Jinchao Feng, Kebin Jia","doi":"10.1145/3581807.3581810","DOIUrl":null,"url":null,"abstract":"Effective feature learning is one of the prime components for human action recognition algorithm. Three-dimensional convolutional neural network (3D CNN) can directly extract spatio-temporal features, however it is insufficient to capture the most discriminative part of the action video. The redundant spatial regions within and between temporal frames would weak the descriptive ability of the 3D CNN model. To address this problem, we propose a lightweight spatio-temporal attention module (ST-AM), composed of spatial attention module (SAM) and temporal attention module (TAM). SAM and TAM can effectively encode the semantic spatial areas and suppress the redundant temporal frames to reduce misclassification. The proposed SAM and TAM have complementary effects and can be easily embedded into the existing 3D CNN action recognition model. Experiment on UCF-101 and HMDB-51 datasets shows that the ST-AM embedded model achieves impressive performance on action recognition task.","PeriodicalId":292813,"journal":{"name":"Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"An Efficient Lightweight Spatio-temporal Attention Module for Action Recognition\",\"authors\":\"Zhonghua Sun, Meng Dai, Ziwen Yi, Tianyi Wang, Jinchao Feng, Kebin Jia\",\"doi\":\"10.1145/3581807.3581810\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Effective feature learning is one of the prime components for human action recognition algorithm. Three-dimensional convolutional neural network (3D CNN) can directly extract spatio-temporal features, however it is insufficient to capture the most discriminative part of the action video. The redundant spatial regions within and between temporal frames would weak the descriptive ability of the 3D CNN model. To address this problem, we propose a lightweight spatio-temporal attention module (ST-AM), composed of spatial attention module (SAM) and temporal attention module (TAM). SAM and TAM can effectively encode the semantic spatial areas and suppress the redundant temporal frames to reduce misclassification. The proposed SAM and TAM have complementary effects and can be easily embedded into the existing 3D CNN action recognition model. Experiment on UCF-101 and HMDB-51 datasets shows that the ST-AM embedded model achieves impressive performance on action recognition task.\",\"PeriodicalId\":292813,\"journal\":{\"name\":\"Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-11-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3581807.3581810\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3581807.3581810","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

有效的特征学习是人体动作识别算法的重要组成部分之一。三维卷积神经网络(3D CNN)可以直接提取动作视频的时空特征，但它不足以捕捉动作视频中最具判别性的部分。时间帧内和帧间的冗余空间区域会削弱三维CNN模型的描述能力。为了解决这一问题，我们提出了一个轻量级的时空注意模块(ST-AM)，由空间注意模块(SAM)和时间注意模块(TAM)组成。SAM和TAM可以有效地对语义空间区域进行编码，抑制冗余时间帧，减少误分类。所提出的SAM和TAM具有互补的效果，可以很容易地嵌入到现有的3D CNN动作识别模型中。在UCF-101和HMDB-51数据集上的实验表明，ST-AM嵌入式模型在动作识别任务上取得了令人满意的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

An Efficient Lightweight Spatio-temporal Attention Module for Action Recognition

Effective feature learning is one of the prime components for human action recognition algorithm. Three-dimensional convolutional neural network (3D CNN) can directly extract spatio-temporal features, however it is insufficient to capture the most discriminative part of the action video. The redundant spatial regions within and between temporal frames would weak the descriptive ability of the 3D CNN model. To address this problem, we propose a lightweight spatio-temporal attention module (ST-AM), composed of spatial attention module (SAM) and temporal attention module (TAM). SAM and TAM can effectively encode the semantic spatial areas and suppress the redundant temporal frames to reduce misclassification. The proposed SAM and TAM have complementary effects and can be easily embedded into the existing 3D CNN action recognition model. Experiment on UCF-101 and HMDB-51 datasets shows that the ST-AM embedded model achieves impressive performance on action recognition task.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition

自引率

0.00%

发文量

期刊最新文献

Multi-Scale Channel Attention for Chinese Scene Text Recognition Vehicle Re-identification Based on Multi-Scale Attention Feature Fusion Comparative Study on EEG Feature Recognition based on Deep Belief Network VA-TransUNet: A U-shaped Medical Image Segmentation Network with Visual Attention Traffic Flow Forecasting Research Based on Delay Reconstruction and GRU-SVR