Action Recognition Using Multi-Scale Temporal Shift Module and Temporal Feature Difference Extraction Based on 2D CNN

Kun-Hsuan Wu, Ching-Te Chiu
{"title":"Action Recognition Using Multi-Scale Temporal Shift Module and Temporal Feature Difference Extraction Based on 2D CNN","authors":"Kun-Hsuan Wu, Ching-Te Chiu","doi":"10.4236/JSEA.2021.145011","DOIUrl":null,"url":null,"abstract":"Convolutional neural networks, which have achieved outstanding performance in image recognition, have been extensively applied to action recognition. The mainstream approaches to video understanding can be categorized into two-dimensional and three-dimensional convolutional neural networks. Although three-dimensional convolutional filters can learn the temporal correlation between different frames by extracting the features of multiple frames simultaneously, it results in an explosive number of parameters and calculation cost. Methods based on two-dimensional convolutional neural networks use fewer parameters; they often incorporate optical flow to compensate for their inability to learn temporal relationships. However, calculating the corresponding optical flow results in additional calculation cost; further, it necessitates the use of another model to learn the features of optical flow. We proposed an action recognition framework based on the two-dimensional convolutional neural network; therefore, it was necessary to resolve the lack of temporal relationships. To expand the temporal receptive field, we proposed a multi-scale temporal shift module, which was then combined with a temporal feature difference extraction module to extract the difference between the features of different frames. Finally, the model was compressed to make it more compact. We evaluated our method on two major action recognition benchmarks: the HMDB51 and UCF-101 datasets. Before compression, the proposed method achieved an accuracy of 72.83% on the HMDB51 dataset and 96.25% on the UCF-101 dataset. Following compression, the accuracy was still impressive, at 95.57% and 72.19% on each dataset. The final model was more compact than most related works.","PeriodicalId":62222,"journal":{"name":"软件工程与应用(英文)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2021-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"软件工程与应用(英文)","FirstCategoryId":"1093","ListUrlMain":"https://doi.org/10.4236/JSEA.2021.145011","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Convolutional neural networks, which have achieved outstanding performance in image recognition, have been extensively applied to action recognition. Mainstream approaches to video understanding can be categorized into two-dimensional and three-dimensional convolutional neural networks. Although three-dimensional convolutional filters can learn the temporal correlation between frames by extracting the features of multiple frames simultaneously, this leads to an explosive growth in the number of parameters and in computational cost. Methods based on two-dimensional convolutional neural networks use fewer parameters and often incorporate optical flow to compensate for their inability to learn temporal relationships. However, computing optical flow incurs additional cost and requires a separate model to learn the optical-flow features. We propose an action recognition framework based on a two-dimensional convolutional neural network, so the lack of temporal modeling must be resolved within the network itself. To expand the temporal receptive field, we propose a multi-scale temporal shift module, which we combine with a temporal feature difference extraction module that extracts the differences between the features of different frames. Finally, the model is compressed to make it more compact. We evaluated our method on two major action recognition benchmarks: the HMDB51 and UCF-101 datasets. Before compression, the proposed method achieved an accuracy of 72.83% on HMDB51 and 96.25% on UCF-101. After compression, accuracy remained high: 72.19% on HMDB51 and 95.57% on UCF-101. The final model is more compact than those of most related works.
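The abstract describes the two modules only at a high level, and the paper's implementation is not reproduced here. The PyTorch sketch below illustrates how a multi-scale temporal shift and a temporal feature difference can be realized, modeled on the original temporal shift module (TSM) formulation; the function names, the `shift_div` and `scales` parameters, and the zero-fill convention at clip boundaries are assumptions for illustration, not the authors' code.

```python
import torch


def multi_scale_temporal_shift(x, num_segments, shift_div=8, scales=(1, 2)):
    """Multi-scale variant of the temporal shift idea: for each scale s,
    one channel group is shifted s frames toward the past and one s frames
    toward the future, so a plain 2D convolution applied afterwards mixes
    information across a wider temporal window.

    x: features of shape (N*T, C, H, W), where T = num_segments.
    """
    nt, c, h, w = x.shape
    n = nt // num_segments
    x = x.view(n, num_segments, c, h, w)
    fold = c // shift_div         # channels per shifted group (assumed ratio)
    out = torch.zeros_like(x)     # shifted-out boundary positions are zero-filled
    used = 0
    for s in scales:
        a, b = used, used + fold
        out[:, :-s, a:b] = x[:, s:, a:b]                # shift left by s frames
        out[:, s:, b:b + fold] = x[:, :-s, b:b + fold]  # shift right by s frames
        used += 2 * fold
    out[:, :, used:] = x[:, :, used:]                   # remaining channels unshifted
    return out.view(nt, c, h, w)


def temporal_feature_difference(x, num_segments):
    """Frame-to-frame feature difference; the first frame has no
    predecessor, so its difference is left as zero."""
    nt, c, h, w = x.shape
    n = nt // num_segments
    x = x.view(n, num_segments, c, h, w)
    diff = torch.zeros_like(x)
    diff[:, 1:] = x[:, 1:] - x[:, :-1]
    return diff.view(nt, c, h, w)


if __name__ == "__main__":
    feats = torch.randn(2 * 8, 64, 14, 14)    # 2 clips of 8 frames each
    shifted = multi_scale_temporal_shift(feats, num_segments=8)
    motion = temporal_feature_difference(feats, num_segments=8)
    print(shifted.shape, motion.shape)        # both (16, 64, 14, 14)
```

With `shift_div=8` and `scales=(1, 2)`, half of the channels are shifted (a quarter per scale); the additional ±2-frame shifts are what would widen the temporal receptive field beyond the single-step shifts of the vanilla temporal shift module.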