M2A: Motion Aware Attention for Accurate Video Action Recognition

Brennan Gebotys, Alexander Wong, David A Clausi
{"title":"M2A: Motion Aware Attention for Accurate Video Action Recognition","authors":"Brennan Gebotys, Alexander Wong, David A Clausi","doi":"10.1109/CRV55824.2022.00019","DOIUrl":null,"url":null,"abstract":"Advancements in attention mechanisms have led to significant performance improvements in a variety of areas in machine learning due to its ability to enable the dynamic modeling of temporal sequences. A particular area in computer vision that is likely to benefit greatly from the incorporation of attention mechanisms in video action recognition. However, much of the current research's focus on attention mechanisms have been on spatial and temporal attention, which are unable to take advantage of the inherent motion found in videos. Motivated by this, we develop a new attention mechanism called Motion Aware Attention (M2A) that explicitly incorporates motion characteris-tics. More specifically, M2A extracts motion information between consecutive frames and utilizes attention to focus on the motion patterns found across frames to accurately recognize actions in videos. The proposed M2A mechanism is simple to implement and can be easily incorporated into any neural network backbone architecture. We show that incorporating motion mechanisms with attention mechanisms using the proposed M2A mechanism can lead to a $+15\\%$ to $+26\\%$ improvement in top-1 accuracy across different backbone architectures, with only a small in-crease in computational complexity. We further compared the performance of M2A with other state-of-the-art motion and at-tention mechanisms on the Something-Something V1 video action recognition benchmark. Experimental results showed that M2A can lead to further improvements when combined with other temporal mechanisms and that it outperforms other motion-only or attention-only mechanisms by as much as $+60\\%$ in top-1 accuracy for specific classes in the benchmark. We make our code available at: https://github.com/gebob19/M2A.","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 19th Conference on Robots and Vision (CRV)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CRV55824.2022.00019","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Advancements in attention mechanisms have led to significant performance improvements in a variety of areas in machine learning due to their ability to enable the dynamic modeling of temporal sequences. A particular area in computer vision that is likely to benefit greatly from the incorporation of attention mechanisms is video action recognition. However, much of the current research on attention mechanisms has focused on spatial and temporal attention, which are unable to take advantage of the inherent motion found in videos. Motivated by this, we develop a new attention mechanism called Motion Aware Attention (M2A) that explicitly incorporates motion characteristics. More specifically, M2A extracts motion information between consecutive frames and utilizes attention to focus on the motion patterns found across frames to accurately recognize actions in videos. The proposed M2A mechanism is simple to implement and can be easily incorporated into any neural network backbone architecture. We show that incorporating motion mechanisms with attention mechanisms using the proposed M2A mechanism can lead to a +15% to +26% improvement in top-1 accuracy across different backbone architectures, with only a small increase in computational complexity. We further compared the performance of M2A with other state-of-the-art motion and attention mechanisms on the Something-Something V1 video action recognition benchmark. Experimental results showed that M2A can lead to further improvements when combined with other temporal mechanisms and that it outperforms other motion-only or attention-only mechanisms by as much as +60% in top-1 accuracy for specific classes in the benchmark. We make our code available at: https://github.com/gebob19/M2A.
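The abstract describes M2A at a high level: motion information is extracted between consecutive frames, and attention then focuses on the motion patterns found across frames. Below is a minimal PyTorch sketch of that idea, assuming per-frame backbone features of shape (batch, time, channels); the module name, the frame-differencing step, and the residual wiring are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn


class MotionAwareAttention(nn.Module):
    """Illustrative motion-aware attention block (not the authors' code).

    Motion features are taken as differences between consecutive frame
    features; multi-head attention over those motion features then
    reweights the per-frame representations.
    """

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels) -- per-frame features from any backbone.
        # Motion information between consecutive frames, zero-padded at the
        # front so the temporal length is preserved.
        motion = torch.cat([torch.zeros_like(x[:, :1]), x[:, 1:] - x[:, :-1]], dim=1)
        # Attention driven by motion patterns: queries and keys come from the
        # motion features, while values are the original frame features.
        attended, _ = self.attn(motion, motion, x)
        # Residual connection keeps the block easy to drop into a backbone.
        return self.norm(x + attended)


if __name__ == "__main__":
    block = MotionAwareAttention(channels=256)
    feats = torch.randn(2, 8, 256)   # 2 clips, 8 frames, 256-d features
    print(block(feats).shape)        # torch.Size([2, 8, 256])
```

Because the block preserves the input shape, it can be inserted between stages of an existing video backbone, which matches the abstract's claim that M2A is simple to incorporate into any architecture.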
Latest articles in 2022 19th Conference on Robots and Vision (CRV):
- A View Invariant Human Action Recognition System for Noisy Inputs
- TemporalNet: Real-time 2D-3D Video Object Detection
- Occluded Text Detection and Recognition in the Wild
- Anomaly Detection with Adversarially Learned Perturbations of Latent Space
- Occlusion-Aware Self-Supervised Stereo Matching with Confidence Guided Raw Disparity Fusion