{"title":"Spatio-Temporal Slowfast Self-Attention Network For Action Recognition","authors":"Myeongjun Kim, Taehun Kim, Daijin Kim","doi":"10.1109/ICIP40778.2020.9191290","DOIUrl":null,"url":null,"abstract":"We propose Spatio-Temporal SlowFast Self-Attention network for action recognition. Conventional Convolutional Neural Networks have the advantage of capturing the local area of the data. However, to understand a human action, it is appropriate to consider both human and the overall context of given scene. Therefore, we repurpose a self-attention mechanism from Self-Attention GAN (SAGAN) to our model for retrieving global semantic context when making action recognition. Using the self-attention mechanism, we propose a module that can extract four features in video information: spatial information, temporal information, slow action information, and fast action information. We train and test our network on the Atomic Visual Actions (AVA) dataset and show significant frame-AP improvements on 28 categories.","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":"156 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE International Conference on Image Processing (ICIP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICIP40778.2020.9191290","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 16
Abstract
We propose a Spatio-Temporal SlowFast Self-Attention network for action recognition. Conventional Convolutional Neural Networks are good at capturing local regions of the data. However, to understand a human action, it is necessary to consider both the human and the overall context of the given scene. Therefore, we repurpose the self-attention mechanism from the Self-Attention GAN (SAGAN) in our model to retrieve global semantic context for action recognition. Using this self-attention mechanism, we propose a module that extracts four kinds of features from video: spatial information, temporal information, slow-action information, and fast-action information. We train and test our network on the Atomic Visual Actions (AVA) dataset and show significant frame-AP improvements on 28 categories.
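To illustrate the kind of mechanism the abstract refers to, below is a minimal sketch (not the authors' code) of SAGAN-style self-attention applied to a spatio-temporal feature map. It assumes 5-D clip features of shape (batch, channels, time, height, width); the channel-reduction factor of 8 follows the original SAGAN formulation, and all module and variable names are illustrative.

```python
# Hedged sketch: SAGAN-style self-attention over all spatio-temporal positions.
# This is an assumption about the general form of such a module, not the
# paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalSelfAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        inner = max(channels // reduction, 1)
        # 1x1x1 convolutions act as the query/key/value projections.
        self.query = nn.Conv3d(channels, inner, kernel_size=1)
        self.key = nn.Conv3d(channels, inner, kernel_size=1)
        self.value = nn.Conv3d(channels, channels, kernel_size=1)
        # Learnable residual gate, initialized to 0 as in SAGAN.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, t, h, w = x.shape
        q = self.query(x).flatten(2)   # (b, inner, t*h*w)
        k = self.key(x).flatten(2)     # (b, inner, t*h*w)
        v = self.value(x).flatten(2)   # (b, c, t*h*w)
        # Each spatio-temporal location attends to every other location,
        # providing the global context that local convolutions miss.
        attn = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)  # (b, n, n)
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, t, h, w)
        return self.gamma * out + x

if __name__ == "__main__":
    feats = torch.randn(2, 64, 8, 14, 14)  # toy clip features
    print(SpatioTemporalSelfAttention(64)(feats).shape)  # (2, 64, 8, 14, 14)
```

In a SlowFast-style design, one would presumably apply such a module separately to the slow (low frame rate) and fast (high frame rate) pathways, which is consistent with the four feature types the abstract lists, but the exact placement is not specified here.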