Context-Aware Action Detection in Untrimmed Videos Using Bidirectional LSTM

2018 15th Conference on Computer and Robot Vision (CRV). Pub Date: 2018-05-01. DOI: 10.1109/CRV.2018.00039

Jaideep Singh Chauhan, Yang Wang

Citations: 2

Abstract

We consider the problem of action detection in untrimmed videos. We argue that the contextual information in a video is important for this task. Based on this intuition, we design a network using a bidirectional Long Short-Term Memory (Bi-LSTM) model that captures the contextual information in videos. Our model includes a modified loss function that encourages the network to learn action progression, and a backpropagation scheme in which gradients are weighted according to where they originate on the temporal scale. LSTMs are good at capturing long-range temporal dependencies, but less good at modeling local temporal features. In our model, we therefore use a 3-D Convolutional Neural Network (3-D ConvNet) to capture the local spatio-temporal features of the videos. We perform a comprehensive analysis of the importance of learning the context of the video. Finally, we evaluate our work on two action detection datasets, ActivityNet and THUMOS'14. Our method achieves competitive results compared with existing approaches on both datasets.
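The abstract does not say how the gradients are weighted by temporal origin, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a per-time-step softmax cross-entropy and a hypothetical linear ramp of weights (the function `linear_ramp_weights` and its `w_min`/`w_max` parameters are invented for illustration). Weighting each step's loss term by w_t scales the gradient that originates at that step by the same factor. No LSTM or 3-D ConvNet is modeled here; the logits stand in for whatever per-step scores the network would emit.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_ramp_weights(num_steps, w_min=0.5, w_max=1.0):
    """Hypothetical scheme: weights grow linearly from w_min to w_max over time."""
    if num_steps == 1:
        return [w_max]
    step = (w_max - w_min) / (num_steps - 1)
    return [w_min + step * t for t in range(num_steps)]


def weighted_sequence_loss(logits_seq, labels, weights):
    """Weighted sum of per-step cross-entropies and its gradient.

    Returns (loss, grads), where grads[t][k] is the derivative of the loss
    with respect to logits_seq[t][k]:
        weights[t] * (softmax(logits_seq[t])[k] - [k == labels[t]])
    """
    loss = 0.0
    grads = []
    for logits, label, w in zip(logits_seq, labels, weights):
        probs = softmax(logits)
        loss += -w * math.log(probs[label])
        grads.append([
            w * (p - (1.0 if k == label else 0.0))
            for k, p in enumerate(probs)
        ])
    return loss, grads


# Three time steps, two classes (0 = background, 1 = action).
# Identical logits at every step isolate the effect of the weights.
logits_seq = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
labels = [1, 1, 1]
weights = linear_ramp_weights(len(labels))
loss, grads = weighted_sequence_loss(logits_seq, labels, weights)
```

With identical predictions at every step, the gradient at the last step is twice as large as at the first, purely because of the assumed weights. In an actual Bi-LSTM the same effect could be obtained by scaling each step's loss term before backpropagation.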
