Video Understanding via Convolutional Temporal Pooling Network and Multimodal Feature Fusion

Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild
Pub Date: 2018-10-15
DOI: 10.1145/3265987.3265991

Heeseung Kwon, Suha Kwak, Minsu Cho

Citations: 3

Abstract

In this paper, we present a new end-to-end convolutional neural network architecture for video classification and apply it to action and scene recognition in untrimmed videos for the Challenge on Comprehensive Video Understanding in the Wild. The proposed architecture takes densely sampled video frames as input and applies a temporal pooling operator inside the network to capture the temporal context of the input video. By using a set of different temporal pooling operators, the architecture outputs a distinct video-level feature for each operator. Furthermore, we design a multimodal feature fusion model by concatenating our video-level features with those provided in the challenge dataset. Experimental results on the challenge dataset demonstrate that the proposed architecture and the multimodal feature fusion approach together achieve strong performance in action and scene recognition.
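The two ideas in the abstract, temporal pooling of frame-level features and fusion by concatenation, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the pooling operators, so mean and max pooling are assumptions here, and the function names `temporal_pool` and `fuse` are hypothetical. The sketch also pools fixed feature vectors outside any network, whereas the paper applies pooling inside an end-to-end trained CNN.

```python
def temporal_pool(frame_features, mode):
    """Collapse per-frame feature vectors into one video-level vector.

    frame_features: list of equal-length lists, one per frame.
    mode: "mean" or "max" (assumed operators, not taken from the paper).
    """
    columns = list(zip(*frame_features))  # group values by feature dimension
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


def fuse(frame_features, provided_features, modes=("mean", "max")):
    """Concatenate one pooled vector per operator with externally provided features."""
    fused = []
    for mode in modes:
        fused.extend(temporal_pool(frame_features, mode))
    fused.extend(provided_features)
    return fused


# Toy input: 3 frames with 2-dimensional features, plus a 1-dimensional
# feature standing in for those supplied with the challenge dataset.
frames = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0]]
provided = [0.5]
fused = fuse(frames, provided)
print(fused)  # [3.0, 2.0, 5.0, 4.0, 0.5]
```

Each pooling operator contributes its own video-level vector, so the fused vector's length is the feature dimension times the number of operators plus the length of the provided features. A classifier would then be trained on the fused vector.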
