Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks

2017 IEEE International Conference on Computer Vision (ICCV). Pub Date: 2017-10-01. DOI: 10.1109/ICCV.2017.590

Zhaofan Qiu, Ting Yao, Tao Mei

Pages: 5534-5542

Citations: 1420

Abstract

Convolutional Neural Networks (CNN) have been regarded as a powerful class of models for image recognition problems. Nevertheless, it is not trivial when utilizing a CNN for learning spatio-temporal video representation. A few studies have shown that performing 3D convolutions is a rewarding approach to capture both spatial and temporal dimensions in videos. However, the development of a very deep 3D CNN from scratch results in expensive computational cost and memory demand. A valid question is why not recycle off-the-shelf 2D networks for a 3D CNN. In this paper, we devise multiple variants of bottleneck building blocks in a residual learning framework by simulating 3 × 3 × 3 convolutions with 1 × 3 × 3 convolutional filters on spatial domain (equivalent to 2D CNN) plus 3 × 1 × 1 convolutions to construct temporal connections on adjacent feature maps in time. Furthermore, we propose a new architecture, named Pseudo-3D Residual Net (P3D ResNet), that exploits all the variants of blocks but composes each in different placement of ResNet, following the philosophy that enhancing structural diversity with going deep could improve the power of neural networks. Our P3D ResNet achieves clear improvements on Sports-1M video classification dataset against 3D CNN and frame-based 2D CNN by 5.3% and 1.8%, respectively. We further examine the generalization performance of video representation produced by our pre-trained P3D ResNet on five different benchmarks and three different tasks, demonstrating superior performances over several state-of-the-art techniques.
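
The saving behind replacing a 3 × 3 × 3 convolution with a 1 × 3 × 3 spatial filter plus a 3 × 1 × 1 temporal filter can be illustrated with a small weight count. The sketch below is not taken from the paper: the channel width of 64, the equal input and output channel counts, the omission of bias terms, and the function names are all assumptions made for illustration. It also treats the two filters as applied one after the other with stride 1, which is only one possible arrangement; the abstract does not say how the block variants differ.

```python
def conv3d_weights(c_in, c_out, kt, kh, kw):
    """Weights in a 3D convolution kernel of size kt x kh x kw (bias ignored)."""
    return c_in * c_out * kt * kh * kw


def full_3d_weights(channels):
    """A single 3 x 3 x 3 convolution with `channels` in and out (assumed equal)."""
    return conv3d_weights(channels, channels, 3, 3, 3)


def pseudo_3d_weights(channels):
    """A 1 x 3 x 3 spatial convolution plus a 3 x 1 x 1 temporal convolution."""
    spatial = conv3d_weights(channels, channels, 1, 3, 3)
    temporal = conv3d_weights(channels, channels, 3, 1, 1)
    return spatial + temporal


def receptive_field(kernels):
    """Extent (time, height, width) covered by stacking stride-1 kernels in sequence."""
    extent = [1, 1, 1]
    for kernel in kernels:
        for axis, size in enumerate(kernel):
            extent[axis] += size - 1
    return tuple(extent)


channels = 64  # hypothetical width, chosen only for illustration
full = full_3d_weights(channels)
pseudo = pseudo_3d_weights(channels)
print(f"3x3x3 weights:          {full}")
print(f"1x3x3 + 3x1x1 weights:  {pseudo}")
print(f"ratio:                  {pseudo / full:.3f}")
print(f"stacked receptive field: {receptive_field([(1, 3, 3), (3, 1, 1)])}")
```

Under these assumptions the factorized pair uses 12 weights per input-output channel pair instead of 27, while the stacked filters still cover a 3 × 3 × 3 neighbourhood. The 1 × 3 × 3 filter has the same shape as a 2D 3 × 3 kernel, which is what makes reusing off-the-shelf 2D networks plausible.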


