Harnessing Object and Scene Semantics for Large-Scale Video Understanding

2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Pub Date: 2016-06-27, DOI: 10.1109/CVPR.2016.339

Zuxuan Wu, Yanwei Fu, Yu-Gang Jiang, L. Sigal

Pages: 3112-3121

Citations: 87

Abstract

Large-scale action recognition and video categorization are important problems in computer vision. To address these problems, we propose a novel object- and scene-based semantic fusion network and representation. Our semantic fusion network combines three streams of information using a three-layer neural network: (i) frame-based low-level CNN features, (ii) object features from a state-of-the-art large-scale CNN object detector trained to recognize 20K classes, and (iii) scene features from a state-of-the-art CNN scene detector trained to recognize 205 scenes. The trained network achieves improvements in supervised activity and video categorization on two complex large-scale datasets, ActivityNet and FCVID, respectively. Further, by examining and backpropagating information through the fusion network, semantic relationships (correlations) between video classes and objects/scenes can be discovered. These video class-object and video class-scene relationships can in turn be used as a semantic representation for the video classes themselves. We illustrate the effectiveness of this semantic representation through experiments on zero-shot action/video classification and clustering.
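The three-stream fusion described in the abstract can be illustrated with a minimal sketch. The abstract does not give the layer layout, so everything structural here is an assumption made for illustration: one projection per stream, concatenation, a shared fusion layer, and a softmax output. The class name `FusionNet`, the helper functions, and the toy dimensions are all hypothetical, and the weights are random rather than trained.

```python
import math
import random


def dense(x, weights, bias):
    """Affine map: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Random weights and zero biases (untrained, illustration only)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class FusionNet:
    """Hypothetical three-layer fusion of frame, object, and scene features."""

    def __init__(self, dims, hidden, fused, n_classes, seed=0):
        rng = random.Random(seed)
        # Layer 1 (assumed): a separate projection for each input stream.
        self.stream_layers = {name: init_layer(rng, d, hidden) for name, d in dims.items()}
        # Layer 2 (assumed): fuses the concatenated stream projections.
        self.fusion_layer = init_layer(rng, hidden * len(dims), fused)
        # Layer 3 (assumed): maps the fused vector to video class scores.
        self.output_layer = init_layer(rng, fused, n_classes)

    def forward(self, features):
        merged = []
        for name, (weights, bias) in self.stream_layers.items():
            merged.extend(relu(dense(features[name], weights, bias)))
        fused = relu(dense(merged, *self.fusion_layer))
        return softmax(dense(fused, *self.output_layer))


# Toy sizes only; the abstract mentions 20K object classes and 205 scene classes.
dims = {"frame": 8, "object": 12, "scene": 5}
net = FusionNet(dims, hidden=6, fused=4, n_classes=3)
rng = random.Random(1)
video = {name: [rng.random() for _ in range(d)] for name, d in dims.items()}
probs = net.forward(video)  # one probability per video class
```

In a trained network of this shape, the weights linking the object and scene inputs to each output class are what one would inspect to recover class-object and class-scene relationships; this sketch only covers the forward pass.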


