You Lead, We Exceed: Labor-Free Video Concept Learning by Jointly Exploiting Web Videos and Images

Chuang Gan, Ting Yao, Kuiyuan Yang, Yi Yang, Tao Mei

2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 923-932. DOI: 10.1109/CVPR.2016.106
Video concept learning often requires a large set of training samples. In practice, however, acquiring noise-free training labels with sufficient positive examples is very expensive. A plausible solution for training data collection is to sample from the vast quantities of images and videos on the Web. Such a solution is motivated by the assumption that the retrieved images or videos are highly correlated with the query. Still, a number of challenges remain. First, Web videos are often untrimmed, so only parts of the videos are relevant to the query. Second, the retrieved Web images are always highly relevant to the issued query; however, thoughtlessly utilizing the images in the video domain may even hurt performance due to the well-known semantic drift and domain gap problems. As a result, a valid question is how Web images and videos should interact for video concept learning. In this paper, we propose a Lead-Exceed Neural Network (LENN), which reinforces the training on Web images and videos in a curriculum manner. Specifically, the training proceeds by first inputting frames of Web videos to obtain an initial network. The Web images are then filtered by the learnt network, and the selected images are additionally fed into the network to enhance the architecture and further trim the videos. In addition, Long Short-Term Memory (LSTM) can be applied on the trimmed videos to exploit temporal information. Encouraging results are reported on UCF101, TRECVID 2013 and 2014 MEDTest in the context of both action recognition and event detection. Without using human-annotated exemplars, our proposed LENN achieves 74.4% accuracy on the UCF101 dataset.
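The abstract outlines a curriculum: first train a frame-level network on Web video frames ("lead"), then use that network to filter the retrieved Web images and fine-tune on the retained ones ("exceed"), and finally trim the videos and run an LSTM over the remaining frames for temporal modeling. The sketch below illustrates that flow in PyTorch on synthetic data; every module size, confidence threshold, and the single-video LSTM pass are assumptions made for illustration, not the paper's actual architecture or training procedure.

```python
# Minimal sketch of the lead-exceed curriculum described in the abstract.
# All sizes, thresholds, and the synthetic tensors below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 101   # e.g. UCF101 concepts (assumed plain softmax classifier)
FEAT_DIM = 128

class FrameNet(nn.Module):
    """Tiny frame-level classifier standing in for the CNN trained on Web video frames."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, FEAT_DIM), nn.ReLU(),
        )
        self.head = nn.Linear(FEAT_DIM, NUM_CLASSES)

    def forward(self, x):
        feats = self.backbone(x)
        return self.head(feats), feats

def train_step(model, opt, images, labels):
    logits, _ = model(images)
    loss = F.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Stage 1 ("lead"): train on frames sampled from the retrieved Web videos.
model = FrameNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
video_frames = torch.randn(32, 3, 64, 64)            # synthetic stand-in frames
video_labels = torch.randint(0, NUM_CLASSES, (32,))  # query label of each source video
train_step(model, opt, video_frames, video_labels)

# Stage 2 ("exceed"): filter Web images with the learnt network, keep only the
# images the model finds consistent with their query, then fine-tune on them.
web_images = torch.randn(64, 3, 64, 64)
web_labels = torch.randint(0, NUM_CLASSES, (64,))
with torch.no_grad():
    probs = F.softmax(model(web_images)[0], dim=1)
    conf = probs[torch.arange(len(web_labels)), web_labels]
keep = conf > 0.01                                    # assumed confidence threshold
if keep.any():
    train_step(model, opt, web_images[keep], web_labels[keep])

# Stage 3: trim videos by keeping high-confidence frames, then model the
# remaining frame sequence with an LSTM to exploit temporal information.
with torch.no_grad():
    frame_probs = F.softmax(model(video_frames)[0], dim=1)
    frame_conf = frame_probs[torch.arange(len(video_labels)), video_labels]
trimmed = video_frames[frame_conf > 0.005]            # assumed trimming rule
if trimmed.shape[0] == 0:
    trimmed = video_frames                            # fall back to the full video
lstm = nn.LSTM(FEAT_DIM, 64, batch_first=True)
classifier = nn.Linear(64, NUM_CLASSES)
with torch.no_grad():
    _, feats = model(trimmed)                         # per-frame features
seq = feats.unsqueeze(0)                              # (1, T, FEAT_DIM): one video
out, _ = lstm(seq)
video_logits = classifier(out[:, -1])                 # predict from the last time step
print(video_logits.shape)
```

The key design choice this sketch tries to mirror is that the video-trained network acts as the filter for both Web images and video frames, so noisy Web data only enters training after the model itself judges it relevant to the query.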