Adaptive Pooling in Multi-instance Learning for Web Video Annotation
Dong Liu, Y. Zhou, Xiaoyan Sun, Zhengjun Zha, Wenjun Zeng
2017 IEEE International Conference on Computer Vision Workshops (ICCVW), published 22 October 2017
DOI: 10.1109/ICCVW.2017.46 · Citations: 38
Web videos are usually weakly annotated: a tag is associated with a video whenever the corresponding concept appears in some frame of the video, without indicating when or where it occurs. Such weak annotations cause significant problems for many Web video applications, e.g., search and recommendation. In this paper, we present a new Web video annotation approach based on multi-instance learning (MIL) with a learnable pooling function. By formulating Web video annotation as a MIL problem, we present an end-to-end deep network framework in which frame-level (instance) annotations are estimated from tags given at the video (bag-of-instances) level via a convolutional neural network (CNN). A learnable pooling function adaptively fuses the CNN's frame-level outputs to determine the tags at the video level. We further propose a new loss function that combines bag-level and instance-level losses, which makes the penalty term aware of the internal state of the network rather than only an overall loss, and thus allows the pooling function to be learned better and faster. Experimental results demonstrate that our framework not only improves the accuracy of Web video annotation, outperforming state-of-the-art Web video annotation methods on the large-scale video dataset FCVID, but also helps infer the most relevant frames in Web videos.
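To make the two core ideas in the abstract concrete, the sketch below illustrates (i) a learnable pooling function that fuses per-frame CNN outputs into a video-level prediction and (ii) a loss combining bag-level and instance-level terms. This is a minimal PyTorch sketch under assumptions of our own: the module name AdaptivePooling, the softmax-with-temperature fusion, the mil_loss helper, and the instance_weight factor are hypothetical illustrations, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptivePooling(nn.Module):
    """Learnable pooling over frame-level tag probabilities.

    A per-class temperature controls a softmax-weighted average of the
    frame scores, so the layer can move between mean pooling (temperature
    near zero) and max pooling (large temperature) as training dictates.
    """

    def __init__(self, num_classes: int):
        super().__init__()
        # One learnable sharpness parameter per tag (an assumption for this sketch).
        self.log_temperature = nn.Parameter(torch.zeros(num_classes))

    def forward(self, frame_scores: torch.Tensor) -> torch.Tensor:
        # frame_scores: (batch, num_frames, num_classes), per-frame probabilities in [0, 1].
        temperature = self.log_temperature.exp()              # (num_classes,)
        weights = F.softmax(frame_scores * temperature, dim=1)
        return (weights * frame_scores).sum(dim=1)            # (batch, num_classes) video-level scores


def mil_loss(frame_scores, video_scores, video_labels, instance_weight=0.3):
    """Bag-level plus instance-level loss (one plausible form).

    The bag term compares the pooled video-level scores with the weak
    video-level tags; the instance term pushes the most confident frame
    per class toward the same tags, so the penalty sees the network's
    internal frame-level state rather than only an overall loss.
    """
    bag_loss = F.binary_cross_entropy(video_scores, video_labels)
    top_frame_scores, _ = frame_scores.max(dim=1)             # most confident frame per class
    instance_loss = F.binary_cross_entropy(top_frame_scores, video_labels)
    return bag_loss + instance_weight * instance_loss


# Toy usage: 2 videos, 8 frames each, 5 candidate tags.
if __name__ == "__main__":
    frame_scores = torch.rand(2, 8, 5)                        # stand-in for per-frame CNN outputs
    video_labels = torch.randint(0, 2, (2, 5)).float()        # weak video-level tags
    pooling = AdaptivePooling(num_classes=5)
    video_scores = pooling(frame_scores)
    loss = mil_loss(frame_scores, video_scores, video_labels)
    print(video_scores.shape, loss.item())
```

The per-class temperature is only one way to make the pooling adaptive; the property this sketch shares with the approach described above is that the fusion of frame-level outputs is learned end to end together with the CNN, rather than fixed to mean or max pooling.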