
Latest publications: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Weakly Supervised Actor-Action Segmentation via Robust Multi-task Ranking
Pub Date : 2017-07-21 DOI: 10.1109/CVPR.2017.115
Yan Yan, Chenliang Xu, Dawen Cai, Jason J. Corso
Fine-grained activity understanding in videos has attracted considerable recent attention, with a shift from action classification to detailed actor and action understanding that provides compelling results for the perceptual needs of cutting-edge autonomous systems. However, current methods for detailed understanding of actor and action have significant limitations: they require large amounts of finely labeled data, and they fail to capture any internal relationship among actors and actions. To address these issues, in this paper, we propose a novel, robust multi-task ranking model for weakly supervised actor-action segmentation where only video-level tags are given for training samples. Our model is able to share useful information among different actors and actions while learning a ranking matrix to select representative supervoxels for actors and actions respectively. Final segmentation results are generated by a conditional random field that considers various ranking scores for different video parts. Extensive experimental results on the Actor-Action Dataset (A2D) demonstrate that the proposed approach outperforms the state-of-the-art weakly supervised methods and performs as well as the top-performing fully supervised method.
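For a concrete feel of the ranking idea, the toy sketch below learns a linear ranking model from video-level tags alone, MIL-style: the top-ranked supervoxel of a tagged video must outscore every supervoxel of an untagged video. Features, dimensions, and the hinge loss are illustrative assumptions; the paper's full model additionally couples tasks across actors/actions and feeds the scores into a CRF.

```python
# Toy MIL-style ranking from video-level tags only (not the authors' implementation).
import numpy as np

rng = np.random.default_rng(0)
d = 16                                      # supervoxel feature dimension (assumed)
videos = [(rng.normal(size=(30, d)), 1) for _ in range(5)]   # (supervoxel features, tag)
videos += [(rng.normal(size=(30, d)), 0) for _ in range(5)]

w = np.zeros(d)                             # linear ranking model
lr, lam = 0.05, 1e-3
for _ in range(200):
    grad = lam * w
    pos = [(X, X @ w) for X, y in videos if y == 1]
    neg = [(X, X @ w) for X, y in videos if y == 0]
    for Xp, sp in pos:
        i = int(np.argmax(sp))              # most representative supervoxel of a tagged video
        for Xn, sn in neg:
            j = int(np.argmax(sn))          # hardest supervoxel of an untagged video
            if 1.0 - (sp[i] - sn[j]) > 0:   # hinge ranking violation
                grad -= Xp[i] - Xn[j]
    w -= lr * grad

scores = videos[0][0] @ w                   # per-supervoxel ranking scores for video 0
print("top-ranked supervoxels of video 0:", np.argsort(scores)[::-1][:5])
```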
Citations: 48
Surface Motion Capture Transfer with Gaussian Process Regression
Pub Date : 2017-07-21 DOI: 10.1109/CVPR.2017.379
A. Boukhayma, Jean-Sébastien Franco, Edmond Boyer
We address the problem of transferring motion between captured 4D models. We particularly focus on human subjects, for which the ability to automatically augment 4D datasets by propagating movements between subjects is of interest in many recent vision applications that build on human visual corpora. Given 4D training sets for two subjects for which a sparse set of corresponding keyposes is known, our method is able to transfer a newly captured motion from one subject to the other. To generalize transfers to input motions that may be very diverse with respect to the training sets, the method contributes a new transfer model based on non-linear pose interpolation. Building on Gaussian process regression, this model intends to capture and preserve individual motion properties, and thereby realism, by accounting for pose inter-dependencies during motion transfers. Our experiments show qualitative and quantitative improvements over existing pose-mapping methods and confirm the generalization capabilities of our method compared to the state of the art.
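A minimal sketch of the regression step, assuming synthetic pose vectors: a Gaussian process is fit on a sparse set of corresponding key poses and then maps every frame of a newly captured source motion to target-subject poses. The dimensionality, kernel, and data are assumptions, not the paper's exact setup.

```python
# GP regression from source-subject poses to target-subject poses (illustrative only).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
d_src, d_tgt, n_keyposes = 20, 20, 8
X_key = rng.normal(size=(n_keyposes, d_src))                    # source key poses
Y_key = X_key @ rng.normal(size=(d_src, d_tgt)) \
        + 0.01 * rng.normal(size=(n_keyposes, d_tgt))           # corresponding target key poses

gp = GaussianProcessRegressor(kernel=RBF(length_scale=2.0) + WhiteKernel(noise_level=1e-4),
                              normalize_y=True)
gp.fit(X_key, Y_key)

new_motion = rng.normal(size=(50, d_src))                       # newly captured source motion
transferred = gp.predict(new_motion)                            # per-frame target-subject poses
print(transferred.shape)                                        # (50, 20)
```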
Citations: 13
Latent Multi-view Subspace Clustering
Pub Date : 2017-07-21 DOI: 10.1109/CVPR.2017.461
Changqing Zhang, Q. Hu, H. Fu, Peng Fei Zhu, Xiaochun Cao
In this paper, we propose a novel Latent Multi-view Subspace Clustering (LMSC) method, which clusters data points with a latent representation and simultaneously explores underlying complementary information from multiple views. Unlike most existing single-view subspace clustering methods that reconstruct data points using original features, our method seeks the underlying latent representation and simultaneously performs data reconstruction based on the learned latent representation. With the complementarity of multiple views, the latent representation can depict the data more comprehensively than each single view individually, which accordingly makes the subspace representation more accurate and robust as well. The proposed method is intuitive and can be optimized efficiently by using the Augmented Lagrangian Multiplier with Alternating Direction Minimization (ALM-ADM) algorithm. Extensive experiments on benchmark datasets have validated the effectiveness of our proposed method.
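As a rough illustration of the subspace-clustering step, the sketch below solves a plain least-squares self-expressiveness problem on concatenated views and feeds the resulting affinity to spectral clustering. The actual LMSC model additionally learns a latent representation shared across views and is optimized with ALM-ADM; the synthetic data and the regularizer here are assumptions.

```python
# Simplified self-expressive subspace clustering on concatenated views (toy baseline).
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
n, k = 90, 3
views = [rng.normal(size=(n, 40)) + np.repeat(rng.normal(size=(k, 40)), n // k, axis=0)
         for _ in range(2)]                          # two synthetic views, 3 clusters
X = np.hstack(views).T                               # features x samples

lam = 1.0
G = X.T @ X
Z = np.linalg.solve(G + lam * np.eye(n), G)          # min ||X - XZ||_F^2 + lam ||Z||_F^2
A = np.abs(Z) + np.abs(Z).T                          # symmetric non-negative affinity
labels = SpectralClustering(n_clusters=k, affinity="precomputed",
                            random_state=0).fit_predict(A)
print(labels)
```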
Citations: 314
Direct Photometric Alignment by Mesh Deformation
Pub Date : 2017-07-21 DOI: 10.1109/CVPR.2017.289
Kaimo Lin, Nianjuan Jiang, Shuaicheng Liu, L. Cheong, M. Do, Jiangbo Lu
The choice of motion models is vital in applications like image/video stitching and video stabilization. Conventional methods explored different approaches ranging from simple global parametric models to complex per-pixel optical flow. Mesh-based warping methods achieve a good balance between computational complexity and model flexibility. However, they typically require high quality feature correspondences and suffer from mismatches and low-textured image content. In this paper, we propose a mesh-based photometric alignment method that minimizes pixel intensity difference instead of Euclidean distance of known feature correspondences. The proposed method combines the superior performance of dense photometric alignment with the efficiency of mesh-based image warping. It achieves better global alignment quality than the feature-based counterpart in textured images, and more importantly, it is also robust to low-textured image content. Abundant experiments show that our method can handle a variety of images and videos, and outperforms representative state-of-the-art methods in both image stitching and video stabilization tasks.
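The core objective can be illustrated with a much simpler warp: the sketch below aligns two images by minimizing pixel intensity differences directly, optimizing only a global translation. The paper instead optimizes per-vertex mesh displacements, but the photometric cost has the same form; the images and optimizer settings below are assumptions.

```python
# Direct photometric alignment with a global translation (toy stand-in for mesh warping).
import numpy as np
from scipy.optimize import minimize
from scipy.ndimage import gaussian_filter, shift as warp_shift

rng = np.random.default_rng(0)
target = gaussian_filter(rng.random((64, 64)), sigma=6)    # smooth synthetic image
source = warp_shift(target, (2.5, -1.5), order=1)          # source = shifted copy of target

def photometric_cost(p):
    # pixel intensity difference after warping the source by the candidate shift
    warped = warp_shift(source, (p[0], p[1]), order=1)
    return np.mean((warped - target) ** 2)

res = minimize(photometric_cost, x0=np.array([1.0, 1.0]), method="Nelder-Mead")
print("estimated shift:", res.x)                           # close to (-2.5, 1.5)
```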
Citations: 47
MDNet: A Semantically and Visually Interpretable Medical Image Diagnosis Network
Pub Date : 2017-07-08 DOI: 10.1109/CVPR.2017.378
Zizhao Zhang, Yuanpu Xie, F. Xing, M. McGough, L. Yang
The inability to interpret the model prediction in semantically and visually meaningful ways is a well-known shortcoming of most existing computer-aided diagnosis methods. In this paper, we propose MDNet to establish a direct multimodal mapping between medical images and diagnostic reports that can read images, generate diagnostic reports, retrieve images by symptom descriptions, and visualize attention, to provide justifications of the network diagnosis process. MDNet includes an image model and a language model. The image model is proposed to enhance multi-scale feature ensembles and utilization efficiency. The language model, integrated with our improved attention mechanism, aims to read and explore discriminative image feature descriptions from reports to learn a direct mapping from sentence words to image pixels. The overall network is trained end-to-end by using our developed optimization strategy. Based on a dataset of pathology bladder cancer images and their diagnostic reports (BCIDR), we conduct sufficient experiments to demonstrate that MDNet outperforms comparative baselines. The proposed image model obtains state-of-the-art performance on two CIFAR datasets as well.
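A minimal PyTorch sketch of the image-to-report idea, under stated assumptions: a small CNN produces a grid of image features, word-level attention over that grid feeds an LSTM cell, and the LSTM emits report tokens. Layer sizes, the vocabulary, and the attention form are illustrative and not MDNet's exact architecture.

```python
# Tiny image-captioning-style model with spatial attention (illustrative, not MDNet).
import torch
import torch.nn as nn

class TinyImageReportModel(nn.Module):
    def __init__(self, vocab_size=100, feat_dim=64, hidden=128):
        super().__init__()
        self.hidden = hidden
        self.cnn = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU())
        self.embed = nn.Embedding(vocab_size, hidden)
        self.attn = nn.Linear(feat_dim + hidden, 1)
        self.lstm = nn.LSTMCell(feat_dim + hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, images, tokens):
        feats = self.cnn(images).flatten(2).transpose(1, 2)        # B x N x feat_dim
        h = images.new_zeros(images.size(0), self.hidden)
        c = torch.zeros_like(h)
        logits = []
        for t in range(tokens.size(1)):
            word = self.embed(tokens[:, t])                        # B x hidden
            h_exp = h.unsqueeze(1).expand(-1, feats.size(1), -1)   # B x N x hidden
            scores = self.attn(torch.cat([feats, h_exp], dim=-1)).squeeze(-1)
            alpha = torch.softmax(scores, dim=1)                   # attention over locations
            context = (alpha.unsqueeze(-1) * feats).sum(dim=1)     # attended image context
            h, c = self.lstm(torch.cat([context, word], dim=-1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                          # B x T x vocab

model = TinyImageReportModel()
logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 100, (2, 5)))
print(logits.shape)   # torch.Size([2, 5, 100])
```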
Citations: 261
RON: Reverse Connection with Objectness Prior Networks for Object Detection
Pub Date : 2017-07-06 DOI: 10.1109/CVPR.2017.557
Tao Kong, F. Sun, Anbang Yao, Huaping Liu, Ming Lu, Yurong Chen
We present RON, an efficient and effective framework for generic object detection. Our motivation is to smartly associate the best of the region-based (e.g., Faster R-CNN) and region-free (e.g., SSD) methodologies. Under a fully convolutional architecture, RON mainly focuses on two fundamental problems: (a) multi-scale object localization and (b) negative sample mining. To address (a), we design the reverse connection, which enables the network to detect objects on multiple levels of CNN feature maps. To deal with (b), we propose an objectness prior that significantly reduces the search space for objects. We optimize the reverse connection, objectness prior and object detector jointly by a multi-task loss function, thus RON can directly predict final detection results from all locations of various feature maps. Extensive experiments on the challenging PASCAL VOC 2007, PASCAL VOC 2012 and MS COCO benchmarks demonstrate the competitive performance of RON. Specifically, with VGG-16 and a low-resolution 384×384 input size, the network achieves 81.3% mAP on the PASCAL VOC 2007 dataset and 80.7% mAP on the PASCAL VOC 2012 dataset. Its superiority increases when datasets become larger and more difficult, as demonstrated by the results on the MS COCO dataset. With 1.5 GB of GPU memory at test time, the network runs at 15 FPS, 3 times faster than the Faster R-CNN counterpart. Code will be made publicly available.
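The reverse connection can be sketched as a small fusion block, assuming illustrative channel counts: the deeper feature map is upsampled with a transposed convolution and added to a 3x3 convolution of the current backbone map, giving a multi-scale map on which detection can operate.

```python
# One reverse-connection block fusing a deep map into a shallower one (illustrative config).
import torch
import torch.nn as nn

class ReverseConnection(nn.Module):
    def __init__(self, cur_ch, deep_ch, out_ch=128):
        super().__init__()
        self.lateral = nn.Conv2d(cur_ch, out_ch, 3, padding=1)       # process current backbone map
        self.top_down = nn.ConvTranspose2d(deep_ch, out_ch, 2, stride=2)  # upsample deeper map

    def forward(self, cur_map, deeper_map):
        return torch.relu(self.lateral(cur_map) + self.top_down(deeper_map))

block = ReverseConnection(cur_ch=256, deep_ch=512)
cur = torch.randn(1, 256, 38, 38)      # shallower backbone map
deep = torch.randn(1, 512, 19, 19)     # deeper backbone map
fused = block(cur, deep)
print(fused.shape)                     # torch.Size([1, 128, 38, 38])
```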
Citations: 384
Benchmarking Denoising Algorithms with Real Photographs
Pub Date : 2017-07-05 DOI: 10.1109/CVPR.2017.294
Tobias Plötz, S. Roth
Lacking realistic ground truth data, image denoising techniques are traditionally evaluated on images corrupted by synthesized i.i.d. Gaussian noise. We aim to obviate this unrealistic setting by developing a methodology for benchmarking denoising techniques on real photographs. We capture pairs of images with different ISO values and appropriately adjusted exposure times, where the nearly noise-free low-ISO image serves as reference. To derive the ground truth, careful post-processing is needed. We correct spatial misalignment, cope with inaccuracies in the exposure parameters through a linear intensity transform based on a novel heteroscedastic Tobit regression model, and remove residual low-frequency bias that stems, e.g., from minor illumination changes. We then capture a novel benchmark dataset, the Darmstadt Noise Dataset (DND), with consumer cameras of differing sensor sizes. One interesting finding is that various recent techniques that perform well on synthetic noise are clearly outperformed by BM3D on photographs with real noise. Our benchmark delineates realistic evaluation scenarios that deviate strongly from those commonly used in the scientific literature.
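The post-processing idea can be illustrated on synthetic data: fit a linear intensity transform that maps the low-ISO reference into the high-ISO exposure's intensity frame, then score a denoiser in PSNR against the transformed reference. Ordinary least squares stands in here for the paper's heteroscedastic Tobit regression, which additionally handles clipped pixels; all data below is synthetic.

```python
# Toy reference post-processing and PSNR evaluation (not the DND pipeline itself).
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
reference = rng.random((128, 128))                   # nearly noise-free low-ISO image
noisy = np.clip(1.05 * reference + 0.02 +
                0.05 * rng.normal(size=reference.shape), 0, 1)   # high-ISO exposure

# least-squares fit of gain/offset: noisy ~ a * reference + b
A = np.stack([reference.ravel(), np.ones(reference.size)], axis=1)
a, b = np.linalg.lstsq(A, noisy.ravel(), rcond=None)[0]
ground_truth = a * reference + b                     # reference in the noisy image's intensity frame

denoised = gaussian_filter(noisy, sigma=1.0)         # stand-in denoiser
mse = np.mean((denoised - ground_truth) ** 2)
print("PSNR (dB):", 10 * np.log10(1.0 / mse))
```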
Citations: 466
Discover and Learn New Objects from Documentaries
Pub Date : 2017-07-01 DOI: 10.1109/CVPR.2017.124
Kai Chen, Hang Song, Chen Change Loy, Dahua Lin
Despite the remarkable progress in recent years, detecting objects in a new context remains a challenging task. Detectors learned from a public dataset can only work with a fixed list of categories, while training from scratch usually requires a large amount of training data with detailed annotations. This work aims to explore a novel approach – learning object detectors from documentary films in a weakly supervised manner. This is inspired by the observation that documentaries often provide dedicated exposition of certain object categories, where visual presentations are aligned with subtitles. We believe that object detectors can be learned from such a rich source of information. Towards this goal, we develop a joint probabilistic framework, where individual pieces of information, including video frames and subtitles, are brought together via both visual and linguistic links. On top of this formulation, we further derive a weakly supervised learning algorithm, where object model learning and training set mining are unified in an optimization procedure. Experimental results on a real world dataset demonstrate that this is an effective approach to learning new object detectors.
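A toy sketch of the weak supervision signal, under assumed subtitle and timestamp formats: frames whose timestamps fall inside a subtitle mentioning the target category become weakly positive candidates from which training regions can be mined.

```python
# Mining weakly positive frames from subtitle text (illustrative matching rule).
subtitles = [
    (12.0, 16.5, "the gorilla moves slowly through the undergrowth"),
    (30.0, 34.0, "rivers carve the valley floor"),
    (51.5, 55.0, "a young gorilla watches from the branches"),
]
frame_times = [float(t) for t in range(0, 60, 2)]    # one sampled frame every 2 seconds

def weak_positive_frames(category, subs, times):
    """Return frame timestamps covered by a subtitle that mentions `category`."""
    spans = [(s, e) for s, e, text in subs if category in text.lower()]
    return [t for t in times if any(s <= t <= e for s, e in spans)]

print(weak_positive_frames("gorilla", subtitles, frame_times))
# [12.0, 14.0, 16.0, 52.0, 54.0]
```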
Citations: 19
Learning to Align Semantic Segmentation and 2.5D Maps for Geolocalization
Pub Date : 2017-07-01 DOI: 10.1109/CVPR.2017.488
Anil Armagan, Martin Hirzer, P. Roth, V. Lepetit
We present an efficient method for geolocalization in urban environments starting from a coarse estimate of the location provided by a GPS and using a simple untextured 2.5D model of the surrounding buildings. Our key contribution is a novel efficient and robust method to optimize the pose: We train a Deep Network to predict the best direction to improve a pose estimate, given a semantic segmentation of the input image and a rendering of the buildings from this estimate. We then iteratively apply this CNN until converging to a good pose. This approach avoids the use of reference images of the surroundings, which are difficult to acquire and match, while 2.5D models are broadly available. We can therefore apply it to places unseen during training.
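The iterative refinement loop can be caricatured with a greedy toy, assuming a synthetic rendering function: from a coarse pose, repeatedly take the unit move that best increases overlap between the rendered building mask and the segmented view, until no move helps. The actual method replaces the greedy scorer below with a trained CNN that predicts the best direction from the segmentation and the rendering.

```python
# Greedy stand-in for the iterative pose-refinement loop (masks and moves are illustrative).
import numpy as np

def render_mask(pose, size=64):
    """Render a square 'building footprint' at the given integer (x, y) pose."""
    m = np.zeros((size, size), dtype=bool)
    x, y = pose
    m[max(0, y):y + 20, max(0, x):x + 20] = True
    return m

target = render_mask((30, 25))                       # segmentation of the true view
pose = np.array([22, 33])                            # coarse GPS-like initial pose
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]

for _ in range(100):
    overlaps = [np.sum(render_mask(pose + np.array(m)) & target) for m in moves]
    if max(overlaps) <= np.sum(render_mask(pose) & target):
        break                                        # converged: no move improves alignment
    pose = pose + np.array(moves[int(np.argmax(overlaps))])

print("refined pose:", pose)                         # approaches (30, 25)
```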
Citations: 30
Video Segmentation via Multiple Granularity Analysis
Pub Date : 2017-07-01 DOI: 10.1109/CVPR.2017.676
Rui Yang, Bingbing Ni, Chao Ma, Yi Xu, Xiaokang Yang
We introduce a Multiple Granularity Analysis framework for video segmentation in a coarse-to-fine manner. We cast video segmentation as a spatio-temporal superpixel labeling problem. Benefiting from the bounding volume provided by off-the-shelf object trackers, we estimate the foreground/background superpixel labeling using a spatio-temporal multiple instance learning algorithm to obtain a coarse foreground/background separation within the volume. We further refine the segmentation mask at the pixel level using a graph-cut model. Extensive experiments on benchmark video datasets demonstrate the superior performance of the proposed video segmentation algorithm.
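The coarse stage can be sketched as follows, assuming SLIC superpixels and a synthetic frame: superpixels whose centroids fall inside the tracker's bounding box become foreground candidates, everything else background. The paper then scores candidates with spatio-temporal multiple instance learning and refines the mask with a pixel-level graph cut; the data and the SLIC step here are illustrative.

```python
# Coarse foreground/background labeling of superpixels from a tracker box (toy sketch).
import numpy as np
from skimage.segmentation import slic

rng = np.random.default_rng(0)
frame = rng.random((120, 160, 3))                  # one video frame (synthetic)
track_box = (30, 40, 80, 110)                      # (y0, x0, y1, x1) from an off-the-shelf tracker

labels = slic(frame, n_segments=200, start_label=0)
fg_superpixels = []
for sp in np.unique(labels):
    ys, xs = np.nonzero(labels == sp)
    cy, cx = ys.mean(), xs.mean()                  # superpixel centroid
    if track_box[0] <= cy <= track_box[2] and track_box[1] <= cx <= track_box[3]:
        fg_superpixels.append(int(sp))

coarse_mask = np.isin(labels, fg_superpixels)      # coarse foreground/background split
print(len(fg_superpixels), "foreground superpixels")
```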
Citations: 10