J. D. Wegner, Steve Branson, David Hall, K. Schindler, P. Perona
Each corner of the inhabited world is imaged from multiple viewpoints with increasing frequency. Online map services like Google Maps or Here Maps provide direct access to huge amounts of densely sampled, georeferenced images from street view and aerial perspective. There is an opportunity to design computer vision systems that will help us search, catalog and monitor public infrastructure, buildings and artifacts. We explore the architecture and feasibility of such a system. The main technical challenge is combining test-time information from multiple views of each geographic location (e.g., aerial and street views). We implement two modules: det2geo, which detects the set of locations of objects belonging to a given category, and geo2cat, which computes the fine-grained category of the object at a given location. We introduce a solution that adapts state-of-the-art CNN-based object detectors and classifiers. We test our method on "Pasadena Urban Trees", a new dataset of 80,000 trees with geographic and species annotations, and show that combining multiple views significantly improves both tree detection and tree species classification, rivaling human performance.
{"title":"Cataloging Public Objects Using Aerial and Street-Level Images — Urban Trees","authors":"J. D. Wegner, Steve Branson, David Hall, K. Schindler, P. Perona","doi":"10.1109/CVPR.2016.647","DOIUrl":"https://doi.org/10.1109/CVPR.2016.647","url":null,"abstract":"Each corner of the inhabited world is imaged from multiple viewpoints with increasing frequency. Online map services like Google Maps or Here Maps provide direct access to huge amounts of densely sampled, georeferenced images from street view and aerial perspective. There is an opportunity to design computer vision systems that will help us search, catalog and monitor public infrastructure, buildings and artifacts. We explore the architecture and feasibility of such a system. The main technical challenge is combining test time information from multiple views of each geographic location (e.g., aerial and street views). We implement two modules: det2geo, which detects the set of locations of objects belonging to a given category, and geo2cat, which computes the fine-grained category of the object at a given location. We introduce a solution that adapts state-of the-art CNN-based object detectors and classifiers. We test our method on \"Pasadena Urban Trees\", a new dataset of 80,000 trees with geographic and species annotations, and show that combining multiple views significantly improves both tree detection and tree species classification, rivaling human performance.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"44 1","pages":"6014-6023"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82164718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jie Feng, Brian L. Price, Scott D. Cohen, Shih-Fu Chang
Interactive image segmentation is an important problem in computer vision with many applications, including image editing, object recognition and image retrieval. Most existing interactive segmentation methods only operate on color images. Until recently, very few works have been proposed to leverage depth information from low-cost sensors to improve interactive segmentation. While these methods achieve better results than color-based methods, they are still limited to either using depth as an additional color channel or simply combining depth with color in a linear way. We propose a novel interactive segmentation algorithm which can incorporate multiple feature cues like color, depth, and normals in a unified graph-cut framework to leverage these cues more effectively. A key contribution of our method is that it automatically selects a single cue to be used at each pixel, based on the intuition that only one cue is necessary to determine the segmentation label locally. This is achieved by optimizing over both segmentation labels and cue labels, using terms designed to decide where both the segmentation and the cue labels should change. Our algorithm thus produces not only the segmentation mask but also a cue label map that indicates where each cue contributes to the final result. Extensive experiments on five large-scale RGBD datasets show that our proposed algorithm performs significantly better than other color-based and RGBD-based algorithms in reducing the amount of user input as well as increasing segmentation accuracy.
{"title":"Interactive Segmentation on RGBD Images via Cue Selection","authors":"Jie Feng, Brian L. Price, Scott D. Cohen, Shih-Fu Chang","doi":"10.1109/CVPR.2016.24","DOIUrl":"https://doi.org/10.1109/CVPR.2016.24","url":null,"abstract":"Interactive image segmentation is an important problem in computer vision with many applications including image editing, object recognition and image retrieval. Most existing interactive segmentation methods only operate on color images. Until recently, very few works have been proposed to leverage depth information from low-cost sensors to improve interactive segmentation. While these methods achieve better results than color-based methods, they are still limited in either using depth as an additional color channel or simply combining depth with color in a linear way. We propose a novel interactive segmentation algorithm which can incorporate multiple feature cues like color, depth, and normals in an unified graph cut framework to leverage these cues more effectively. A key contribution of our method is that it automatically selects a single cue to be used at each pixel, based on the intuition that only one cue is necessary to determine the segmentation label locally. This is achieved by optimizing over both segmentation labels and cue labels, using terms designed to decide where both the segmentation and label cues should change. Our algorithm thus produces not only the segmentation mask but also a cue label map that indicates where each cue contributes to the final result. Extensive experiments on five large scale RGBD datasets show that our proposed algorithm performs significantly better than both other color-based and RGBD based algorithms in reducing the amount of user inputs as well as increasing segmentation accuracy.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"46 1","pages":"156-164"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85763239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large-scale action recognition and video categorization are important problems in computer vision. To address these problems, we propose a novel object- and scene-based semantic fusion network and representation. Our semantic fusion network combines three streams of information using a three-layer neural network: (i) frame-based low-level CNN features, (ii) object features from a state-of-the-art large-scale CNN object detector trained to recognize 20K classes, and (iii) scene features from a state-of-the-art CNN scene detector trained to recognize 205 scenes. The trained network achieves improvements in supervised activity and video categorization on two complex large-scale datasets, ActivityNet and FCVID, respectively. Further, by examining and back-propagating information through the fusion network, semantic relationships (correlations) between video classes and objects/scenes can be discovered. These video class-object/video class-scene relationships can in turn be used as a semantic representation for the video classes themselves. We illustrate the effectiveness of this semantic representation through experiments on zero-shot action/video classification and clustering.
{"title":"Harnessing Object and Scene Semantics for Large-Scale Video Understanding","authors":"Zuxuan Wu, Yanwei Fu, Yu-Gang Jiang, L. Sigal","doi":"10.1109/CVPR.2016.339","DOIUrl":"https://doi.org/10.1109/CVPR.2016.339","url":null,"abstract":"Large-scale action recognition and video categorization are important problems in computer vision. To address these problems, we propose a novel object-and scene-based semantic fusion network and representation. Our semantic fusion network combines three streams of information using a three-layer neural network: (i) frame-based low-level CNN features, (ii) object features from a state-of-the-art large-scale CNN object-detector trained to recognize 20K classes, and (iii) scene features from a state-of-the-art CNN scene-detector trained to recognize 205 scenes. The trained network achieves improvements in supervised activity and video categorization in two complex large-scale datasets - ActivityNet and FCVID, respectively. Further, by examining and back propagating information through the fusion network, semantic relationships (correlations) between video classes and objects/scenes can be discovered. These video class-object/video class-scene relationships can in turn be used as semantic representation for the video classes themselves. We illustrate effectiveness of this semantic representation through experiments on zero-shot action/video classification and clustering.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"7 1","pages":"3112-3121"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80779571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Blaha, Christoph Vogel, Audrey Richard, J. D. Wegner, T. Pock, K. Schindler
We propose an adaptive multi-resolution formulation of semantic 3D reconstruction. Given a set of images of a scene, semantic 3D reconstruction aims to densely reconstruct both the 3D shape of the scene and a segmentation into semantic object classes. Jointly reasoning about shape and class allows one to take into account class-specific shape priors (e.g., building walls should be smooth and vertical and, vice versa, smooth vertical surfaces are likely to be building walls), leading to improved reconstruction results. So far, semantic 3D reconstruction methods have been limited to small scenes and low resolution because of their large memory footprint and computational cost. To scale them up to large scenes, we propose a hierarchical scheme which refines the reconstruction only in regions that are likely to contain a surface, exploiting the fact that both high spatial resolution and high numerical precision are only required in those regions. Our scheme amounts to solving a sequence of convex optimizations while progressively removing constraints, in such a way that the energy, in each iteration, is the tightest possible approximation of the underlying energy at full resolution. In our experiments the method saves up to 98% memory and 95% computation time, without any loss of accuracy.
{"title":"Large-Scale Semantic 3D Reconstruction: An Adaptive Multi-resolution Model for Multi-class Volumetric Labeling","authors":"M. Blaha, Christoph Vogel, Audrey Richard, J. D. Wegner, T. Pock, K. Schindler","doi":"10.1109/CVPR.2016.346","DOIUrl":"https://doi.org/10.1109/CVPR.2016.346","url":null,"abstract":"We propose an adaptive multi-resolution formulation of semantic 3D reconstruction. Given a set of images of a scene, semantic 3D reconstruction aims to densely reconstruct both the 3D shape of the scene and a segmentation into semantic object classes. Jointly reasoning about shape and class allows one to take into account class-specific shape priors (e.g., building walls should be smooth and vertical, and vice versa smooth, vertical surfaces are likely to be building walls), leading to improved reconstruction results. So far, semantic 3D reconstruction methods have been limited to small scenes and low resolution, because of their large memory footprint and computational cost. To scale them up to large scenes, we propose a hierarchical scheme which refines the reconstruction only in regions that are likely to contain a surface, exploiting the fact that both high spatial resolution and high numerical precision are only required in those regions. Our scheme amounts to solving a sequence of convex optimizations while progressively removing constraints, in such a way that the energy, in each iteration, is the tightest possible approximation of the underlying energy at full resolution. In our experiments the method saves up to 98% memory and 95% computation time, without any loss of accuracy.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"27 1","pages":"3176-3184"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78492971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiansheng Chen, Gaocheng Bai, Shaoheng Liang, Zhengqin Li
Attention-based automatic image cropping aims at preserving the most visually important region in an image. A common task in this kind of method is to search for the smallest rectangle inside which the summed attention is maximized. We demonstrate that under appropriate formulations, this task can be achieved using efficient algorithms with low computational complexity. In a practically useful scenario where the aspect ratio of the cropping rectangle is given, the problem can be solved with a computational complexity linear in the number of image pixels. We also study the possibility of multiple-rectangle cropping and a new model facilitating fully automated image cropping.
{"title":"Automatic Image Cropping: A Computational Complexity Study","authors":"Jiansheng Chen, Gaocheng Bai, Shaoheng Liang, Zhengqin Li","doi":"10.1109/CVPR.2016.61","DOIUrl":"https://doi.org/10.1109/CVPR.2016.61","url":null,"abstract":"Attention based automatic image cropping aims at preserving the most visually important region in an image. A common task in this kind of method is to search for the smallest rectangle inside which the summed attention is maximized. We demonstrate that under appropriate formulations, this task can be achieved using efficient algorithms with low computational complexity. In a practically useful scenario where the aspect ratio of the cropping rectangle is given, the problem can be solved with a computational complexity linear to the number of image pixels. We also study the possibility of multiple rectangle cropping and a new model facilitating fully automated image cropping.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"20 1","pages":"507-515"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80317710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Face alignment, or facial landmark detection, plays an important role in many computer vision applications, e.g., face recognition, facial expression recognition, face animation, etc. However, the performance of face alignment systems degrades severely when occlusions occur. In this work, we propose a novel face alignment method which cascades several Deep Regression networks coupled with De-corrupt Autoencoders (denoted as DRDA) to explicitly handle the partial occlusion problem. Unlike previous works, which can only detect occlusions and discard the occluded parts, our proposed de-corrupt autoencoder network can automatically recover the genuine appearance of the occluded parts, and the recovered parts can be leveraged together with the non-occluded parts for more accurate alignment. By coupling de-corrupt autoencoders with deep regression networks, a deep alignment model robust to partial occlusions is achieved. Besides, our method can localize occluded regions rather than merely predict whether the landmarks are occluded. Experiments on two challenging occluded face datasets demonstrate that our method significantly outperforms the state-of-the-art methods.
{"title":"Occlusion-Free Face Alignment: Deep Regression Networks Coupled with De-Corrupt AutoEncoders","authors":"Jie Zhang, Meina Kan, S. Shan, Xilin Chen","doi":"10.1109/CVPR.2016.373","DOIUrl":"https://doi.org/10.1109/CVPR.2016.373","url":null,"abstract":"Face alignment or facial landmark detection plays an important role in many computer vision applications, e.g., face recognition, facial expression recognition, face animation, etc. However, the performance of face alignment system degenerates severely when occlusions occur. In this work, we propose a novel face alignment method, which cascades several Deep Regression networks coupled with De-corrupt Autoencoders (denoted as DRDA) to explicitly handle partial occlusion problem. Different from the previous works that can only detect occlusions and discard the occluded parts, our proposed de-corrupt autoencoder network can automatically recover the genuine appearance for the occluded parts and the recovered parts can be leveraged together with those non-occluded parts for more accurate alignment. By coupling de-corrupt autoencoders with deep regression networks, a deep alignment model robust to partial occlusions is achieved. Besides, our method can localize occluded regions rather than merely predict whether the landmarks are occluded. Experiments on two challenging occluded face datasets demonstrate that our method significantly outperforms the state-of-the-art methods.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"1 1","pages":"3428-3437"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88665240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma, S. Fidler, R. Urtasun
The goal of this paper is to perform 3D object detection from a single monocular image in the domain of autonomous driving. Our method first aims to generate a set of candidate class-specific object proposals, which are then run through a standard CNN pipeline to obtain high-quality object detections. The focus of this paper is on proposal generation. In particular, we propose an energy minimization approach that places object candidates in 3D using the fact that objects should be on the ground plane. We then score each candidate box projected to the image plane via several intuitive potentials encoding semantic segmentation, contextual information, size and location priors, and typical object shape. Our experimental evaluation demonstrates that our object proposal generation approach significantly outperforms all monocular approaches and achieves the best detection performance among published monocular competitors on the challenging KITTI benchmark.
{"title":"Monocular 3D Object Detection for Autonomous Driving","authors":"Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma, S. Fidler, R. Urtasun","doi":"10.1109/CVPR.2016.236","DOIUrl":"https://doi.org/10.1109/CVPR.2016.236","url":null,"abstract":"The goal of this paper is to perform 3D object detection from a single monocular image in the domain of autonomous driving. Our method first aims to generate a set of candidate class-specific object proposals, which are then run through a standard CNN pipeline to obtain high-quality object detections. The focus of this paper is on proposal generation. In particular, we propose an energy minimization approach that places object candidates in 3D using the fact that objects should be on the ground-plane. We then score each candidate box projected to the image plane via several intuitive potentials encoding semantic segmentation, contextual information, size and location priors and typical object shape. Our experimental evaluation demonstrates that our object proposal generation approach significantly outperforms all monocular approaches, and achieves the best detection performance on the challenging KITTI benchmark, among published monocular competitors.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"43 1","pages":"2147-2156"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90595389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We address the problem of fine-grained vehicle make & model recognition and verification. Our contribution is to show that extracting additional data from the video stream - besides the vehicle image itself - and feeding it into a deep convolutional neural network boosts the recognition performance considerably. This additional information includes the 3D vehicle bounding box used for "unpacking" the vehicle image, its rasterized low-resolution shape, and the 3D vehicle orientation. Experiments show that adding such information decreases classification error by 26% (the accuracy is improved from 0.772 to 0.832) and boosts verification average precision by 208% (0.378 to 0.785) compared to a baseline pure CNN without any input modifications. The pure baseline CNN also outperforms a recent state-of-the-art solution by 0.081. We provide "BoxCars", an annotated set of surveillance vehicle images augmented with various automatically extracted auxiliary information. Our approach and the dataset can considerably improve the performance of traffic surveillance systems.
{"title":"BoxCars: 3D Boxes as CNN Input for Improved Fine-Grained Vehicle Recognition","authors":"Jakub Sochor, A. Herout, Jirí Havel","doi":"10.1109/CVPR.2016.328","DOIUrl":"https://doi.org/10.1109/CVPR.2016.328","url":null,"abstract":"We are dealing with the problem of fine-grained vehicle make&model recognition and verification. Our contribution is showing that extracting additional data from the video stream - besides the vehicle image itself - and feeding it into the deep convolutional neural network boosts the recognition performance considerably. This additional information includes: 3D vehicle bounding box used for \"unpacking\" the vehicle image, its rasterized low-resolution shape, and information about the 3D vehicle orientation. Experiments show that adding such information decreases classification error by 26% (the accuracy is improved from 0.772 to 0.832) and boosts verification average precision by 208% (0.378 to 0.785) compared to baseline pure CNN without any input modifications. Also, the pure baseline CNN outperforms the recent state of the art solution by 0.081. We provide an annotated set \"BoxCars\" of surveillance vehicle images augmented by various automatically extracted auxiliary information. Our approach and the dataset can considerably improve the performance of traffic surveillance systems.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"4 1","pages":"3006-3015"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86921586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a system that finds text in natural scenes using a variety of cues. Our novel data-driven method incorporates coarse-to-fine detection of character pixels using convolutional features (Text-Conv), followed by extracting connected components (CCs) from characters using edge and color features, and finally performing a graph-based segmentation of CCs into words (Word-Graph). For Text-Conv, the initial detection is based on convolutional feature maps similar to those used in Convolutional Neural Networks (CNNs), but learned using Convolutional k-means. Convolution masks defined by local and neighboring patch features are used to improve detection accuracy. The Word-Graph algorithm uses contextual information to both improve word segmentation and prune false character/word detections. Different definitions of foreground (text) regions are used to train the detection stages, some based on bounding box intersection, and others on bounding box and pixel intersection. Our system obtains pixel, character, and word detection f-measures of 93.14%, 90.26%, and 86.77%, respectively, on the ICDAR 2015 Robust Reading Focused Scene Text dataset, outperforming state-of-the-art systems. This approach may also work for other detection targets with homogeneous color in natural scenes.
{"title":"A Text Detection System for Natural Scenes with Convolutional Feature Learning and Cascaded Classification","authors":"Siyu Zhu, R. Zanibbi","doi":"10.1109/CVPR.2016.74","DOIUrl":"https://doi.org/10.1109/CVPR.2016.74","url":null,"abstract":"We propose a system that finds text in natural scenes using a variety of cues. Our novel data-driven method incorporates coarse-to-fine detection of character pixels using convolutional features (Text-Conv), followed by extracting connected components (CCs) from characters using edge and color features, and finally performing a graph-based segmentation of CCs into words (Word-Graph). For Text-Conv, the initial detection is based on convolutional feature maps similar to those used in Convolutional Neural Networks (CNNs), but learned using Convolutional k-means. Convolution masks defined by local and neighboring patch features are used to improve detection accuracy. The Word-Graph algorithm uses contextual information to both improve word segmentation and prune false character/word detections. Different definitions for foreground (text) regions are used to train the detection stages, some based on bounding box intersection, and others on bounding box and pixel intersection. Our system obtains pixel, character, and word detection f-measures of 93.14%, 90.26%, and 86.77% respectively for the ICDAR 2015 Robust Reading Focused Scene Text dataset, out-performing state-of-the-art systems. This approach may work for other detection targets with homogenous color in natural scenes.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"110 1","pages":"625-632"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86740293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Saumitro Dasgupta, Kuan Fang, Kevin Chen, S. Savarese
We consider the problem of estimating the spatial layout of an indoor scene from a monocular RGB image, modeled as the projection of a 3D cuboid. Existing solutions to this problem often rely strongly on hand-engineered features and vanishing point detection, which are prone to failure in the presence of clutter. In this paper, we present a method that uses a fully convolutional neural network (FCNN) in conjunction with a novel optimization framework for generating layout estimates. We demonstrate that our method is robust in the presence of clutter and handles a wide range of highly challenging scenes. We evaluate our method on two standard benchmarks and show that it achieves state-of-the-art results, outperforming previous methods by a wide margin.
{"title":"DeLay: Robust Spatial Layout Estimation for Cluttered Indoor Scenes","authors":"Saumitro Dasgupta, Kuan Fang, Kevin Chen, S. Savarese","doi":"10.1109/CVPR.2016.73","DOIUrl":"https://doi.org/10.1109/CVPR.2016.73","url":null,"abstract":"We consider the problem of estimating the spatial layout of an indoor scene from a monocular RGB image, modeled as the projection of a 3D cuboid. Existing solutions to this problem often rely strongly on hand-engineered features and vanishing point detection, which are prone to failure in the presence of clutter. In this paper, we present a method that uses a fully convolutional neural network (FCNN) in conjunction with a novel optimization framework for generating layout estimates. We demonstrate that our method is robust in the presence of clutter and handles a wide range of highly challenging scenes. We evaluate our method on two standard benchmarks and show that it achieves state of the art results, outperforming previous methods by a wide margin.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"6 1","pages":"616-624"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84770744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}