J. D. Wegner, Steve Branson, David Hall, K. Schindler, P. Perona
Each corner of the inhabited world is imaged from multiple viewpoints with increasing frequency. Online map services like Google Maps or Here Maps provide direct access to huge amounts of densely sampled, georeferenced images from street view and aerial perspective. There is an opportunity to design computer vision systems that will help us search, catalog and monitor public infrastructure, buildings and artifacts. We explore the architecture and feasibility of such a system. The main technical challenge is combining test-time information from multiple views of each geographic location (e.g., aerial and street views). We implement two modules: det2geo, which detects the set of locations of objects belonging to a given category, and geo2cat, which computes the fine-grained category of the object at a given location. We introduce a solution that adapts state-of-the-art CNN-based object detectors and classifiers. We test our method on "Pasadena Urban Trees", a new dataset of 80,000 trees with geographic and species annotations, and show that combining multiple views significantly improves both tree detection and tree species classification, rivaling human performance.
{"title":"Cataloging Public Objects Using Aerial and Street-Level Images — Urban Trees","authors":"J. D. Wegner, Steve Branson, David Hall, K. Schindler, P. Perona","doi":"10.1109/CVPR.2016.647","DOIUrl":"https://doi.org/10.1109/CVPR.2016.647","url":null,"abstract":"Each corner of the inhabited world is imaged from multiple viewpoints with increasing frequency. Online map services like Google Maps or Here Maps provide direct access to huge amounts of densely sampled, georeferenced images from street view and aerial perspective. There is an opportunity to design computer vision systems that will help us search, catalog and monitor public infrastructure, buildings and artifacts. We explore the architecture and feasibility of such a system. The main technical challenge is combining test time information from multiple views of each geographic location (e.g., aerial and street views). We implement two modules: det2geo, which detects the set of locations of objects belonging to a given category, and geo2cat, which computes the fine-grained category of the object at a given location. We introduce a solution that adapts state-of the-art CNN-based object detectors and classifiers. We test our method on \"Pasadena Urban Trees\", a new dataset of 80,000 trees with geographic and species annotations, and show that combining multiple views significantly improves both tree detection and tree species classification, rivaling human performance.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"44 1","pages":"6014-6023"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82164718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jie Feng, Brian L. Price, Scott D. Cohen, Shih-Fu Chang
Interactive image segmentation is an important problem in computer vision with many applications, including image editing, object recognition and image retrieval. Most existing interactive segmentation methods only operate on color images. Until recently, very few works have been proposed to leverage depth information from low-cost sensors to improve interactive segmentation. While these methods achieve better results than color-based methods, they are still limited to either using depth as an additional color channel or simply combining depth with color in a linear way. We propose a novel interactive segmentation algorithm which can incorporate multiple feature cues like color, depth, and normals in a unified graph-cut framework to leverage these cues more effectively. A key contribution of our method is that it automatically selects a single cue to be used at each pixel, based on the intuition that only one cue is necessary to determine the segmentation label locally. This is achieved by optimizing over both segmentation labels and cue labels, using terms designed to decide where both the segmentation and the cue labels should change. Our algorithm thus produces not only the segmentation mask but also a cue label map that indicates where each cue contributes to the final result. Extensive experiments on five large-scale RGBD datasets show that our proposed algorithm performs significantly better than other color-based and RGBD-based algorithms in reducing the amount of user input as well as increasing segmentation accuracy.
{"title":"Interactive Segmentation on RGBD Images via Cue Selection","authors":"Jie Feng, Brian L. Price, Scott D. Cohen, Shih-Fu Chang","doi":"10.1109/CVPR.2016.24","DOIUrl":"https://doi.org/10.1109/CVPR.2016.24","url":null,"abstract":"Interactive image segmentation is an important problem in computer vision with many applications including image editing, object recognition and image retrieval. Most existing interactive segmentation methods only operate on color images. Until recently, very few works have been proposed to leverage depth information from low-cost sensors to improve interactive segmentation. While these methods achieve better results than color-based methods, they are still limited in either using depth as an additional color channel or simply combining depth with color in a linear way. We propose a novel interactive segmentation algorithm which can incorporate multiple feature cues like color, depth, and normals in an unified graph cut framework to leverage these cues more effectively. A key contribution of our method is that it automatically selects a single cue to be used at each pixel, based on the intuition that only one cue is necessary to determine the segmentation label locally. This is achieved by optimizing over both segmentation labels and cue labels, using terms designed to decide where both the segmentation and label cues should change. Our algorithm thus produces not only the segmentation mask but also a cue label map that indicates where each cue contributes to the final result. Extensive experiments on five large scale RGBD datasets show that our proposed algorithm performs significantly better than both other color-based and RGBD based algorithms in reducing the amount of user inputs as well as increasing segmentation accuracy.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"46 1","pages":"156-164"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85763239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large-scale action recognition and video categorization are important problems in computer vision. To address these problems, we propose a novel object- and scene-based semantic fusion network and representation. Our semantic fusion network combines three streams of information using a three-layer neural network: (i) frame-based low-level CNN features, (ii) object features from a state-of-the-art large-scale CNN object detector trained to recognize 20K classes, and (iii) scene features from a state-of-the-art CNN scene detector trained to recognize 205 scenes. The trained network achieves improvements in supervised activity and video categorization on two complex large-scale datasets, ActivityNet and FCVID, respectively. Further, by examining and back-propagating information through the fusion network, semantic relationships (correlations) between video classes and objects/scenes can be discovered. These video class-object/video class-scene relationships can in turn be used as a semantic representation for the video classes themselves. We illustrate the effectiveness of this semantic representation through experiments on zero-shot action/video classification and clustering.
{"title":"Harnessing Object and Scene Semantics for Large-Scale Video Understanding","authors":"Zuxuan Wu, Yanwei Fu, Yu-Gang Jiang, L. Sigal","doi":"10.1109/CVPR.2016.339","DOIUrl":"https://doi.org/10.1109/CVPR.2016.339","url":null,"abstract":"Large-scale action recognition and video categorization are important problems in computer vision. To address these problems, we propose a novel object-and scene-based semantic fusion network and representation. Our semantic fusion network combines three streams of information using a three-layer neural network: (i) frame-based low-level CNN features, (ii) object features from a state-of-the-art large-scale CNN object-detector trained to recognize 20K classes, and (iii) scene features from a state-of-the-art CNN scene-detector trained to recognize 205 scenes. The trained network achieves improvements in supervised activity and video categorization in two complex large-scale datasets - ActivityNet and FCVID, respectively. Further, by examining and back propagating information through the fusion network, semantic relationships (correlations) between video classes and objects/scenes can be discovered. These video class-object/video class-scene relationships can in turn be used as semantic representation for the video classes themselves. We illustrate effectiveness of this semantic representation through experiments on zero-shot action/video classification and clustering.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"7 1","pages":"3112-3121"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80779571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Blaha, Christoph Vogel, Audrey Richard, J. D. Wegner, T. Pock, K. Schindler
We propose an adaptive multi-resolution formulation of semantic 3D reconstruction. Given a set of images of a scene, semantic 3D reconstruction aims to densely reconstruct both the 3D shape of the scene and a segmentation into semantic object classes. Jointly reasoning about shape and class allows one to take into account class-specific shape priors (e.g., building walls should be smooth and vertical and, vice versa, smooth vertical surfaces are likely to be building walls), leading to improved reconstruction results. So far, semantic 3D reconstruction methods have been limited to small scenes and low resolution because of their large memory footprint and computational cost. To scale them up to large scenes, we propose a hierarchical scheme which refines the reconstruction only in regions that are likely to contain a surface, exploiting the fact that both high spatial resolution and high numerical precision are only required in those regions. Our scheme amounts to solving a sequence of convex optimizations while progressively removing constraints, in such a way that the energy, in each iteration, is the tightest possible approximation of the underlying energy at full resolution. In our experiments the method saves up to 98% memory and 95% computation time, without any loss of accuracy.
{"title":"Large-Scale Semantic 3D Reconstruction: An Adaptive Multi-resolution Model for Multi-class Volumetric Labeling","authors":"M. Blaha, Christoph Vogel, Audrey Richard, J. D. Wegner, T. Pock, K. Schindler","doi":"10.1109/CVPR.2016.346","DOIUrl":"https://doi.org/10.1109/CVPR.2016.346","url":null,"abstract":"We propose an adaptive multi-resolution formulation of semantic 3D reconstruction. Given a set of images of a scene, semantic 3D reconstruction aims to densely reconstruct both the 3D shape of the scene and a segmentation into semantic object classes. Jointly reasoning about shape and class allows one to take into account class-specific shape priors (e.g., building walls should be smooth and vertical, and vice versa smooth, vertical surfaces are likely to be building walls), leading to improved reconstruction results. So far, semantic 3D reconstruction methods have been limited to small scenes and low resolution, because of their large memory footprint and computational cost. To scale them up to large scenes, we propose a hierarchical scheme which refines the reconstruction only in regions that are likely to contain a surface, exploiting the fact that both high spatial resolution and high numerical precision are only required in those regions. Our scheme amounts to solving a sequence of convex optimizations while progressively removing constraints, in such a way that the energy, in each iteration, is the tightest possible approximation of the underlying energy at full resolution. In our experiments the method saves up to 98% memory and 95% computation time, without any loss of accuracy.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"27 1","pages":"3176-3184"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78492971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiansheng Chen, Gaocheng Bai, Shaoheng Liang, Zhengqin Li
Attention-based automatic image cropping aims at preserving the most visually important region in an image. A common task in this kind of method is to search for the smallest rectangle inside which the summed attention is maximized. We demonstrate that under appropriate formulations, this task can be achieved using efficient algorithms with low computational complexity. In a practically useful scenario where the aspect ratio of the cropping rectangle is given, the problem can be solved with a computational complexity linear in the number of image pixels. We also study the possibility of multiple-rectangle cropping and a new model facilitating fully automated image cropping.
{"title":"Automatic Image Cropping: A Computational Complexity Study","authors":"Jiansheng Chen, Gaocheng Bai, Shaoheng Liang, Zhengqin Li","doi":"10.1109/CVPR.2016.61","DOIUrl":"https://doi.org/10.1109/CVPR.2016.61","url":null,"abstract":"Attention based automatic image cropping aims at preserving the most visually important region in an image. A common task in this kind of method is to search for the smallest rectangle inside which the summed attention is maximized. We demonstrate that under appropriate formulations, this task can be achieved using efficient algorithms with low computational complexity. In a practically useful scenario where the aspect ratio of the cropping rectangle is given, the problem can be solved with a computational complexity linear to the number of image pixels. We also study the possibility of multiple rectangle cropping and a new model facilitating fully automated image cropping.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"20 1","pages":"507-515"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80317710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Face alignment, or facial landmark detection, plays an important role in many computer vision applications, e.g., face recognition, facial expression recognition, face animation, etc. However, the performance of face alignment systems degrades severely when occlusions occur. In this work, we propose a novel face alignment method which cascades several Deep Regression networks coupled with De-corrupt Autoencoders (denoted as DRDA) to explicitly handle the partial occlusion problem. Unlike previous works, which can only detect occlusions and discard the occluded parts, our proposed de-corrupt autoencoder network can automatically recover the genuine appearance of the occluded parts, and the recovered parts can be leveraged together with the non-occluded parts for more accurate alignment. By coupling de-corrupt autoencoders with deep regression networks, a deep alignment model robust to partial occlusions is achieved. Besides, our method can localize occluded regions rather than merely predict whether the landmarks are occluded. Experiments on two challenging occluded face datasets demonstrate that our method significantly outperforms the state-of-the-art methods.
{"title":"Occlusion-Free Face Alignment: Deep Regression Networks Coupled with De-Corrupt AutoEncoders","authors":"Jie Zhang, Meina Kan, S. Shan, Xilin Chen","doi":"10.1109/CVPR.2016.373","DOIUrl":"https://doi.org/10.1109/CVPR.2016.373","url":null,"abstract":"Face alignment or facial landmark detection plays an important role in many computer vision applications, e.g., face recognition, facial expression recognition, face animation, etc. However, the performance of face alignment system degenerates severely when occlusions occur. In this work, we propose a novel face alignment method, which cascades several Deep Regression networks coupled with De-corrupt Autoencoders (denoted as DRDA) to explicitly handle partial occlusion problem. Different from the previous works that can only detect occlusions and discard the occluded parts, our proposed de-corrupt autoencoder network can automatically recover the genuine appearance for the occluded parts and the recovered parts can be leveraged together with those non-occluded parts for more accurate alignment. By coupling de-corrupt autoencoders with deep regression networks, a deep alignment model robust to partial occlusions is achieved. Besides, our method can localize occluded regions rather than merely predict whether the landmarks are occluded. Experiments on two challenging occluded face datasets demonstrate that our method significantly outperforms the state-of-the-art methods.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"1 1","pages":"3428-3437"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88665240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma, S. Fidler, R. Urtasun
The goal of this paper is to perform 3D object detection from a single monocular image in the domain of autonomous driving. Our method first aims to generate a set of candidate class-specific object proposals, which are then run through a standard CNN pipeline to obtain high-quality object detections. The focus of this paper is on proposal generation. In particular, we propose an energy minimization approach that places object candidates in 3D using the fact that objects should be on the ground plane. We then score each candidate box projected to the image plane via several intuitive potentials encoding semantic segmentation, contextual information, size and location priors, and typical object shape. Our experimental evaluation demonstrates that our object proposal generation approach significantly outperforms all monocular approaches and achieves the best detection performance among published monocular competitors on the challenging KITTI benchmark.
{"title":"Monocular 3D Object Detection for Autonomous Driving","authors":"Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma, S. Fidler, R. Urtasun","doi":"10.1109/CVPR.2016.236","DOIUrl":"https://doi.org/10.1109/CVPR.2016.236","url":null,"abstract":"The goal of this paper is to perform 3D object detection from a single monocular image in the domain of autonomous driving. Our method first aims to generate a set of candidate class-specific object proposals, which are then run through a standard CNN pipeline to obtain high-quality object detections. The focus of this paper is on proposal generation. In particular, we propose an energy minimization approach that places object candidates in 3D using the fact that objects should be on the ground-plane. We then score each candidate box projected to the image plane via several intuitive potentials encoding semantic segmentation, contextual information, size and location priors and typical object shape. Our experimental evaluation demonstrates that our object proposal generation approach significantly outperforms all monocular approaches, and achieves the best detection performance on the challenging KITTI benchmark, among published monocular competitors.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"43 1","pages":"2147-2156"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90595389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We address the problem of fine-grained vehicle make & model recognition and verification. Our contribution is to show that extracting additional data from the video stream - besides the vehicle image itself - and feeding it into a deep convolutional neural network boosts the recognition performance considerably. This additional information includes the 3D vehicle bounding box used for "unpacking" the vehicle image, its rasterized low-resolution shape, and the 3D vehicle orientation. Experiments show that adding such information decreases classification error by 26% (the accuracy is improved from 0.772 to 0.832) and boosts verification average precision by 208% (0.378 to 0.785) compared to a baseline pure CNN without any input modifications. The pure baseline CNN also outperforms a recent state-of-the-art solution by 0.081. We provide "BoxCars", an annotated set of surveillance vehicle images augmented with various automatically extracted auxiliary information. Our approach and the dataset can considerably improve the performance of traffic surveillance systems.
{"title":"BoxCars: 3D Boxes as CNN Input for Improved Fine-Grained Vehicle Recognition","authors":"Jakub Sochor, A. Herout, Jirí Havel","doi":"10.1109/CVPR.2016.328","DOIUrl":"https://doi.org/10.1109/CVPR.2016.328","url":null,"abstract":"We are dealing with the problem of fine-grained vehicle make&model recognition and verification. Our contribution is showing that extracting additional data from the video stream - besides the vehicle image itself - and feeding it into the deep convolutional neural network boosts the recognition performance considerably. This additional information includes: 3D vehicle bounding box used for \"unpacking\" the vehicle image, its rasterized low-resolution shape, and information about the 3D vehicle orientation. Experiments show that adding such information decreases classification error by 26% (the accuracy is improved from 0.772 to 0.832) and boosts verification average precision by 208% (0.378 to 0.785) compared to baseline pure CNN without any input modifications. Also, the pure baseline CNN outperforms the recent state of the art solution by 0.081. We provide an annotated set \"BoxCars\" of surveillance vehicle images augmented by various automatically extracted auxiliary information. Our approach and the dataset can considerably improve the performance of traffic surveillance systems.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"4 1","pages":"3006-3015"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86921586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a system that finds text in natural scenes using a variety of cues. Our novel data-driven method incorporates coarse-to-fine detection of character pixels using convolutional features (Text-Conv), followed by extracting connected components (CCs) from characters using edge and color features, and finally performing a graph-based segmentation of CCs into words (Word-Graph). For Text-Conv, the initial detection is based on convolutional feature maps similar to those used in Convolutional Neural Networks (CNNs), but learned using Convolutional k-means. Convolution masks defined by local and neighboring patch features are used to improve detection accuracy. The Word-Graph algorithm uses contextual information to both improve word segmentation and prune false character/word detections. Different definitions of foreground (text) regions are used to train the detection stages, some based on bounding box intersection, and others on bounding box and pixel intersection. Our system obtains pixel, character, and word detection f-measures of 93.14%, 90.26%, and 86.77%, respectively, on the ICDAR 2015 Robust Reading Focused Scene Text dataset, outperforming state-of-the-art systems. This approach may also work for other detection targets with homogeneous color in natural scenes.
{"title":"A Text Detection System for Natural Scenes with Convolutional Feature Learning and Cascaded Classification","authors":"Siyu Zhu, R. Zanibbi","doi":"10.1109/CVPR.2016.74","DOIUrl":"https://doi.org/10.1109/CVPR.2016.74","url":null,"abstract":"We propose a system that finds text in natural scenes using a variety of cues. Our novel data-driven method incorporates coarse-to-fine detection of character pixels using convolutional features (Text-Conv), followed by extracting connected components (CCs) from characters using edge and color features, and finally performing a graph-based segmentation of CCs into words (Word-Graph). For Text-Conv, the initial detection is based on convolutional feature maps similar to those used in Convolutional Neural Networks (CNNs), but learned using Convolutional k-means. Convolution masks defined by local and neighboring patch features are used to improve detection accuracy. The Word-Graph algorithm uses contextual information to both improve word segmentation and prune false character/word detections. Different definitions for foreground (text) regions are used to train the detection stages, some based on bounding box intersection, and others on bounding box and pixel intersection. Our system obtains pixel, character, and word detection f-measures of 93.14%, 90.26%, and 86.77% respectively for the ICDAR 2015 Robust Reading Focused Scene Text dataset, out-performing state-of-the-art systems. This approach may work for other detection targets with homogenous color in natural scenes.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"110 1","pages":"625-632"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86740293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Saumitro Dasgupta, Kuan Fang, Kevin Chen, S. Savarese
We consider the problem of estimating the spatial layout of an indoor scene from a monocular RGB image, modeled as the projection of a 3D cuboid. Existing solutions to this problem often rely strongly on hand-engineered features and vanishing point detection, which are prone to failure in the presence of clutter. In this paper, we present a method that uses a fully convolutional neural network (FCNN) in conjunction with a novel optimization framework for generating layout estimates. We demonstrate that our method is robust in the presence of clutter and handles a wide range of highly challenging scenes. We evaluate our method on two standard benchmarks and show that it achieves state-of-the-art results, outperforming previous methods by a wide margin.
{"title":"DeLay: Robust Spatial Layout Estimation for Cluttered Indoor Scenes","authors":"Saumitro Dasgupta, Kuan Fang, Kevin Chen, S. Savarese","doi":"10.1109/CVPR.2016.73","DOIUrl":"https://doi.org/10.1109/CVPR.2016.73","url":null,"abstract":"We consider the problem of estimating the spatial layout of an indoor scene from a monocular RGB image, modeled as the projection of a 3D cuboid. Existing solutions to this problem often rely strongly on hand-engineered features and vanishing point detection, which are prone to failure in the presence of clutter. In this paper, we present a method that uses a fully convolutional neural network (FCNN) in conjunction with a novel optimization framework for generating layout estimates. We demonstrate that our method is robust in the presence of clutter and handles a wide range of highly challenging scenes. We evaluate our method on two standard benchmarks and show that it achieves state of the art results, outperforming previous methods by a wide margin.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"6 1","pages":"616-624"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84770744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}