Wenzheng Chen, Huan Wang, Yangyan Li, Hao Su, Zhenhua Wang, Changhe Tu, D. Lischinski, D. Cohen-Or, Baoquan Chen
Human 3D pose estimation from a single image is a challenging task with numerous applications. Convolutional Neural Networks (CNNs) have recently achieved superior performance on the task of 2D pose estimation from a single image, by training on images with 2D annotations collected by crowdsourcing. This suggests that similar success could be achieved for direct estimation of 3D poses. However, 3D poses are much harder to annotate, and the lack of suitable annotated training images hinders attempts towards end-to-end solutions. To address this issue, we opt to automatically synthesize training images with ground-truth pose annotations. Our work is a systematic study in this direction. We find that pose space coverage and texture diversity are the key ingredients for the effectiveness of synthetic training data. We present a fully automatic, scalable approach that samples the human pose space to guide the synthesis procedure and extracts clothing textures from real images. Furthermore, we explore domain adaptation to bridge the gap between our synthetic training images and real test photos. We demonstrate that CNNs trained with our synthetic images outperform those trained with real photos on 3D pose estimation tasks.
{"title":"Synthesizing Training Images for Boosting Human 3D Pose Estimation","authors":"Wenzheng Chen, Huan Wang, Yangyan Li, Hao Su, Zhenhua Wang, Changhe Tu, D. Lischinski, D. Cohen-Or, Baoquan Chen","doi":"10.1109/3DV.2016.58","DOIUrl":"https://doi.org/10.1109/3DV.2016.58","url":null,"abstract":"Human 3D pose estimation from a single image is a challenging task with numerous applications. Convolutional Neural Networks (CNNs) have recently achieved superior performance on the task of 2D pose estimation from a single image, by training on images with 2D annotations collected by crowd sourcing. This suggests that similar success could be achieved for direct estimation of 3D poses. However, 3D poses are much harder to annotate, and the lack of suitable annotated training images hinders attempts towards end-to-end solutions. To address this issue, we opt to automatically synthesize training images with ground truth pose annotations. Our work is a systematic study along this road. We find that pose space coverage and texture diversity are the key ingredients for the effectiveness of synthetic training data. We present a fully automatic, scalable approach that samples the human pose space for guiding the synthesis procedure and extracts clothing textures from real images. Furthermore, we explore domain adaptation for bridging the gap between our synthetic training images and real testing photos. We demonstrate that CNNs trained with our synthetic images out-perform those trained with real photos on 3D pose estimation tasks.","PeriodicalId":425304,"journal":{"name":"2016 Fourth International Conference on 3D Vision (3DV)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126948504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recovering the radiometric properties of a scene (i.e., its reflectance, illumination, and geometry) is a long-sought ability of computer vision that can provide invaluable information for a wide range of applications. Deciphering the radiometric ingredients from the appearance of a real-world scene, as opposed to a single isolated object, is particularly challenging: such a scene generally consists of multiple objects with different material compositions, exhibiting complex reflectance and light interactions in which the objects themselves also act as part of the illumination. We introduce the first method for radiometric decomposition of real-world scenes that handles these intricacies. We use RGB-D images to bootstrap geometry recovery and simultaneously recover the complex reflectance and natural illumination while refining the noisy initial geometry and segmenting the scene into different material regions. Most importantly, we handle real-world scenes consisting of multiple objects of unknown materials, which necessitates modeling spatially-varying complex reflectance, natural illumination, texture, interreflection, and shadows. We systematically evaluate the effectiveness of our method on synthetic scenes and demonstrate its application to real-world scenes. The results show that rich radiometric information can be recovered from RGB-D images and demonstrate a new role that RGB-D sensors can play in general scene understanding tasks.
{"title":"Radiometric Scene Decomposition: Scene Reflectance, Illumination, and Geometry from RGB-D Images","authors":"Stephen Lombardi, K. Nishino","doi":"10.1109/3DV.2016.39","DOIUrl":"https://doi.org/10.1109/3DV.2016.39","url":null,"abstract":"Recovering the radiometric properties of a scene (i.e., the reflectance, illumination, and geometry) is a long-sought ability of computer vision that can provide invaluable information for a wide range of applications. Deciphering the radiometric ingredients from the appearance of a real-world scene, as opposed to a single isolated object, is particularly challenging as it generally consists of various objects with different material compositions exhibiting complex reflectance and light interactions that are also part of the illumination. We introduce the first method for radiometric decomposition of real-world scenes that handles those intricacies. We use RGB-D images to bootstrap geometry recovery and simultaneously recover the complex reflectance and natural illumination while refining the noisy initial geometry and segmenting the scene into different material regions. Most important, we handle real-world scenes consisting of multiple objects of unknown materials, which necessitates the modeling of spatially-varying complex reflectance, natural illumination, texture, interreflection and shadows. We systematically evaluate the effectiveness of our method on synthetic scenes and demonstrate its application to real-world scenes. The results show that rich radiometric information can be recovered from RGB-D images and demonstrate a new role RGB-D sensors can play for general scene understanding tasks.","PeriodicalId":425304,"journal":{"name":"2016 Fourth International Conference on 3D Vision (3DV)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124973452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most dense RGB/RGB-D SLAM systems require the brightness of 3-D points observed from different viewpoints to be constant. In reality, however, this assumption is difficult to meet even when the surface is Lambertian and the illumination is static. One cause is that most cameras automatically tune their exposure to adapt to the wide dynamic range of scene radiance, violating the brightness-constancy assumption. We describe a novel system, HDRFusion, which turns this apparent drawback into an advantage by fusing LDR frames into an HDR textured volume using a standard RGB-D sensor with auto-exposure (AE) enabled. The key contribution is the use of a normalised metric for frame alignment that is invariant to changes in exposure time. This enables robust tracking in frame-to-model mode and also compensates for exposure accurately, so that HDR texture, free of artefacts, can be generated online. We demonstrate that the approach greatly improves tracking robustness and accuracy, and that radiance maps spanning a far greater dynamic range of the scene can be generated.
{"title":"HDRFusion: HDR SLAM Using a Low-Cost Auto-Exposure RGB-D Sensor","authors":"Shuda Li, Ankur Handa, Yang Zhang, A. Calway","doi":"10.1109/3DV.2016.40","DOIUrl":"https://doi.org/10.1109/3DV.2016.40","url":null,"abstract":"Most dense RGB/RGB-D SLAM systems require the brightness of 3-D points observed from different viewpoints to be constant. However, in reality, this assumption is difficult to meet even when the surface is Lambertian and illumination is static. One cause is that most cameras automatically tune exposure to adapt to the wide dynamic range of scene radiance, violating the brightness assumption. We describe a novel system - HDRFusion - which turns this apparent drawback into an advantage by fusing LDR frames into an HDR textured volume using a standard RGB-D sensor with auto-exposure (AE) enabled. The key contribution is the use of a normalised metric for frame alignment which is invariant to changes in exposure time. This enables robust tracking in frame-to-model mode and also compensates the exposure accurately so that HDR texture, free of artefacts, can be generated online. We demonstrate that the tracking robustness and accuracy is greatly improved by the approach and that radiance maps can be generated with far greater dynamic range of scene radiance.","PeriodicalId":425304,"journal":{"name":"2016 Fourth International Conference on 3D Vision (3DV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114107019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Julien P. C. Valentin, Angela Dai, M. Nießner, Pushmeet Kohli, Philip H. S. Torr, S. Izadi, Cem Keskin
In this paper, we present a novel, general, and efficient architecture for addressing computer vision problems that are approached from an 'Analysis by Synthesis' standpoint. Analysis by synthesis involves the minimization of reconstruction error, which is typically a non-convex function of the latent target variables. State-of-the-art methods adopt a hybrid scheme where discriminatively trained predictors like Random Forests or Convolutional Neural Networks are used to initialize local search algorithms. While these hybrid methods have been shown to produce promising results, they often get stuck in local optima. Our method goes beyond the conventional hybrid architecture by not only proposing multiple accurate initial solutions but by also defining a navigational structure over the solution space that can be used for extremely efficient gradient-free local search. We demonstrate the efficacy and generalizability of our approach on tasks as diverse as Hand Pose Estimation, RGB Camera Relocalization, and Image Retrieval.
{"title":"Learning to Navigate the Energy Landscape","authors":"Julien P. C. Valentin, Angela Dai, M. Nießner, Pushmeet Kohli, Philip H. S. Torr, S. Izadi, Cem Keskin","doi":"10.1109/3DV.2016.41","DOIUrl":"https://doi.org/10.1109/3DV.2016.41","url":null,"abstract":"In this paper, we present a novel, general, and efficient architecture for addressing computer vision problems that are approached from an 'Analysis by Synthesis' standpoint. Analysis by synthesis involves the minimization of reconstruction error, which is typically a non-convex function of the latent target variables. State-of-the-art methods adopt a hybrid scheme where discriminatively trained predictors like Random Forests or Convolutional Neural Networks are used to initialize local search algorithms. While these hybrid methods have been shown to produce promising results, they often get stuck in local optima. Our method goes beyond the conventional hybrid architecture by not only proposing multiple accurate initial solutions but by also defining a navigational structure over the solution space that can be used for extremely efficient gradient-free local search. We demonstrate the efficacy and generalizability of our approach on tasks as diverse as Hand Pose Estimation, RGB Camera Relocalization, and Image Retrieval.","PeriodicalId":425304,"journal":{"name":"2016 Fourth International Conference on 3D Vision (3DV)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127858887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nikolay Kobyshev, Hayko Riemenschneider, A. Bódis-Szomorú, L. Gool
In urban environments, the most distinctive and useful cues for localization and navigation are landmark buildings. This paper proposes a novel method to detect such buildings that stand out, i.e., those that would be given the status of 'landmark'. The method works in a fully unsupervised way, i.e., it can be applied to different cities without requiring annotation. First, salient points are detected based on the analysis of their features as well as those found in their spatial neighborhood. Second, learning refines these points by finding connected landmark components and training a classifier to distinguish them from common building components. Third, landmark components are aggregated into complete landmark buildings. Experiments on city-scale point clouds show the viability and efficiency of our approach on various tasks.
{"title":"3D Saliency for Finding Landmark Buildings","authors":"Nikolay Kobyshev, Hayko Riemenschneider, A. Bódis-Szomorú, L. Gool","doi":"10.1109/3DV.2016.35","DOIUrl":"https://doi.org/10.1109/3DV.2016.35","url":null,"abstract":"In urban environments the most interesting and effective factors for localization and navigation are landmark buildings. This paper proposes a novel method to detect such buildings that stand out, i.e. would be given the status of 'landmark'. The method works in a fully unsupervised way, i.e. it can be applied to different cities without requiring annotation. First, salient points are detected, based on the analysis of their features as well as those found in their spatial neighborhood. Second, learning refines the points by finding connected landmark components and training a classifier to distinguish these from common building components. Third, landmark components are aggregated into complete landmark buildings. Experiments on city-scale point clouds show the viability and efficiency of our approach on various tasks.","PeriodicalId":425304,"journal":{"name":"2016 Fourth International Conference on 3D Vision (3DV)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125180548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}