Hand segmentation is one of the most fundamental and crucial steps for egocentric human-computer interaction. The egocentric viewpoint brings new challenges to the hand segmentation task, such as unpredictable environmental conditions. The performance of traditional hand segmentation methods depends on abundant manually labeled training data. However, these approaches do not fully capture the properties of egocentric human-computer interaction because they neglect the user-specific context: only a personalized hand model of the active user is needed. Based on this observation, we propose an online-learning hand segmentation approach that requires no manually labeled training data. Our approach consists of top-down classifications and bottom-up optimizations. More specifically, we divide the segmentation task into three parts: a frame-level hand detector, which detects the presence of the interacting hand using motion saliency and initializes hand masks for online learning; a superpixel-level hand classifier, which coarsely segments hand regions and from which stable samples are selected for the next level; and a pixel-level hand classifier, which produces a fine-grained hand segmentation. Based on the pixel-level classification result, we update the hand appearance model and optimize the upper-layer classifier and detector. This online-learning strategy makes our approach robust to varying illumination conditions and hand appearances. Experimental results demonstrate the robustness of our approach.
{"title":"Unsupervised Online Learning for Fine-Grained Hand Segmentation in Egocentric Video","authors":"Ying Zhao, Zhiwei Luo, Changqin Quan","doi":"10.1109/CRV.2017.17","DOIUrl":"https://doi.org/10.1109/CRV.2017.17","url":null,"abstract":"Hand segmentation is one of the most fundamental and crucial steps for egocentric human-computer interaction. The special egocentric view brings new challenges to hand segmentation task, such as the unpredictable environmental conditions. The performance of traditional hand segmentation methods depend on abundant manually labeled training data. However, these approaches do not appropriately capture the whole properties of egocentric human-computer interaction for neglecting the user-specific context. It is only necessary to build a personalized hand model of the active user. Based on this observation, we propose an online-learning hand segmentation approach without using manually labeled data for training. Our approach consists of top-down classifications and bottom-up optimizations. More specifically, we divide the segmentation task into three parts, a frame-level hand detection which detects the presence of the interactive hand using motion saliency and initializes hand masks for online learning, a superpixel-level hand classification which coarsely segments hand regions from which stable samples are selected for next level, and a pixel-level hand classification which produces a fine-grained hand segmentation. Based on the pixel-level classification result, we update the hand appearance model and optimize the upper layer classifier and detector. This online-learning strategy makes our approach robust to varying illumination conditions and hand appearances. Experimental results demonstrate the robustness of our approach.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122240783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Visual Odometry (VO) is a key enabling technology for mobile robotic systems that provides a relative motion estimate from a sequence of camera images. Cameras are comparatively inexpensive sensors that provide large amounts of useful data, making them one of the most common sensors in mobile robotics. However, because they are passive, they depend on external lighting, which can restrict their usefulness. Using headlights as an alternative lighting source, this paper investigates outdoor stereo VO performance under all lighting conditions during nearly 10 km of driving over 30 hours. Challenges include limited visibility range, a dynamic light source, and intensity hotspots, among others. A further significant issue is blooming and lens flare at dawn and dusk, when the camera looks directly into the sun. In our experiments, nighttime driving with headlights shows a moderately increased error of 2.38% over 250 m, compared with a daytime error of 1.5%. To the best of our knowledge, this is the first quantitative study of VO performance at night using headlights.
{"title":"Night Rider: Visual Odometry Using Headlights","authors":"K. MacTavish, M. Paton, T. Barfoot","doi":"10.1109/CRV.2017.48","DOIUrl":"https://doi.org/10.1109/CRV.2017.48","url":null,"abstract":"Visual Odometry (VO) is a key enabling technology for mobile robotic systems that provides a relative motion estimate from a sequence of camera images. Cameras are comparatively inexpensive sensors, and provide large amounts of useful data, making them one of the most common sensors in mobile robotics. However, because they are passive, they are dependent on external lighting, which can restrict their usefulness. Using headlights as an alternate lighting source, this paper investigates outdoor stereo VO performance under all lighting conditions during nearly 10 km of driving over 30 hours. Challenges include limited visibility range, a dynamic light source, intensity hotspots, and others. Another large issue comes from blooming and lens flare at dawn and dusk, when the camera is looking directly into the sun. In our experiments, nighttime driving with headlights has a moderately increased error of 2.38% over 250 m compared to the daytime error of 1.5%. To the best of our knowledge this is the first quantitative study of VO performance at night using headlights.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129839645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The task of transferring human knowledge and capabilities to robots is still an open problem. In this paper, we address the problem of transferring human grasping locations on a particular object to a robot manipulator. Using an RGBD sensor, we propose a computer-vision-based method for human hand detection. The method performs pixelwise hand detection in the color channel with a Random Forest classifier and kernel-based hand detection in the depth channel, and fuses the color and depth cues based on their joint probability. As a result, it is able to cope with noisy backgrounds and occlusion. We further apply the method to a grasping task example: in our test, the robot acquires grasping knowledge from visual observation. We evaluate the method on four sequences of varying difficulty, where it achieves high hand detection accuracy compared with RGB-only and depth-only methods.
{"title":"Towards Transferring Grasping from Human to Robot with RGBD Hand Detection","authors":"Rong Feng, Camilo Perez, Hong Zhang","doi":"10.1109/CRV.2017.45","DOIUrl":"https://doi.org/10.1109/CRV.2017.45","url":null,"abstract":"The task of transferring human knowledge and capabilities to robots is still an open problem. In this paper, we address the problem of transferring human grasping locations of a particular object to a robot manipulator. Using an RGBD sensor, we propose a computer vision based method for human hand detection. This method implements a pixelwise hand detection method with the Random Forest classification algorithm in the color channel. It also creates a kernel-based hand detection method in the depth channel. Based on the theory of joint probability, it fuses both color and depth cues. As a result, this method is able to deal with noisy background and occlusion. Moreover, we apply this method to a grasping task example. In our test, the robot is able to gain the grasping knowledge from visual observation. Our method is complemented with experimental results on the settings of four different sequences with different level of difficulties, and has achieved high performance with respect to hand detection accuracy in comparison with RGB and Depth only methods.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":" 15","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113947999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a bimodal biometric recognition system based on iris and palmprint. Different wavelet-based filters, including log Gabor, the Discrete Cosine Transform (DCT), Walsh, and Haar, are used to extract features from the images. We then fuse iris and palmprint at the feature level by concatenating the feature vectors of the two modalities. Since wavelet transforms generate a huge number of features, a dimensionality reduction step is necessary to make classification and matching tractable and computationally feasible. Two well-known dimensionality reduction algorithms, Laplacian eigenmaps and Singular Value Decomposition (SVD), are used to reduce the size of the feature space. Applying these methods not only decreases the computational cost of matching remarkably but also improves recognition accuracy by reducing unnecessary model complexity. Finally, multiple classification techniques are applied in the transformed feature spaces for matching and recognition. The CASIA iris and palmprint datasets are used in this study. The experiments show the effectiveness of our feature-level fusion and of the dimensionality reduction methods: the multimodal biometric system consistently outperforms the unimodal recognition systems, an appropriate dimensionality reduction algorithm consistently improves classifier accuracy, and the log Gabor filter extracts the most discriminative features compared with the other wavelet transforms.
{"title":"Manifold Learning of Overcomplete Feature Spaces in a Multimodal Biometric Recognition System of Iris and Palmprint","authors":"Habibeh Naderi, Behrouz Haji Soleimani, S. Matwin","doi":"10.1109/CRV.2017.29","DOIUrl":"https://doi.org/10.1109/CRV.2017.29","url":null,"abstract":"This paper presents a bimodal biometric recognition system based on iris and palmprint. Different wavelet-based filters including log Gabor, Discrete Cosine Transform (DCT), Walsh and Haar are used to extract features from images. Then we fuse iris and palmprint at the feature level by concatenating the feature vectors from two modalities. Since wavelet transforms generate huge number of features, a dimensionality reduction step is necessary to make the classification and matching steps tractable and computationally feasible. In this paper, two well-known dimensionality reduction algorithms including Laplacian eigenmaps and Singular Value Decomposition (SVD) are used to reduce the size of feature space. Applying these dimensionality reduction methods not only decreases the computational cost of matching remarkably but also it improves the accuracy of recognition by reducing the unnecessary model complexity. Eventually multiple classification techniques are used in the transformed feature spaces for the final matching and recognition. CASIA datasets for iris and palmprint are used in this study. The experiments show the effectiveness of our feature level fusion method and also the dimensionality reduction methods we used. Based on our experiments, our multimodal biometric system always outperforms the unimodal recognition systems with higher accuracy. Moreover, an appropriate dimensionality reduction algorithm always helps to improve the accuracy of classifier. Finally, the log Gabor filter extracts the most discriminative features from images compared to other wavelet transforms.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123741333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust visual place recognition (VPR) requires scene representations that are invariant to environmental challenges such as seasonal changes and variations in ambient lighting between day and night. Moreover, a practical VPR system requires compact representations of environmental features. To satisfy these requirements, we propose a modification to the existing VPR pipeline that incorporates supervised hashing. The modified system learns, in a supervised setting, compact binary codes from image feature descriptors. These binary codes absorb robustness to the visual variations seen during the training phase, making the system adaptive to severe environmental changes. Incorporating supervised hashing also makes VPR computationally more efficient and easy to implement on simple hardware, because the binary embeddings can be learned over simple-to-compute features and distances are computed in the low-dimensional Hamming space of binary codes. We perform experiments on several challenging datasets covering seasonal, illumination, and viewpoint variations. We also compare two widely used supervised hashing methods, CCAITQ [1] and MLH [1], and show that the new pipeline outperforms or closely matches state-of-the-art deep learning VPR methods based on high-dimensional features extracted from pre-trained deep convolutional neural networks.
{"title":"Compact Environment-Invariant Codes for Robust Visual Place Recognition","authors":"Unnat Jain, Vinay P. Namboodiri, Gaurav Pandey","doi":"10.1109/CRV.2017.22","DOIUrl":"https://doi.org/10.1109/CRV.2017.22","url":null,"abstract":"Robust visual place recognition (VPR) requires scene representations that are invariant to various environmental challenges such as seasonal changes and variations due to ambient lighting conditions during day and night. Moreover, a practical VPR system necessitates compact representations of environmental features. To satisfy these requirements, in this paper we suggest a modification to the existing pipeline of VPR systems to incorporate supervised hashing. The modified system learns (in a supervised setting) compact binary codes from image feature descriptors. These binary codes imbibe robustness to the visual variations exposed to it during the training phase, thereby, making the system adaptive to severe environmental changes. Also, incorporating supervised hashing makes VPR computationally more efficient and easy to implement on simple hardware. This is because binary embeddings can be learned over simple-to-compute features and the distance computation is also in the low dimensional hamming space of binary codes. We have performed experiments on several challenging data sets covering seasonal, illumination and viewpoint variations. We also compare two widely used supervised hashing methods of CCAITQ [1] and MLH [1] and show that this new pipeline out-performs or closely matches the state-of-the-art deep learning VPR methods that are based on high-dimensional features extracted from pre-trained deep convolutional neural networks.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115525178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Kinect-based pose estimation system is presented for the study of movement problems in a rehabilitation context. The performance of the system is compared against ground-truth data obtained with an expensive MoCap system. The results show that the proposed system performs well and could be used in a virtual rehabilitation setting, synchronized with other systems (e.g., robots).
{"title":"A Computer Vision System for Virtual Rehabilitation","authors":"Michael Bonenfant, D. Laurendeau, Alexis Fortin-Côté, P. Cardou, C. Gosselin, C. Faure, B. McFadyen, C. Mercier, L. Bouyer","doi":"10.1109/CRV.2017.30","DOIUrl":"https://doi.org/10.1109/CRV.2017.30","url":null,"abstract":"A Kinect-based pose estimation system is presented for the study of movement problems within a rehab context. The performance of the system is compared to ground-truth data obtained by an expensive MoCap system. The results show that the proposed system performs well and could be used within a virtual rehabilitation context synchronized with other systems (e.g., robots).","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114544759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a novel user interface for aiming and launching flying robots on user-defined trajectories. The method requires no user instrumentation and is easy to learn by analogy to a slingshot. With a few minutes of practice users can send robots along a desired 3D trajectory and place them in 3D space, including at high altitude and beyond line-of-sight. With the robot hovering in front of the user, the robot tracks the user's face to estimate its relative pose. The azimuth, elevation and distance of this pose control the parameters of the robot's subsequent trajectory. The user triggers the robot to fly the trajectory by making a distinct pre-trained facial expression. We propose three different trajectory types for different applications: straight-line, parabola, and circling. We also describe a simple training/startup interaction to select a trajectory type and train the aiming and triggering faces. In real-world experiments we demonstrate and evaluate the method. We also show that the face-recognition system is resistant to input from unauthorized users.
{"title":"Ready—Aim—Fly! Hands-Free Face-Based HRI for 3D Trajectory Control of UAVs","authors":"Jake Bruce, Jacob M. Perron, R. Vaughan","doi":"10.1109/CRV.2017.39","DOIUrl":"https://doi.org/10.1109/CRV.2017.39","url":null,"abstract":"We present a novel user interface for aiming andlaunching flying robots on user-defined trajectories. The methodrequires no user instrumentation and is easy to learn by analogyto a slingshot. With a few minutes of practice users can sendrobots along a desired 3D trajectory and place them in 3D space, including at high altitude and beyond line-of-sight. With the robot hovering in front of the user, the robot tracksthe user's face to estimate its relative pose. The azimuth, elevationand distance of this pose control the parameters of the robot'ssubsequent trajectory. The user triggers the robot to fly thetrajectory by making a distinct pre-trained facial expression. Wepropose three different trajectory types for different applications:straight-line, parabola, and circling. We also describe a simple training/startup interaction to selecta trajectory type and train the aiming and triggering faces. Inreal-world experiments we demonstrate and evaluate the method. We also show that the face-recognition system is resistant to inputfrom unauthorized users.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130876381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We develop an approach for unsupervised learning of associations between co-occurring perceptual events using a large graph, and apply it to successfully solve the image captcha of China's railroad system. The approach is based on the principle of suspicious coincidence, originally proposed by Barlow [1], who argued that the brain builds a statistical model of the world by learning associations between events that repeatedly co-occur. In this problem, a user is presented with a deformed picture of a Chinese phrase and eight low-resolution images, and must quickly select the relevant images in order to purchase train tickets. The problem presents several challenges: (1) teaching labels for the Chinese phrases and the images are not available for supervised learning, (2) no pre-trained deep convolutional neural networks exist for recognizing these Chinese phrases or the presented images, and (3) each captcha must be solved within a few seconds. We collected 2.6 million captchas, comprising 2.6 million deformed Chinese phrases and over 21 million images. From these data, we constructed an association graph of over 6 million vertices, linking vertices based on co-occurrence information and feature similarity between pairs of images. We then trained a deep convolutional neural network to learn a projection of the Chinese phrases onto a 230-dimensional latent space. Using label propagation, we computed the likelihood of each of the eight images conditioned on the latent-space projection of the deformed phrase for each captcha. The resulting system solves captchas with 77% accuracy in 2 seconds on average. In answering this practical challenge, our work illustrates the power of this class of unsupervised association learning techniques, which may be related to the brain's general strategy for associating language stimuli with visual objects on the principle of suspicious coincidence.
{"title":"Learning to Associate Words and Images Using a Large-Scale Graph","authors":"Heqing Ya, Haonan Sun, Jeffrey Helt, T. Lee","doi":"10.1109/CRV.2017.52","DOIUrl":"https://doi.org/10.1109/CRV.2017.52","url":null,"abstract":"We develop an approach for unsupervised learning of associations between co-occurring perceptual events using a large graph. We applied this approach to successfully solve the image captcha of China's railroad system. The approach is based on the principle of suspicious coincidence, originally proposed by Barlow [1], who argued that the brain builds a statistical model of the world by learning associations between events that repeatedly co-occur. In this particular problem, a user is presented with a deformed picture of a Chinese phrase and eight low-resolution images. They must quickly select the relevant images in order to purchase their train tickets. This problem presents several challenges: (1) the teaching labels for both the Chinese phrases and the images were not available for supervised learning, (2) no pre-trained deep convolutional neural networks are available for recognizing these Chinese phrases or the presented images, and (3) each captcha must be solved within a few seconds. We collected 2.6 million captchas, with 2.6 million deformed Chinese phrases and over 21 million images. From these data, we constructed an association graph, composed of over 6 million vertices, and linked these vertices based on co-occurrence information and feature similarity between pairs of images. We then trained a deep convolutional neural network to learn a projection of the Chinese phrases onto a 230- dimensional latent space. Using label propagation, we computed the likelihood of each of the eight images conditioned on the latent space projection of the deformed phrase for each captcha. The resulting system solved captchas with 77% accuracy in 2 seconds on average. Our work, in answering this practical challenge, illustrates the power of this class of unsupervised association learning techniques, which may be related to the brain's general strategy for associating language stimuli with visual objects on the principle of suspicious coincidence.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130242333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recurrent feedback connections in the mammalian visual system have been hypothesized to play a role in synthesizing input within the theoretical framework of analysis by synthesis. Comparing the internally synthesized representation with the input provides a validation mechanism during perceptual inference and learning. Inspired by these ideas, we propose that the synthesis machinery can compose new, unobserved images by imagination to train the network itself, increasing the robustness of the system in novel scenarios. As a proof of concept, we investigate whether images composed by imagination can help an object recognition system deal with occlusion, which remains challenging for current state-of-the-art deep convolutional neural networks. We fine-tuned a network on images containing objects in various occlusion scenarios that are imagined, or self-generated, through a deep generator network. Trained on imagined occlusion scenarios under the object-persistence constraint, our network discovered more subtle and localized image features that the original network neglected for object classification, obtaining better separability of the object classes in feature space. This leads to a significant improvement in object recognition under occlusion relative to the original network trained only on un-occluded images. Beyond the practical benefits for recognition under occlusion, this work demonstrates that self-generated composition of visual scenes through the synthesis loop, combined with the object-persistence constraint, can give neural networks the opportunity to discover new relevant patterns in the data and become more flexible in dealing with novel situations.
{"title":"Learning Robust Object Recognition Using Composed Scenes from Generative Models","authors":"Hao Wang, Xingyu Lin, Yimeng Zhang, T. Lee","doi":"10.1109/CRV.2017.42","DOIUrl":"https://doi.org/10.1109/CRV.2017.42","url":null,"abstract":"Recurrent feedback connections in the mammalian visual system have been hypothesized to play a role in synthesizing input in the theoretical framework of analysis by synthesis. The comparison of internally synthesized representation with that of the input provides a validation mechanism during perceptual inference and learning. Inspired by these ideas, we proposed that the synthesis machinery can compose new, unobserved images by imagination to train the network itself so as to increase the robustness of the system in novel scenarios. As a proof of concept, we investigated whether images composed by imagination could help an object recognition system to deal with occlusion, which is challenging for the current state-of-the-art deep convolutional neural networks. We fine-tuned a network on images containing objects in various occlusion scenarios, that are imagined or self-generated through a deep generator network. Trained on imagined occluded scenarios under the object persistence constraint, our network discovered more subtle and localized image features that were neglected by the original network for object classification, obtaining better separability of different object classes in the feature space. This leads to significant improvement of object recognition under occlusion for our network relative to the original network trained only on un-occluded images. In addition to providing practical benefits in object recognition under occlusion, this work demonstrates the use of self-generated composition of visual scenes through the synthesis loop, combined with the object persistence constraint, can provide opportunities for neural networks to discover new relevant patterns in the data, and become more flexible in dealing with novel situations.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122330857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The use of different priors (e.g., shape and appearance) has proven critical for robust image segmentation of different types of target objects. Many existing methods for extracting trees (e.g., vascular or airway trees) from medical images leverage appearance priors (e.g., tubular-ness and bifurcation-ness) and knowledge of the cross-sectional geometry (e.g., circles or ellipses) of the tree-forming tubes. In this work, we present the first method for 3D tree extraction from 3D medical images (e.g., CT or MRI) that, in addition to appearance and cross-sectional geometry priors, utilizes prior tree statistics collected from training data. Our method collects and leverages topological tree priors and geometrical statistics, including tree hierarchy, branch angle, and branch length statistics. Our implementation takes the form of a Bayesian tree centerline tracking method that combines the aforementioned tree priors with the observed image data. We evaluated the method on both synthetic 3D datasets and real clinical chest CT datasets. For synthetic data, incorporating tree priors resulted in at least a 13% increase in correctly detected branches across different noise levels. For real clinical scans, the mean distance from ground-truth centerlines to the detected centerlines improved by 12% when tree priors were used. Both experiments validate that, by incorporating tree statistics, our tree extraction method becomes more robust to noise and provides more accurate branch localization.
{"title":"Leveraging Tree Statistics for Extracting Anatomical Trees from 3D Medical Images","authors":"Mengliu Zhao, Brandon Miles, G. Hamarneh","doi":"10.1109/CRV.2017.15","DOIUrl":"https://doi.org/10.1109/CRV.2017.15","url":null,"abstract":"Using different priors (e.g. shape and appearance) have proven critical for robust image segmentation of different types of target objects. Many existing methods for extracting trees (e.g. vascular or airway trees) from medical images have leveraged appearance priors (e.g. tubular-ness and bifurcationness) and the knowledge of the cross-sectional geometry (e.g. circles or ellipses) of the tree-forming tubes. In this work, we present the first method for 3D tree extraction from 3D medical images (e.g. CT or MRI) that, in addition to appearance and cross-sectional geometry priors, utilizes prior tree statistics collected from the training data. Our tree extraction method collects and leverages topological tree prior and geometrical statistics, including tree hierarchy, branch angle and length statistics. Our implementation takes the form of a Bayesian tree centerline tracking method combining the aforementioned tree priors with observed image data. We evaluated our method on both synthetic 3D datasets and real clinical CT chest datasets. For synthetic data, our method's key feature of incorporating tree priors resulted in at least 13% increase in correctly detected branches under different noise levels. For real clinical scans, the mean distance from ground truth centerlines to the detected centerlines by our method was improved by 12% when utilizing tree priors. Both experiments validate that, by incorporating tree statistics, our tree extraction method becomes more robust to noise and provides more accurate branch localization.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123548186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}