Distinguishing edges caused by a change in depth from other types of edges is an important problem in early vision. We investigate the performance of humans and computer vision models on this task. We use spherical imagery with ground-truth LiDAR range data to build an objective ground-truth dataset for edge classification. We compare various computational models for classifying depth versus non-depth edges in small image patches and achieve the best performance (86%) with a convolutional neural network (CNN). In a behavioral experiment, we find that human performance on this task is lower than that of the CNN. Although human and CNN depth responses are correlated, observers' responses are better predicted by other observers than by the CNN. The responses of CNNs and human observers also show slightly different patterns of correlation with low-level edge cues, which suggests that CNNs and human observers may weight these features differently when classifying edges.
{"title":"Local Depth Edge Detection in Humans and Deep Neural Networks","authors":"Krista A. Ehinger, E. Graf, W. Adams, J. Elder","doi":"10.1109/ICCVW.2017.316","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.316","url":null,"abstract":"Distinguishing edges caused by a change in depth from other types of edges is an important problem in early vision. We investigate the performance of humans and computer vision models on this task. We use spherical imagery with ground-truth LiDAR range data to build an objective ground-truth dataset for edge classification. We compare various computational models for classifying depth from non-depth edges in small images patches and achieve the best performance (86%) with a convolutional neural network. We investigate human performance on this task in a behavioral experiment and find that human performance is lower than the CNN. Although human and CNN depth responses are correlated, observers' responses are better predicted by other observers than by the CNN. The responses of CNNs and human observers also show a slightly different pattern of correlation with low-level edge cues, which suggests that CNNs and human observers may weight these features differently for classifying edges.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128928736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper introduces a graph-based semi-supervised elastic embedding method, as well as its kernelized version, for face image embedding and classification. The proposed framework combines Flexible Manifold Embedding and non-linear graph-based embedding for semi-supervised learning. In both proposed methods, the non-linear manifold and the mapping (a linear transform for the linear method, kernel multipliers for the kernelized method) are estimated simultaneously, which overcomes the shortcomings of cascaded estimation. Unlike many state-of-the-art non-linear embedding approaches, which suffer from the out-of-sample problem, our proposed methods have a direct out-of-sample extension to novel samples. We conduct experiments on face recognition and image-based face orientation estimation on four public databases. These experiments show improvements over state-of-the-art algorithms based on label propagation or graph-based semi-supervised embedding.
{"title":"Margin Based Semi-Supervised Elastic Embedding for Face Image Analysis","authors":"F. Dornaika, Y. E. Traboulsi","doi":"10.1109/ICCVW.2017.156","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.156","url":null,"abstract":"This paper introduces a graph-based semi-supervised elastic embedding method as well as its kernelized version for face image embedding and classification. The proposed frameworks combines Flexible Manifold Embedding and non-linear graph based embedding for semi-supervised learning. In both proposed methods, the nonlinear manifold and the mapping (linear transform for the linear method and the kernel multipliers for the kernelized method) are simultaneously estimated, which overcomes the shortcomings of a cascaded estimation. Unlike many state-of-the art non-linear embedding approaches which suffer from the out-of-sample problem, our proposed methods have a direct out-of-sample extension to novel samples. We conduct experiments for tackling the face recognition and image-based face orientation problems on four public databases. These experiments show improvement over the state-of-the-art algorithms that are based on label propagation or graph-based semi-supervised embedding.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116355711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generative adversarial networks (GANs) can be used to learn a generation function from a joint probability distribution as input, so that visual samples with semantic properties can be generated from a marginal probability distribution. In this paper, we propose a novel algorithm, Max-Boost-GAN, which is shown to boost the generative ability of GANs when the generation error is upper bounded. Moreover, Max-Boost-GAN can learn the generation functions from two marginal probability distributions as input, so that samples of higher visual quality and variety can be generated from the joint probability distribution. Finally, novel objective functions are proposed to obtain convergence when training Max-Boost-GAN. Experiments on the generation of binary digits and RGB human faces show that Max-Boost-GAN achieves the expected boost in generative ability.
{"title":"Max-Boost-GAN: Max Operation to Boost Generative Ability of Generative Adversarial Networks","authors":"Xinhan Di, Pengqian Yu","doi":"10.1109/ICCVW.2017.140","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.140","url":null,"abstract":"Generative adversarial networks (GANs) can be used to learn a generation function from a joint probability distribution as an input, and then visual samples with semantic properties can be generated from a marginal probability distribution. In this paper, we propose a novel algorithm named Max-Boost-GAN, which is demonstrated to boost the generative ability of GANs when the error of generation is upper bounded. Moreover, the Max-Boost-GAN can be used to learn the generation functions from two marginal probability distributions as the input, and samples of higher visual quality and variety could be generated from the joint probability distribution. Finally, novel objective functions are proposed for obtaining convergence during training the Max-Boost-GAN. Experiments on the generation of binary digits and RGB human faces show that the Max-Boost-GAN achieves boosted ability of generation as expected.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126629739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Designing autonomous vehicles suitable for urban environments remains an unresolved problem. One of the major dilemmas faced by autonomous cars is how to understand the intentions of other road users and communicate with them. Existing datasets do not provide the necessary means for such higher-level analysis of traffic scenes. With this in mind, we introduce a novel dataset which, in addition to providing bounding box information for pedestrian detection, also includes behavioral and contextual annotations for the scenes. This allows visual and semantic information to be combined for a better understanding of pedestrians' intentions in various traffic scenarios. We establish baseline approaches for analyzing the data and show that combining visual and contextual information improves prediction of pedestrian intention at the point of crossing by at least 20%.
{"title":"Are They Going to Cross? A Benchmark Dataset and Baseline for Pedestrian Crosswalk Behavior","authors":"Amir Rasouli, Iuliia Kotseruba, John K. Tsotsos","doi":"10.1109/ICCVW.2017.33","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.33","url":null,"abstract":"Designing autonomous vehicles suitable for urban environments remains an unresolved problem. One of the major dilemmas faced by autonomous cars is how to understand the intention of other road users and communicate with them. The existing datasets do not provide the necessary means for such higher level analysis of traffic scenes. With this in mind, we introduce a novel dataset which in addition to providing the bounding box information for pedestrian detection, also includes the behavioral and contextual annotations for the scenes. This allows combining visual and semantic information for better understanding of pedestrians' intentions in various traffic scenarios. We establish baseline approaches for analyzing the data and show that combining visual and contextual information can improve prediction of pedestrian intention at the point of crossing by at least 20%.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128787549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we propose a method to find the locations of crop plants in Unmanned Aerial Vehicle (UAV) imagery. Locating each plant is a crucial step in deriving and tracking per-plant phenotypic traits. We describe initial work on estimating field crop plant locations. We approach the problem by classifying pixels as plant centers or non-plant centers, and use Multiple Instance Learning (MIL) to handle the ambiguity of plant-center labeling in the training data. The classification results are then post-processed to estimate the exact location of each crop plant. Experimental evaluation shows that the method achieves an overall precision of 66% and recall of 64%.
{"title":"Locating Crop Plant Centers from UAV-Based RGB Imagery","authors":"Yuhao Chen, Javier Ribera, C. Boomsma, E. Delp","doi":"10.1109/ICCVW.2017.238","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.238","url":null,"abstract":"In this paper we propose a method to find the location of crop plants in Unmanned Aerial Vehicle (UAV) imagery. Finding the location of plants is a crucial step to derive and track phenotypic traits for each plant. We describe some initial work in estimating field crop plant locations. We approach the problem by classifying pixels as a plant center or a non plant center. We use Multiple Instance Learning (MIL) to handle the ambiguity of plant center labeling in training data. The classification results are then post-processed to estimate the exact location of the crop plant. Experimental evaluation is conducted to evaluate the method and the result achieved an overall precision and recall of 66% and 64%, respectively.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130721906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a novel method for fusing geometric and appearance cues for road surface segmentation. Modeling colour cues using Gaussian mixtures allows the fusion to be performed optimally within a Bayesian framework, avoiding ad hoc weights. Adaptation to different scene conditions is accomplished through nearest-neighbour appearance model selection over a dictionary of mixture models learned from training data, and the thorny problem of selecting the number of components in each mixture is solved through a novel cross-validation approach. Quantitative evaluation reveals that the proposed fusion method significantly improves segmentation accuracy relative to a method that uses geometric cues alone.
{"title":"Fusing Geometry and Appearance for Road Segmentation","authors":"Gong Cheng, Yiming Qian, J. Elder","doi":"10.1109/ICCVW.2017.28","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.28","url":null,"abstract":"We propose a novel method for fusing geometric and appearance cues for road surface segmentation. Modeling colour cues using Gaussian mixtures allows the fusion to be performed optimally within a Bayesian framework, avoiding ad hoc weights. Adaptation to different scene conditions is accomplished through nearest-neighbour appearance model selection over a dictionary of mixture models learned from training data, and the thorny problem of selecting the number of components in each mixture is solved through a novel cross-validation approach. Quantitative evaluation reveals that the proposed fusion method significantly improves segmentation accuracy relative to a method that uses geometric cues alone.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116768027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Visual attention is a mechanism used by the brain to avoid unnecessary processing and to focus on the most relevant parts of the visual scene. It can yield a remarkable reduction in the computational complexity of scene understanding. Two major kinds of top-down visual attention signals are spatial and feature-based attention. The former deals with the locations in the scene that are worth attending to, while the latter concerns the basic features of objects, e.g., color, intensity, and edges. In principle, there are two known sources of the spatial attention signal: the Frontal Eye Field (FEF) in the prefrontal cortex and the Lateral Intraparietal Cortex (LIP) in the parietal cortex. In this paper, we first introduce a combined neuro-computational model of the ventral and dorsal streams, and then show in Virtual Reality (VR) that the spatial attention signal provided by LIP acts as a transsaccadic memory pointer which accelerates object localization.
{"title":"Spatial Attention Improves Object Localization: A Biologically Plausible Neuro-Computational Model for Use in Virtual Reality","authors":"A. Jamalian, Julia Bergelt, H. Dinkelbach","doi":"10.1109/ICCVW.2017.320","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.320","url":null,"abstract":"Visual attention is a smart mechanism performed by the brain to avoid unnecessary processing and to focus on the most relevant part of the visual scene. It can result in a remarkable reduction in the computational complexity of scene understanding. Two major kinds of top-down visual attention signals are spatial and feature-based attention. The former deals with the places in scene which are worth to attend, while the latter is more involved with the basic features of objects e.g. color, intensity, edges. In principle, there are two known sources of generating a spatial attention signal: Frontal Eye Field (FEF) in the prefrontal cortex and Lateral Intraparietal Cortex (LIP) in the parietal cortex. In this paper, first, a combined neuro-computational model of ventral and dorsal stream is introduced and then, it is shown in Virtual Reality (VR) that the spatial attention, provided by LIP, acts as a transsaccadic memory pointer which accelerates object localization.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"39 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132555873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep reinforcement learning enables autonomous robots to learn large repertoires of behavioral skills with minimal human intervention. However, the applications of direct deep reinforcement learning remain restricted: for complicated robotic systems, the limitations stem from the high-dimensional action space, the high degrees of freedom of the robotic system, and the high correlation between images. In this paper we introduce a new definition of the action space and propose a double-task deep Q-Network with multiple views (DMDQN), based on double DQN and dueling DQN. As an extension, we define a multi-task model for more complex jobs. Moreover, a data augmentation policy is applied, which includes auto-sampling and action-overturn; the exploration policy is formed by combining DMDQN with this data augmentation. For steady exploration of the robotic system, we design safety constraints according to the working conditions. Our experiments show that the double-task DQN with multiple views performs better than single-task and single-view models. Combining DMDQN and data augmentation, the robotic system can reach the target object in an exploratory manner.
{"title":"Double-Task Deep Q-Learning with Multiple Views","authors":"Tingzhu Bai, Jianing Yang, Jun Chen, Xian Guo, Xiangsheng Huang, Yu-Ni Yao","doi":"10.1109/ICCVW.2017.128","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.128","url":null,"abstract":"Deep Reinforcement learning enables autonomous robots to learn large repertories of behavioral skill with minimal human intervention. However, the applications of direct deep reinforcement learning have been restricted. For complicated robotic systems, these limitations result from high dimensional action space, high freedom of robotic system and high correlation between images. In this paper we introduce a new definition of action space and propose a double-task deep Q-Network with multiple views (DMDQN) based on double-DQN and dueling-DQN. For extension, we define multi-task model for more complex jobs. Moreover data augment policy is applied, which includes auto-sampling and action-overturn. The exploration policy is formed when DMDQN and data augment are combined. For robotic system's steady exploration, we designed the safety constraints according to working condition. Our experiments show that our double-task DQN with multiple views performs better than the single-task and single-view model. Combining our DMDQN and data augment, the robotic system can reach the object in an exploration way.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"239 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132821971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Skin colour forms a curved manifold in RGB space. The variations in skin colour are largely caused by variations in the concentration of the pigments melanin and hemoglobin. Hence, linear statistical models of appearance or skin albedo are insufficiently constrained (they can produce implausible skin tones) and lack compactness (they require additional dimensions to linearly approximate a curved manifold). In this paper, we propose to use a biophysical model of skin colouration to transform skin colour into a parameter space where linear statistical modelling can take place. Hence, we propose a hybrid of biophysical and statistical modelling. We present a two-parameter spectral model of skin colouration, describe methods for fitting the model to data captured in a lightstage, and then build our hybrid model on a sample of such registered data. We present face editing results and compare our model against a pure statistical model built directly on textures.
{"title":"A Biophysical 3D Morphable Model of Face Appearance","authors":"S. Alotaibi, W. Smith","doi":"10.1109/ICCVW.2017.102","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.102","url":null,"abstract":"Skin colour forms a curved manifold in RGB space. The variations in skin colour are largely caused by variations in concentration of the pigments melanin and hemoglobin. Hence, linear statistical models of appearance or skin albedo are insufficiently constrained (they can produce implausible skin tones) and lack compactness (they require additional dimensions to linearly approximate a curved manifold). In this paper, we propose to use a biophysical model of skin colouration in order to transform skin colour into a parameter space where linear statistical modelling can take place. Hence, we propose a hybrid of biophysical and statistical modelling. We present a two parameter spectral model of skin colouration, methods for fitting the model to data captured in a lightstage and then build our hybrid model on a sample of such registered data. We present face editing results and compare our model against a pure statistical model built directly on textures.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131802351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The popularization of multimedia content on the Web has given rise to the need to automatically understand, index and retrieve it. In this paper we present ViTS, an automatic Video Tagging System which learns from videos, their web context and comments shared on social networks. ViTS analyses massive multimedia collections by crawling the Internet and maintains a knowledge base that is updated in real time with no need for human supervision. As a result, each video is indexed with a rich set of labels and linked with other related content. ViTS is an industrial product under exploitation, with a vocabulary of over 2.5M concepts and capable of indexing more than 150k videos per month. We compare the quality and completeness of our tags with those in the YouTube-8M dataset, and show that ViTS enhances the semantic annotation of the videos with a larger number of labels (10.04 tags/video) at an accuracy of 80.87%. Extracted tags and video summaries are publicly available.
{"title":"ViTS: Video Tagging System from Massive Web Multimedia Collections","authors":"Delia Fernandez, David Varas, Joan Espadaler, Issey Masuda, Jordi Ferreira, A. Woodward, David Rodriguez, Xavier Giró-i-Nieto, J. C. Riveiro, Elisenda Bou","doi":"10.1109/ICCVW.2017.48","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.48","url":null,"abstract":"The popularization of multimedia content on the Web has arised the need to automatically understand, index and retrieve it. In this paper we present ViTS, an automatic Video Tagging System which learns from videos, their web context and comments shared on social networks. ViTS analyses massive multimedia collections by Internet crawling, and maintains a knowledge base that updates in real time with no need of human supervision. As a result, each video is indexed with a rich set of labels and linked with other related contents. ViTS is an industrial product under exploitation with a vocabulary of over 2.5M concepts, capable of indexing more than 150k videos per month. We compare the quality and completeness of our tags with respect to the ones in the YouTube-8M dataset, and we show how ViTS enhances the semantic annotation of the videos with a larger number of labels (10.04 tags/video), with an accuracy of 80,87%. Extracted tags and video summaries are publicly available.1","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131841558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}