Kiret Dhindsa, Lauren C. Smail, M. McGrath, Luis H. Braga, S. Becker, R. Sonnadara
We evaluate the performance of a Deep Convolutional Neural Network in grading the severity of prenatal hydronephrosis (PHN), one of the most common congenital urological anomalies, from renal ultrasound images. We present results on a variety of classification tasks based on clinically defined grades of severity, including predicting, with approximately 80% accuracy, whether an ultrasound image represents a case at high risk for further complications requiring surgical intervention. The prediction rates obtained by the model are well beyond the rates of agreement among trained clinicians, suggesting that this work can lead to a useful diagnostic aid.
{"title":"Grading Prenatal Hydronephrosis from Ultrasound Imaging Using Deep Convolutional Neural Networks","authors":"Kiret Dhindsa, Lauren C. Smail, M. McGrath, Luis H. Braga, S. Becker, R. Sonnadara","doi":"10.1109/CRV.2018.00021","DOIUrl":"https://doi.org/10.1109/CRV.2018.00021","url":null,"abstract":"We evaluate the performance of a Deep Convolutional Neural Network in grading the severity of prenatal hydronephrosis (PHN), one of the most common congenital urological anomalies, from renal ultrasound images. We present results on a variety of classification tasks based on clinically defined grades of severity, including predictions of whether or not an ultrasound image represents a case that is at high risk for further complications requiring surgical intervention with approximately 80% accuracy. The prediction rates obtained by the model are well beyond the rates of agreement among trained clinicians, suggesting that this work can lead to a useful diagnostic aid.","PeriodicalId":281779,"journal":{"name":"2018 15th Conference on Computer and Robot Vision (CRV)","volume":"48 27","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120812007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a system for accurate real-time 3D face verification using a low-quality consumer depth camera. To verify the identity of a subject, we build a high-quality reference model offline by fitting a 3D morphable model to a sequence of low-quality depth images. At runtime, we compare the similarity between the reference model and a single depth image by aligning the model to the image and measuring differences between every point on the two facial surfaces. The model and the image will not match exactly due to sensor noise, occlusions, and changes in expression, hairstyle, and eye-wear; therefore, we leverage a data-driven approach to determine whether or not the model and the image match. We train a random decision forest to verify the identity of a subject, using the point-to-point distances between the reference model and the depth image as input features to the classifier. Our approach runs in real time and is designed to continuously authenticate users as they use their device. In addition, our proposed method outperforms existing 2D and 3D face verification methods on a benchmark dataset.
{"title":"Real-Time 3D Face Verification with a Consumer Depth Camera","authors":"Gregory P. Meyer, M. Do","doi":"10.1109/CRV.2018.00020","DOIUrl":"https://doi.org/10.1109/CRV.2018.00020","url":null,"abstract":"We present a system for accurate real-time 3D face verification using a low-quality consumer depth camera. To verify the identity of a subject, we built a high-quality reference model offline by fitting a 3D morphable model to a sequence of low-quality depth images. At runtime, we compare the similarity between the reference model and a single depth image by aligning the model to the image and measuring differences between every point on the two facial surfaces. The model and the image will not match exactly due to sensor noise, occlusions, as well as changes in expression, hairstyle, and eye-wear; therefore, we leverage a data driven approach to determine whether or not the model and the image match. We train a random decision forest to verify the identity of a subject where the point-to-point distances between the reference model and the depth image are used as input features to the classifier. Our approach runs in real-time and is designed to continuously authenticate a user as he/she uses his/her device. In addition, our proposed method outperforms existing 2D and 3D face verification methods on a benchmark data set.","PeriodicalId":281779,"journal":{"name":"2018 15th Conference on Computer and Robot Vision (CRV)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127766981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Indoor localization is a primary task for social robots. We are particularly interested in how to solve this problem for a mobile robot using primarily vision sensors. This work examines a critical issue in generalizing approaches for static environments to dynamic ones: (i) it considers how to deal with dynamic users in the environment who obscure landmarks that are key to safe navigation, and (ii) it considers how standard localization approaches for static environments can be augmented to deal with dynamic agents (e.g., humans). We propose an approach that integrates wheel odometry with stereo visual odometry and performs a global pose refinement to overcome errors previously accumulated by the visual and wheel odometry. We evaluate our approach through a series of controlled experiments to see how localization performance varies with an increasing number of dynamic agents present in the scene.
{"title":"Indoor Localization in Dynamic Human Environments Using Visual Odometry and Global Pose Refinement","authors":"Raghavender Sahdev, B. Chen, John K. Tsotsos","doi":"10.1109/CRV.2018.00057","DOIUrl":"https://doi.org/10.1109/CRV.2018.00057","url":null,"abstract":"Indoor Localization is a primary task for social robots. We are particularly interested in how to solve this problem for a mobile robot using primarily vision sensors. This work examines a critical issue related to generalizing approaches for static environments to dynamic ones: (i) it considers how to deal with dynamic users in the environment that obscure landmarks that are key to safe navigation, and (ii) it considers how standard localization approaches for static environments can be augmented to deal with dynamic agents (e.g., humans). We propose an approach which integrates wheel odometry with stereo visual odometry and perform a global pose refinement to overcome previously accumulated errors due to visual and wheel odometry. We evaluate our approach through a series of controlled experiments to see how localization performance varies with increasing number of dynamic agents present in the scene.","PeriodicalId":281779,"journal":{"name":"2018 15th Conference on Computer and Robot Vision (CRV)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124822751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A promising approach to depth from defocus (DfD) involves actively projecting a quasi-random point pattern onto an object and assessing the blurriness of the point projection as captured by a camera to recover the depth of the scene. Recently, it was found that the depth inference can be made not only faster but also more accurate by leveraging deep learning approaches to computationally model and predict depth based on the quasi-random point projections as captured by a camera. Motivated by the fact that deep learning techniques can automatically learn useful features from the captured image of the projection, in this paper we present an extension of this quasi-random projection approach to DfD by introducing the use of a new quasi-random projection pattern consisting of complex subpatterns instead of points. The design and choice of the subpattern used in the quasi-random projection is a key factor in the ability to achieve improved depth recovery with high fidelity. Experimental results using quasi-random projection patterns composed of a variety of non-conventional subpattern designs on complex surfaces showed that the use of complex subpatterns in the quasi-random projection pattern can significantly improve depth reconstruction quality compared to a point pattern.
{"title":"Deep Learning-Driven Depth from Defocus via Active Multispectral Quasi-Random Projections with Complex Subpatterns","authors":"A. Ma, A. Wong, David A Clausi","doi":"10.1109/CRV.2018.00048","DOIUrl":"https://doi.org/10.1109/CRV.2018.00048","url":null,"abstract":"A promising approach to depth from defocus (DfD) involves actively projecting a quasi-random point pattern onto an object and assessing the blurriness of the point projection as captured by a camera to recover the depth of the scene. Recently, it was found that the depth inference can be made not only faster but also more accurate by leveraging deep learning approaches to computationally model and predict depth based on the quasi-random point projections as captured by a camera. Motivated by the fact that deep learning techniques can automatically learn useful features from the captured image of the projection, in this paper we present an extension of this quasi-random projection approach to DfD by introducing the use of a new quasi-random projection pattern consisting of complex subpatterns instead of points. The design and choice of the subpattern used in the quasi-random projection is a key factor in the ability to achieve improved depth recovery with high fidelity. Experimental results using quasi-random projection patterns composed of a variety of non-conventional subpattern designs on complex surfaces showed that the use of complex subpatterns in the quasi-random projection pattern can significantly improve depth reconstruction quality compared to a point pattern.","PeriodicalId":281779,"journal":{"name":"2018 15th Conference on Computer and Robot Vision (CRV)","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116388508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a simple real-time system that is able to track multiple faces in live videos, broadcasts, real-time conference recordings, etc. Our proposed tracking system comprises three parts: face detection, feature extraction, and tracking. We employ a previously proposed cascaded Multi-Task Convolutional Neural Network (MTCNN) to detect faces and a simple CNN to extract features from the detected faces, and we show that a shallow network operating on these extracted feature maps is sufficient for face tracking. Our multi-face tracker runs in real time without any on-line training. We do not adjust any parameters for different input videos, and the tracker's run-time does not significantly increase with the number of faces being tracked, i.e., it is easy to deploy in new real-time applications. We evaluate our tracker on two commonly used metrics in comparison to five recent face trackers. Our proposed simple tracker performs competitively against these trackers despite occlusions in the videos and false positives or false negatives during face detection.
{"title":"Simple Real-Time Multi-face Tracking Based on Convolutional Neural Networks","authors":"Xile Li, J. Lang","doi":"10.1109/CRV.2018.00054","DOIUrl":"https://doi.org/10.1109/CRV.2018.00054","url":null,"abstract":"We present a simple real-time system that is able to track multiple faces for live videos, broadcast, real-time conference recording, etc. Our proposed tracking system is comprised of three parts: face detection, feature extraction and tracking. We employ a previously proposed cascaded Multi-Task Convolutional Neural Network (MTCNN) to detect a face, a simple CNN to extract the features of detected faces and show that a shallow network for face tracking based on the extracted feature maps of the face is sufficient. Our multi-face tracker runs in real-time without any on-line training. We do not adjust any parameters according to different input videos, and the tracker's run-time will not significantly increase with an increase in the number of faces being tracked, i.e., it is easy to deploy in new real-time applications. We evaluate our tracker based on two commonly used metrics in comparison to five recent face trackers. Our proposed simple tracker can perform competitively in comparison to these trackers despite occlusions in the videos and false positives or false negatives during face detection.","PeriodicalId":281779,"journal":{"name":"2018 15th Conference on Computer and Robot Vision (CRV)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125643410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Modasshir, Alberto Quattrini Li, Ioannis M. Rekleitis
Deep Neural Networks (DNNs) have gained tremendous popularity over the last few years for several computer vision tasks, including classification and object detection. Such techniques have been able to achieve human-level performance in many tasks and have produced results of unprecedented accuracy. As DNNs have intense computational requirements in the majority of applications, they typically run on a cluster of computers or a cutting-edge Graphics Processing Unit (GPU), often with excessive power consumption and heat generation. In many robotics applications these requirements prove to be a challenge, as on-board power is limited and heat dissipation is always a problem. In underwater robotics in particular, where space is limited, these two requirements have proven prohibitive. As a first of its kind, this paper analyzes and compares the performance of several state-of-the-art DNNs on different platforms. With a focus on the underwater domain, the capabilities of the NVIDIA Jetson TX2 and the Intel Neural Compute Stick are of particular interest. Experiments on standard datasets show how different platforms are usable on an actual robotic system, providing insights into the current state-of-the-art embedded systems. Based on these results, we propose guidelines for choosing an appropriate platform and network architecture for a robotic system.
{"title":"Deep Neural Networks: A Comparison on Different Computing Platforms","authors":"M. Modasshir, Alberto Quattrini Li, Ioannis M. Rekleitis","doi":"10.1109/CRV.2018.00060","DOIUrl":"https://doi.org/10.1109/CRV.2018.00060","url":null,"abstract":"Deep Neural Networks (DNN) have gained tremendous popularity over the last years for several computer vision tasks, including classification and object detection. Such techniques have been able to achieve human-level performance in many tasks and have produced results of unprecedented accuracy. As DNNs have intense computational requirements in the majority of applications, they utilize a cluster of computers or a cutting edge Graphical Processing Unit (GPU), often having excessive power consumption and generating a lot of heat. In many robotics applications the above requirements prove to be a challenge, as there is limited power on-board and heat dissipation is always a problem. In particular in underwater robotics with limited space, the above two requirements have been proven prohibitive. As first of this kind, this paper aims at analyzing and comparing the performance of several state-of-the-art DNNs on different platforms. With a focus on the underwater domain, the capabilities of the Jetson TX2 from NVIDIA and the Neural Compute Stick from Intel are of particular interest. Experiments on standard datasets show how different platforms are usable on an actual robotic system, providing insights on the current state-of-the-art embedded systems. Based on such results, we propose some guidelines in choosing the appropriate platform and network architecture for a robotic system.","PeriodicalId":281779,"journal":{"name":"2018 15th Conference on Computer and Robot Vision (CRV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122646786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advances in computer vision have been driven by the introduction of convolutional neural networks (ConvNets). Almost all existing methods that use hand-crafted features have been re-examined with ConvNets, achieving state-of-the-art results on various tasks. However, how ConvNet features lead to such outstanding performance is not yet completely interpretable to humans. In this paper, we propose a Hierarchical Feature Map Characterization (HFMC) pipeline in which semantic concepts are mapped to subsets of kernels based on feature maps and their corresponding filter responses. We take a closer look at ConvNet feature maps and analyze how taking different sets of feature maps into account affects output accuracy. We first determine a set of kernels, named Generic kernels, and prune them from the network. We then extract a set of Semantic kernels and analyze their effect on the results. Generic kernels and Semantic kernels are extracted based on the co-occurrence and activation energy levels of feature maps in the network. To evaluate our proposed method, we design a visual recommendation system and apply our HFMC network to retrieve styles similar to query clothing items on the DeepFashion dataset. Extensive experiments demonstrate the effectiveness of our approach for the task of style retrieval on fashion products.
{"title":"Hierarchical Feature Map Characterization in Fashion Interpretation","authors":"M. Ziaeefard, J. Camacaro, C. Bessega","doi":"10.1109/CRV.2018.00022","DOIUrl":"https://doi.org/10.1109/CRV.2018.00022","url":null,"abstract":"Recent advances in computer vision have been driven by the introduction of convolutional neural networks (ConvNets). Almost all existing methods that use hand-crafted features have been re-examined by ConvNets and achieved state of-the-art results on various tasks. However, how ConvNets features lead to outstanding performance is not completely interpretable to humans yet. In this paper, we propose a Hierarchical Feature Map Characterization (HFMC) pipeline in which semantic concepts are mapped to subsets of kernels based on feature maps and corresponding filter responses. We take a closer look at ConvNets feature maps and analyze how taking different sets of feature maps into account affect output accuracy. We first determine a set of kernels named Generic kernels and prune them from the network. We then extract a set of Semantic kernels and analyze their effects on the results. Generic kernels and Semantic kernels are extracted based on the co-occurrence and energy activation levels of feature maps in the network. To evaluate our proposed method, we design a visual recommendation system and apply our HFMC network to retrieve similar styles to query clothing items on the DeepFashion dataset. Extensive experiments demonstrate the effectiveness of our approach to the task of style retrieval on fashion products.","PeriodicalId":281779,"journal":{"name":"2018 15th Conference on Computer and Robot Vision (CRV)","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122478671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ehsan Emad Marvasti, Amir Emad Marvasti, H. Foroosh
We introduce the concept of "Folded Coding" for continuous univariate distributions, which estimates the distribution and codes the samples simultaneously. Folded Coding assumes symmetries in the distribution and requires significantly fewer parameters than conventional models when the symmetry assumption is satisfied. We incorporate the mechanics of Folded Coding into Convolutional Neural Networks (CNNs) in the form of layers referred to as Binary Expanded ReLU (BEReLU) Shared Convolutions and Instance Fully Connected (I-FC). BEReLU and I-FC force the network to have symmetric functionality in the sample space, so similar patterns of prediction are applied to regions of the space where the model has no observed samples. We experimented with BEReLU on generic networks with different parameter sizes on CIFAR-10 and CIFAR-100. Our experiments show increased accuracy for models equipped with the BEReLU layer when there are fewer parameters, while their performance remains similar to that of the original networks as the number of parameters increases. The experiments provide further evidence that estimating distribution symmetry is part of CNNs' functionality.
{"title":"Exploiting Symmetries of Distributions in CNNs and Folded Coding","authors":"Ehsan Emad Marvasti, Amir Emad Marvasti, H. Foroosh","doi":"10.1109/CRV.2018.00017","DOIUrl":"https://doi.org/10.1109/CRV.2018.00017","url":null,"abstract":"We introduce the concept of Folded Coding\" for continuous univariate distributions estimating the distribution and coding the samples simultaneously. Folded Coding assumes symmetries in the distribution and requires significantly fewer parameters compared to conventional models when the symmetry assumption is satisfied. We incorporate the mechanics of Folded Coding into Convolutional Neural Networks (CNN) in the form of layers referred to as Binary Expanded ReLU (BEReLU) Shared Convolutions and Instance Fully Connected (I-FC). BEReLU and I-FC force the network to have symmetric functionality in the space of samples. Therefore similar patterns of prediction are applied to sections of the space where the model does not have observed samples. We experimented with BEReLU on generic networks using different parameter sizes on CIFAR-10 and CIFAR-100. Our experiments show increased accuracy of the models equipped with the BEReLU layer when there are fewer parameters. The performance of the models with BEReLU layer remains similar to original network with the increase of parameter number. The experiments provide further evidence that estimation of distribution symmetry is part of CNNs' functionality.","PeriodicalId":281779,"journal":{"name":"2018 15th Conference on Computer and Robot Vision (CRV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129854583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Andrew Hryniowski, Ibrahim Ben Daya, A. Gawish, Mark Lamm, A. Wong, P. Fieguth
Projecting the same content with multiple overlapping projectors provides several advantages over using a single projector: increased brightness to overcome ambient light or projection-surface anomalies, redundancy in case of projector failure, a larger projection area, and the possibility of increased content resolution. Multi-projector resolution enhancement is the process of using multiple projectors to achieve a resolution greater than that of any individual projector in the configuration. Current resolution-enhancement techniques filter the sub-images produced by each projector using spatial or frequency-based filters. This kernel-based filtering adds significant overhead relative to the interpolation calculations, and the learned filters are extremely sensitive to calibration. This work develops a method for multi-projector resolution enhancement that integrates the filtering into the interpolation process. A system is developed to jointly condition multiple low-resolution sub-images on each other to approximate the high-resolution original content.
{"title":"Multi-projector Resolution Enhancement Through Biased Interpolation","authors":"Andrew Hryniowski, Ibrahim Ben Daya, A. Gawish, Mark Lamm, A. Wong, P. Fieguth","doi":"10.1109/CRV.2018.00035","DOIUrl":"https://doi.org/10.1109/CRV.2018.00035","url":null,"abstract":"Projecting the same content with multiple overlapping projectors provides several advantages compared to using a single projector: increased brightness to overcome ambient light or projection surface anomalies, redundancy in case of projector failure, an increase in the area being projected on, and the possibility for increased content resolution. Multi-projector resolution enhancement is the process of using multiple projectors to achieve a resolution greater than any individual projector in the configuration. Current resolution enhancement techniques perform filtering on the sub-images produced by each projector using spatial or frequency based filters. The kernel based filtering adds significant overhead relative to the interpolation calculations. In addition the learned filters are extremely sensitive to calibration. This work develops a method for performing multi-projector resolution enhancement by integrating the filtering into the interpolation process. A system is developed to jointly condition multiple low resolution sub-images on each other to approximate high resolution original content.","PeriodicalId":281779,"journal":{"name":"2018 15th Conference on Computer and Robot Vision (CRV)","volume":"s3-30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130135284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Model initialisation is an important component of object tracking. Tracking algorithms are generally provided with the first frame of a sequence and a bounding box (BB) indicating the location of the object. This BB may contain a large number of background pixels in addition to the object, which can lead parts-based tracking algorithms to initialise their object models in background regions of the BB. In this paper, we tackle this as a missing-labels problem, marking pixels sufficiently far from the BB as belonging to the background and learning the labels of the unknown pixels. Three techniques, a One-Class SVM (OC-SVM), a Sample-Based Background Model (SBBM, a novel background model based on pixel samples), and Learning Based Digital Matting (LBDM), are adapted to the problem. These are evaluated with leave-one-video-out cross-validation on the VOT2016 tracking benchmark. Our evaluation shows that both the OC-SVM and SBBM are capable of providing a good level of segmentation accuracy but are too parameter-dependent to be used in real-world scenarios. We show that LBDM achieves significantly better performance with parameters selected by cross-validation, and that it is robust to parameter variation.
{"title":"Visual Object Tracking: The Initialisation Problem","authors":"George De Ath, R. Everson","doi":"10.1109/CRV.2018.00029","DOIUrl":"https://doi.org/10.1109/CRV.2018.00029","url":null,"abstract":"Model initialisation is an important component of object tracking. Tracking algorithms are generally provided with the first frame of a sequence and a bounding box (BB) indicating the location of the object. This BB may contain a large number of background pixels in addition to the object and can lead to parts-based tracking algorithms initialising their object models in background regions of the BB. In this paper, we tackle this as a missing labels problem, marking pixels sufficiently away from the BB as belonging to the background and learning the labels of the unknown pixels. Three techniques, One-Class SVM (OC-SVM), Sampled-Based Background Model (SBBM) (a novel background model based on pixel samples), and Learning Based Digital Matting (LBDM), are adapted to the problem. These are evaluated with leave-one-video-out cross-validation on the VOT2016 tracking benchmark. Our evaluation shows both OC-SVMs and SBBM are capable of providing a good level of segmentation accuracy but are too parameter-dependent to be used in real-world scenarios. We show that LBDM achieves significantly increased performance with parameters selected by cross validation and we show that it is robust to parameter variation.","PeriodicalId":281779,"journal":{"name":"2018 15th Conference on Computer and Robot Vision (CRV)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130154941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}