Discriminative trackers employ a classification approach to separate the target from its background. To cope with variations in the target's shape and appearance, the classifier is updated online with different samples of the target and the background. Sample selection, labeling, and classifier updating are prone to various sources of error that cause the tracker to drift. We introduce an efficient version-space shrinking strategy to reduce labeling errors, and enhance the sampling strategy by measuring the tracker's uncertainty about the samples. The proposed tracker utilizes an ensemble of classifiers that represents different hypotheses about the target, diversifies them using boosting to provide larger and more consistent coverage of the version space, and tunes the classifiers' voting weights. The proposed system adjusts the model update rate by promoting co-training of the short-memory ensemble with a long-memory oracle. The proposed tracker outperformed state-of-the-art trackers on different sequences bearing various tracking challenges.
{"title":"Efficient Version-Space Reduction for Visual Tracking","authors":"Kourosh Meshgi, Shigeyuki Oba, S. Ishii","doi":"10.1109/CRV.2017.35","DOIUrl":"https://doi.org/10.1109/CRV.2017.35","url":null,"abstract":"Discriminative trackers, employ a classification approach to separate the target from its background. To cope with variations of the target shape and appearance, the classifier is updated online with different samples of the target and the background. Sample selection, labeling and updating the classifier is prone to various sources of errors that drift the tracker. We introduce the use of an efficient version space shrinking strategy to reduce the labeling errors and enhance its sampling strategy by measuring the uncertainty of the tracker about the samples. The proposed tracker, utilize an ensemble of classifiers that represents different hypotheses about the target, diversify them using boosting to provide a larger and more consistent coverage of the version-space and tune the classifiers' weights in voting. The proposed system adjusts the model update rate by promoting the co-training of the short-memory ensemble with a long-memory oracle. The proposed tracker outperformed state-of-the-art trackers on different sequences bearing various tracking challenges.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134036251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper introduces a new approach to the long-term tracking of an object in a challenging environment. The object is a cow and the environment is an enclosure in a cowshed. Some of the key challenges in this domain are a cluttered background, low contrast, and high similarity between moving objects, which greatly reduce the efficiency of most existing approaches, including those based on background subtraction. Our approach is split into object localization, instance segmentation, learning, and tracking stages. Our solution is benchmarked against a range of semi-supervised object tracking algorithms, and we show that the performance is strong and well suited to subsequent analysis. We present our solution as a first step towards broader tracking and behavior monitoring for cows in precision agriculture, with the ultimate objective of early detection of lameness.
{"title":"Bootstrapping Labelled Dataset Construction for Cow Tracking and Behavior Analysis","authors":"Aram Ter-Sarkisov, R. Ross, John D. Kelleher","doi":"10.1109/CRV.2017.25","DOIUrl":"https://doi.org/10.1109/CRV.2017.25","url":null,"abstract":"This paper introduces a new approach to the long-term tracking of an object in a challenging environment. The object is a cow and the environment is an enclosure in a cowshed. Some of the key challenges in this domain are a cluttered background, low contrast and high similarity between moving objects – which greatly reduces the efficiency of most existing approaches, including those based on background subtraction. Our approach is split into object localization, instance segmentation, learning and tracking stages. Our solution is benchmarked against a range of semi-supervised object tracking algorithms and we show that the performance is strong and well suited to subsequent analysis. We present our solution as a first step towards broader tracking and behavior monitoring for cows in precision agriculture with the ultimate objective of early detection of lameness.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125171958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large-scale variations still pose a challenge in unconstrained face detection. To the best of our knowledge, no current face detection algorithm can detect a face as large as 800 x 800 pixels while simultaneously detecting another one as small as 8 x 8 pixels within a single image with equally high accuracy. We propose a two-stage cascaded face detection framework, Multi-Path Region-based Convolutional Neural Network (MP-RCNN), that seamlessly combines a deep neural network with a classic learning strategy to tackle this challenge. The first stage is a Multi-Path Region Proposal Network (MP-RPN) that proposes faces at three different scales. It simultaneously utilizes three parallel outputs of the convolutional feature maps to predict multi-scale candidate face regions. The "atrous" convolution trick (convolution with up-sampled filters) and a newly proposed sampling layer for "hard" examples are embedded in MP-RPN to further boost its performance. The second stage is a Boosted Forests classifier, which utilizes deep facial features pooled from inside the candidate face regions as well as deep contextual features pooled from a larger region surrounding the candidate face regions. This step is included to further remove hard negative samples. Experiments show that this approach achieves state-of-the-art face detection performance on the "hard" partition of the WIDER FACE dataset, outperforming the previous best result by 9.6% in average precision.
{"title":"Multi-path Region-Based Convolutional Neural Network for Accurate Detection of Unconstrained \"Hard Faces\"","authors":"Yuguang Liu, M. Levine","doi":"10.1109/CRV.2017.20","DOIUrl":"https://doi.org/10.1109/CRV.2017.20","url":null,"abstract":"Large-scale variations still pose a challenge in unconstrained face detection. To the best of our knowledge, no current face detection algorithm can detect a face as large as 800 x 800 pixels while simultaneously detecting another one as small as 8 x 8 pixels within a single image with equally high accuracy. We propose a two-stage cascaded face detection framework, Multi-Path Region-based Convolutional Neural Network (MP-RCNN), that seamlessly combines a deep neural network with a classic learning strategy, to tackle this challenge. The first stage is a Multi-Path Region Proposal Network (MP-RPN) that proposes faces at three different scales. It simultaneously utilizes three parallel outputs of the convolutional feature maps to predict multi-scale candidate face regions. The \"atrous\" convolution trick (convolution with up-sampled filters) and a newly proposed sampling layer for \"hard\" examples are embedded in MP-RPN to further boost its performance. The second stage is a Boosted Forests classifier, which utilizes deep facial features pooled from inside the candidate face regions as well as deep contextual features pooled from a larger region surrounding the candidate face regions. This step is included to further remove hard negative samples. Experiments show that this approach achieves state-of-the-art face detection performance on the WIDER FACE dataset \"hard\" partition, outperforming the former best result by 9.6% for the Average Precision.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134104678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents two visual trackers from the two different paradigms of learning-based and registration-based tracking, and evaluates their application in image-based visual servoing. They can track object motion with four degrees of freedom (DoF), which, as we show here, is sufficient for many fine manipulation tasks. One of these trackers is a newly developed learning-based tracker that relies on learning discriminative correlation filters, while the other is a refinement of a recent 8-DoF RANSAC-based tracker adapted with a new appearance model for tracking 4-DoF motion. Both trackers are shown to provide superior performance to several state-of-the-art trackers on an existing dataset for manipulation tasks. Further, a new dataset with challenging sequences for fine manipulation tasks, captured from robot-mounted eye-in-hand (EIH) cameras, is also presented. These sequences have a variety of challenges encountered during real tasks, including jittery camera movement, motion blur, drastic scale changes, and partial occlusions. Quantitative and qualitative results on these sequences are used to show that these two trackers are robust to failures while providing the high precision that makes them suitable for such fine manipulation tasks.
{"title":"4-DoF Tracking for Robot Fine Manipulation Tasks","authors":"Mennatullah Siam, Abhineet Singh, Camilo Perez, Martin Jägersand","doi":"10.1109/CRV.2017.41","DOIUrl":"https://doi.org/10.1109/CRV.2017.41","url":null,"abstract":"This paper presents two visual trackers from the different paradigms of learning and registration based tracking and evaluates their application in image based visual servoing. They can track object motion with four degrees of freedom (DoF) which, as we will show here, is sufficient for many fine manipulation tasks. One of these trackers is a newly developed learning based tracker that relies on learning discriminative correlation filters while the other is a refinement of a recent 8 DoF RANSAC based tracker adapted with a new appearance model for tracking 4 DoF motion. Both trackers are shown to provide superior performance to several state of the art trackers on an existing dataset for manipulation tasks. Further, a new dataset with challenging sequences for fine manipulation tasks captured from robot mounted eye-in-hand (EIH) cameras is also presented. These sequences have a variety of challenges encountered during real tasks including jittery camera movement, motion blur, drastic scale changes and partial occlusions. Quantitative and qualitative results on these sequences are used to show that these two trackers are robust to failures while providing high precision that makes them suitable for such fine manipulation tasks.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128034957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An original dataset for semantic segmentation, Ciona17, is introduced, which, to the best of the authors' knowledge, is the first dataset of its kind with pixel-level annotations pertaining to invasive species in a marine environment. Diverse outdoor illumination, a range of object shapes and colours, and severe occlusion provide a significant real-world challenge for the computer vision community. An accompanying ground-truthing tool for superpixel labeling, Truth and Crop, is also introduced. Finally, we provide a baseline using a variant of Fully Convolutional Networks, and report results in terms of the standard mean intersection over union (mIoU) metric.
{"title":"The Ciona17 Dataset for Semantic Segmentation of Invasive Species in a Marine Aquaculture Environment","authors":"A. Galloway, Graham W. Taylor, Aaron Ramsay, M. Moussa","doi":"10.5683/SP/NTUOK9","DOIUrl":"https://doi.org/10.5683/SP/NTUOK9","url":null,"abstract":"An original dataset for semantic segmentation, Ciona17, is introduced, which to the best of the authors' knowledge, is the first dataset of its kind with pixel-level annotations pertaining to invasive species in a marine environment. Diverse outdoor illumination, a range of object shapes, colour, and severe occlusion provide a significant real world challenge for the computer vision community. An accompanying ground-truthing tool for superpixel labeling, Truth and Crop, is also introduced. Finally, we provide a baseline using a variant of Fully Convolutional Networks, and report results in terms of the standard mean intersection over union (mIoU) metric.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131265682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a novel deep model for assessing damage to buildings. Common damage assessment approaches utilize both pre-event and post-event data, which are not available in many cases. In this work, we focus on assessing damage to buildings using only post-disaster data, estimating the severity of destruction in a continuous fashion. Our model utilizes three different neural networks: one network for pre-processing the input data and two networks for extracting deep features from the input source. Combinations of these networks are distributed among three separate feature streams. A regressor summarizes the extracted features into a single continuous value denoting the destruction level. To evaluate the model, we collected a small dataset of ground-level images of damaged buildings. Experimental results demonstrate that models taking advantage of hierarchical rich features outperform baseline methods.
{"title":"Building Damage Assessment Using Deep Learning and Ground-Level Image Data","authors":"Karoon Rashedi Nia, Greg Mori","doi":"10.1109/CRV.2017.54","DOIUrl":"https://doi.org/10.1109/CRV.2017.54","url":null,"abstract":"We propose a novel damage assessment deep model for buildings. Common damage assessment approaches utilize both pre-event and post-event data, which are not available in many cases. In this work, we focus on assessing damage to buildings using only post-disaster. We estimate severity of destruction via in a continuous fashion. Our model utilizes three different neural networks, one network for pre-processing the input data and two networks for extracting deep features from the input source. Combinations of these networks are distributed among three separate feature streams. A regressor summarizes the extracted features into a single continuous value denoting the destruction level. To evaluate the model, we collected a small dataset of ground-level image data of damaged buildings. Experimental results demonstrate that models taking advantage of hierarchical rich features outperform baseline methods.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114269584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, the Kernelized Correlation Filter tracker (KCF) has achieved competitive performance and robustness in visual object tracking. On the other hand, visual trackers are not typically used in multiple object tracking. In this paper, we investigate how a robust visual tracker like KCF can improve multiple object tracking. Since KCF is a fast tracker, many instances can be run in parallel and still result in fast tracking. We built a multiple object tracking system based on KCF and background subtraction. Background subtraction is applied to extract moving objects and obtain their scale and size in combination with KCF outputs, while KCF is used for data association and to handle fragmentation and occlusion problems. As a result, KCF and background subtraction help each other make tracking decisions at every frame. Sometimes the KCF outputs are the most trustworthy (e.g. during occlusion), while in other cases it is the background subtraction outputs. To validate the effectiveness of our system, the algorithm was tested on four urban traffic videos from a standard dataset. Results show that our method is competitive with state-of-the-art trackers even though we use a much simpler data association step.
{"title":"Multiple Object Tracking with Kernelized Correlation Filters in Urban Mixed Traffic","authors":"Yuebin Yang, Guillaume-Alexandre Bilodeau","doi":"10.1109/CRV.2017.18","DOIUrl":"https://doi.org/10.1109/CRV.2017.18","url":null,"abstract":"Recently, the Kernelized Correlation Filters tracker (KCF) achieved competitive performance and robustness in visual object tracking. On the other hand, visual trackers are not typically used in multiple object tracking. In this paper, we investigate how a robust visual tracker like KCF can improve multiple object tracking. Since KCF is a fast tracker, many KCF can be used in parallel and still result in fast tracking. We built a multiple object tracking system based on KCF and background subtraction. Background subtraction is applied to extract moving objects and get their scale and size in combination with KCF outputs, while KCF is used for data association and to handle fragmentation and occlusion problems. As a result, KCF and background subtraction help each other to take tracking decision at every frame. Sometimes KCF outputs are the most trustworthy (e.g. during occlusion), while in some other cases, it is the background subtraction outputs. To validate the effectiveness of our system, the algorithm was tested on four urban traffic videos from a standard dataset. Results show that our method is competitive with state-of-the-art trackers even if we use a much simpler data association step.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123650486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sparse subspace clustering (SSC) is an elegant approach for unsupervised segmentation if the data points of each cluster are located in linear subspaces. This model applies, for instance, in motion segmentation if some restrictions on the camera model hold. SSC requires that problems based on the l1-norm are solved to infer which points belong to the same subspace. If these unknown subspaces are well separated, this algorithm is guaranteed to succeed. The question of how the distribution of points within the same subspace affects their clustering has received less attention. One case has been reported in which points of the same model are erroneously classified as belonging to different subspaces. In this work, it is theoretically shown when and why such spurious clusters occur. This claim is further substantiated by experimental evidence. Two algorithms based on the Dantzig selector and subspace selector are proposed to overcome this problem, and good results are reported.
{"title":"Unbiased Sparse Subspace Clustering by Selective Pursuit","authors":"H. Ackermann, B. Rosenhahn, M. Yang","doi":"10.1109/CRV.2017.28","DOIUrl":"https://doi.org/10.1109/CRV.2017.28","url":null,"abstract":"Sparse subspace clustering (SSC) is an elegant approach for unsupervised segmentation if the data points of each cluster are located in linear subspaces. This model applies, for instance, in motion segmentation if some restrictions on the camera model hold. SSC requires that problems based on the l1-norm are solved to infer which points belong to the same subspace. If these unknown subspaces are well-separated this algorithm is guaranteed to succeed. The question how the distribution of points on the same subspace effects their clustering has received less attention. One case has been reported in which points of the same model are erroneously classified to belong to different subspaces. In this work, it will be theoretically shown when and why such spurious clusters occur. This claim is further substantiated by experimental evidence. Two algorithms based on the Dantzig selector and subspace selector are proposed to overcome this problem, and good results are reported.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"131 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124641080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unmanned Aerial Vehicles (UAVs) enable numerous applications such as search-and-rescue operations, structural inspection of buildings, crop growth analysis in agriculture, 3D reconstruction, and so on. For such applications, the UAV is currently steered manually. In this paper, however, we aim to record semi-professional video footage (e.g. concerts, sport events) using fully autonomous UAVs. Evidently, this is challenging, since we need to detect and track the actor on board a UAV in real time while automatically, and smoothly, controlling the UAV based on these detections. For this, all four degrees of freedom (DoF) are controlled in separate simultaneous control loops by our vision-based algorithms. Furthermore, cinematographic rules (e.g. the rule of thirds) that position the actor at the visually optimal location in the frame need to be taken into account. We extensively validated our algorithms: each control loop and the overall final system are thoroughly evaluated with respect to both accuracy and control speed. We show that our system is able to efficiently control the UAV such that professional recordings are obtained.
{"title":"Autonomous Flying Cameraman with Embedded Person Detection and Tracking while Applying Cinematographic Rules","authors":"D. Hulens, T. Goedemé","doi":"10.1109/CRV.2017.27","DOIUrl":"https://doi.org/10.1109/CRV.2017.27","url":null,"abstract":"Unmanned Aerial Vehicles (UAVs) enable numerous applications such as search and rescue operations, structural inspection of buildings, crop growth analysis in agriculture, performing 3D reconstruction and so on. For such applications, currently the UAV is steered manually. However, in this paper we aim to record semi-professional video footage (e.g. concerts, sport events) using fully autonomous UAVs. Evidently, this is challenging since we need to detect and track the actor on-board a UAV in real-time, while automatically – and smoothly – controlling the UAV based on these detections. For this, all four DOF (Degrees of freedom) are controlled in separate simultaneous control loops by our vision-based algorithms. Furthermore cinematographic rules need to be taken into account (e.g. the rule of thirds) which position the actor at the visually optimal location in the frame. We extensively validated our algorithms: each control loop and the overall final system is thoroughly evaluated with respect to both accuracy and control speed. We show that our system is able to efficiently control the UAV such that professional recordings are obtained.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128943058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}