We demonstrate the use of semantic object detections as robust features for Visual Teach and Repeat (VTR). Recent CNN-based object detectors can reliably detect objects from tens or hundreds of categories in video at frame rate. We show that such detections are repeatable enough to serve as landmarks for VTR, without any low-level image features. Since object detections are highly invariant to lighting and surface appearance changes, our VTR can cope with global lighting changes and local movements of the landmark objects. In the teaching phase we build extremely compact scene descriptors: a list of detected object labels and their image-plane locations. In the repeating phase, we use SeqSLAM-like relocalization to identify the most similar learned scene, then use a motion control algorithm based on the funnel lane theory to navigate the robot along the previously piloted trajectory. We evaluate the method on a commodity UAV, examining the robustness of the algorithm to new viewpoints, lighting conditions, and movements of landmark objects. The results suggest that semantic object features are useful because of their invariance to superficial appearance changes, compared to low-level image features.
{"title":"Robust UAV Visual Teach and Repeat Using Only Sparse Semantic Object Features","authors":"A. Toudeshki, Faraz Shamshirdar, R. Vaughan","doi":"10.1109/CRV.2018.00034","DOIUrl":"https://doi.org/10.1109/CRV.2018.00034","url":null,"abstract":"We demonstrate the use of semantic object detections as robust features for Visual Teach and Repeat (VTR). Recent CNN-based object detectors are able to reliably detect objects of tens or hundreds of categories in video at frame rates. We show that such detections are repeatable enough to use as landmarks for VTR, without any low-level image features. Since object detections are highly invariant to lighting and surface appearance changes, our VTR can cope with global lighting changes and local movements of the landmark objects. In the teaching phase we build extremely compact scene descriptors: a list of detected object labels and their image-plane locations. In the repeating phase, we use Seq-SLAM-like relocalization to identify the most similar learned scene, then use a motion control algorithm based on the funnel lane theory to navigate the robot along the previously piloted trajectory. We evaluate the method on a commodity UAV, examining the robustness of the algorithm to new viewpoints, lighting conditions, and movements of landmark objects. The results suggest that semantic object features could be useful due to their invariance to superficial appearance changes compared to low-level image features.","PeriodicalId":281779,"journal":{"name":"2018 15th Conference on Computer and Robot Vision (CRV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132816884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this work we investigate urban reconstruction and propose a complete, automatic framework for reconstructing urban areas from remote sensing data. First, we address the complex problem of semantic labeling and propose a novel network architecture named SegNeXT. It combines the strengths of deep autoencoders with feed-forward links, which produce smooth predictions and reduce the number of learning parameters, with the effectiveness that cardinality-enabled residual building blocks have shown in improving prediction accuracy while outperforming deeper and wider architectures with fewer learning parameters. The network is trained on benchmark datasets, and the reported results show that it provides classification at least comparable to, and in some cases better than, the state of the art. Second, we address the problem of urban reconstruction and propose a complete pipeline for automatically converting semantic labels into virtual representations of the urban areas. Agglomerative clustering is performed on the points according to their classification, yielding a set of contiguous, disjoint clusters. Finally, each cluster is processed according to the class it belongs to: tree clusters are replaced with procedural models, cars are replaced with simplified CAD models, building boundaries are extruded to form 3D models, and road, low vegetation, and clutter clusters are triangulated and simplified. The result is a complete virtual representation of the urban area. The proposed framework has been extensively tested on large-scale benchmark datasets, and the semantic labeling and reconstruction results are reported.
{"title":"Deep Autoencoders with Aggregated Residual Transformations for Urban Reconstruction from Remote Sensing Data","authors":"T. Forbes, Charalambos (Charis) Poullis","doi":"10.1109/CRV.2018.00014","DOIUrl":"https://doi.org/10.1109/CRV.2018.00014","url":null,"abstract":"In this work we investigate urban reconstruction and propose a complete and automatic framework for reconstructing urban areas from remote sensing data. Firstly, we address the complex problem of semantic labeling and propose a novel network architecture named SegNeXT which combines the strengths of deep-autoencoders with feed-forward links in generating smooth predictions and reducing the number of learning parameters, with the effectiveness which cardinality-enabled residual-based building blocks have shown in improving prediction accuracy and outperforming deeper/wider network architectures with a smaller number of learning parameters. The network is trained with benchmark datasets and the reported results show that it can provide at least similar and in some cases better classification than state-of-the-art. Secondly, we address the problem of urban reconstruction and propose a complete pipeline for automatically converting semantic labels into virtual representations of the urban areas. An agglomerative clustering is performed on the points according to their classification and results in a set of contiguous and disjoint clusters. Finally, each cluster is processed according to the class it belongs: tree clusters are substituted with procedural models, cars are replaced with simplified CAD models, buildings' boundaries are extruded to form 3D models, and road, low vegetation, and clutter clusters are triangulated and simplified. The result is a complete virtual representation of the urban area. The proposed framework has been extensively tested on large-scale benchmark datasets and the semantic labeling and reconstruction results are reported.","PeriodicalId":281779,"journal":{"name":"2018 15th Conference on Computer and Robot Vision (CRV)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132765477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we present a comparative study of two state-of-the-art object detection architectures: an end-to-end CNN-based framework called SSD [1] and an LSTM-based framework [2] which we refer to as the LSTM-decoder. To this end, we study the two architectures in the context of people-head detection on several benchmark datasets containing small to moderately large numbers of head instances appearing at varying scales and occlusion levels. To better capture the pros and cons of the two architectures, we applied them with several deep feature extractors (e.g., Inception-V2, Inception-ResNet-V2, and MobileNet-V1) and report the accuracy, speed, and generalization ability of the approaches. Our experimental results show that while the LSTM-decoder can be more accurate at detecting smaller head instances, especially in the presence of occlusion, the sheer detection speed and superior ability to generalize over multiple scales make SSD an ideal choice for real-time people detection.
{"title":"Deep People Detection: A Comparative Study of SSD and LSTM-decoder","authors":"Md.Atiqur Rahman, Prince Kapoor, R. Laganière, Daniel Laroche, Changyun Zhu, Xiaoyin Xu, A. Ors","doi":"10.1109/CRV.2018.00050","DOIUrl":"https://doi.org/10.1109/CRV.2018.00050","url":null,"abstract":"In this paper, we present a comparative study of two state-of-the-art object detection architectures - an end-to-end CNN-based framework called SSD [1] and an LSTM-based framework [2] which we refer to as LSTM-decoder. To this end, we study the two architectures in the context of people head detection on few benchmark datasets having small to moderately large number of head instances appearing in varying scales and occlusion levels. In order to better capture the pros and cons of the two architectures, we applied them with several deep feature extractors (e.g., Inception-V2, Inception-ResNet-V2 and MobileNet-V1) and report accuracy, speed and generalization ability of the approaches. Our experimental results show that while the LSTM-decoder can be more accurate in realizing smaller head instances especially in the presence of occlusions, the sheer detection speed and superior ability to generalize over multiple scales make SSD an ideal choice for real-time people detection.","PeriodicalId":281779,"journal":{"name":"2018 15th Conference on Computer and Robot Vision (CRV)","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115689949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider the problem of action detection in untrimmed videos. We argue that the contextual information in a video is important for this task. Based on this intuition, we design a network using a bidirectional Long Short-Term Memory (Bi-LSTM) model that captures the contextual information in videos. Our model includes a modified loss function that enforces the network to learn action progression, and a backpropagation scheme in which gradients are weighted according to their origin on the temporal scale. LSTMs are good at capturing long temporal dependencies, but not as good at modeling local temporal features. In our model, we therefore use a 3D Convolutional Neural Network (3D ConvNet) to capture the local spatio-temporal features of the videos. We perform a comprehensive analysis of the importance of learning the context of the video. Finally, we evaluate our work on two action detection datasets, ActivityNet and THUMOS'14. Our method achieves competitive results compared with existing approaches on both datasets.
{"title":"Context-Aware Action Detection in Untrimmed Videos Using Bidirectional LSTM","authors":"Jaideep Singh Chauhan, Yang Wang","doi":"10.1109/CRV.2018.00039","DOIUrl":"https://doi.org/10.1109/CRV.2018.00039","url":null,"abstract":"We consider the problem of action detection in untrimmed videos. We argue that the contextual information in a video is important for this task. Based on this intuition, we design a network using a bidirectional Long Short Term Memory (Bi-LSTM) model that captures the contextual information in videos. Our model includes a modified loss function which enforces the network to learn action progression, and a backpropagation in which gradients are weighted on the basis of their origin on the temporal scale. LSTMs are good at capturing the long temporal dependencies, but not so good at modeling local temporal features. In our model, we use a 3-D Convolutional Neural Network (3-D ConvNet) for capturing the local spatio-temporal features of the videos. We perform a comprehensive analysis on the importance of learning the context of the video. Finally, we evaluate our work on two action detection datasets, namely ActivityNet and THUMOS'14. Our method achieves competitive results compared with the existing approaches on both datasets.","PeriodicalId":281779,"journal":{"name":"2018 15th Conference on Computer and Robot Vision (CRV)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125506757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Matching an occluded contour against all the full contours in a database is an NP-hard problem. In this paper we present a suboptimal solution to this problem. We demonstrate the efficacy of our algorithm by matching partially occluded leaves against a database of full leaves. We smooth the leaf contours using a beta-spline and then use the Discrete Curve Evolution (DCE) algorithm to extract feature points. We then perform subgraph matching, using the DCE points as graph nodes. This algorithm decomposes each closed contour into many open contours. We compute a number of similarity parameters for each open contour and the occluded contour, and apply an inverse similarity transform to the occluded contour. This allows the occluded contour and any open contour to be overlaid. We then compute the quality of matching for each such pair of open contours using the Fréchet distance metric and select the best η matching contours. Since the Fréchet distance is computationally cheap but not guaranteed to produce the best answer, we then use an energy functional that always finds the best match among the best η matches but is considerably more expensive to compute. The functional uses local and global curvature, String Context descriptors, and String Cut features. We minimize this energy functional using the well-known GNCCP algorithm over the η open contours, yielding the best match. Experiments on a publicly available leaf image database show that our method is both effective and efficient, significantly outperforming other current state-of-the-art leaf matching methods when faced with leaf occlusion.
{"title":"Occluded Leaf Matching with Full Leaf Databases Using Explicit Occlusion Modelling","authors":"Ayan Chaudhury, J. Barron","doi":"10.1109/CRV.2018.00012","DOIUrl":"https://doi.org/10.1109/CRV.2018.00012","url":null,"abstract":"Matching an occluded contour with all the full contours in a database is an NP-hard problem. We present a suboptimal solution for this problem in this paper. We demonstrate the efficacy of our algorithm by matching partially occluded leaves with a database of full leaves. We smooth the leaf contours using a beta spline and then use the Discrete Contour Evaluation (DCE) algorithm to extract feature points. We then use subgraph matching, using the DCE points as graph nodes. This algorithm decomposes each closed contour into many open contours. We compute a number of similarity parameters for each open contour and the occluded contour. We perform an inverse similarity transform on the occluded contour. This allows the occluded contour and any open contour to be overlaid\". We that compute the quality of matching for each such pair of open contours using the Fréchet distance metric. We select the best eta matched contours. Since the Fréchet distance metric is computationally cheap to compute but not always guaranteed to produce the best answer we then use an energy functional that always find best match among the best eta matches but is considerably more expensive to compute. The functional uses local and global curvature String Context descriptors and String Cut features. We minimize this energy functional using the well known GNCCP algorithm for the eta open contours yielding the best match. Experiments on a publicly available leaf image database shows that our method is both effective and efficient significantly outperforming other current state-of-the-art leaf matching methods when faced with leaf occlusion.","PeriodicalId":281779,"journal":{"name":"2018 15th Conference on Computer and Robot Vision (CRV)","volume":"236 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122818771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Human pose estimation in ice hockey is one of the biggest challenges in computer vision-driven sports analytics, with difficulties such as bulky hockey wear, color similarity between the ice rink and player jerseys, and the presence of additional sports equipment such as hockey sticks. As a result, deep neural network architectures typically used for sports such as baseball, soccer, and track and field perform poorly when applied to hockey. Inspired by the idea that the position of the hockey stick can not only improve hockey player pose estimation but also be used to assess a player's performance, a novel HyperStackNet architecture has been designed and implemented for joint player and stick pose estimation. In addition to improving player pose estimation, the HyperStackNet architecture enables improved transfer learning from pre-trained stacked hourglass networks trained on a different domain. Experimental results demonstrate that when HyperStackNet is trained to detect 18 different joint positions on a hockey player (including the hockey stick), it achieves 98.8% accuracy on the test dataset, demonstrating its efficacy for complex joint player and stick pose estimation from video.
{"title":"HyperStackNet: A Hyper Stacked Hourglass Deep Convolutional Neural Network Architecture for Joint Player and Stick Pose Estimation in Hockey","authors":"H. Neher, Kanav Vats, A. Wong, David A Clausi","doi":"10.1109/CRV.2018.00051","DOIUrl":"https://doi.org/10.1109/CRV.2018.00051","url":null,"abstract":"Human pose estimation in ice hockey is one of the biggest challenges in computer vision-driven sports analytics, with a variety of difficulties such as bulky hockey wear, color similarity between ice rink and player jersey and the presence of additional sports equipment used by the players such as hockey sticks. As such, deep neural network architectures typically used for sports including baseball, soccer, and track and field perform poorly when applied to hockey. Inspired by the idea that the position of the hockey sticks can not only be useful for improving hockey player pose estimation but also can be used for assessing a player's performance, a novel HyperStackNet architecture has been designed and implemented for joint player and stick pose estimation. In addition to improving player pose estimation, the HyperStackNet architecture enables improved transfer learning from pre-trained stacked hourglass networks trained on a different domain. Experimental results demonstrate that when the HyperStackNet is trained to detect 18 different joint positions on a hockey player (including the hockey stick) the accuracy is 98.8% on the test dataset, thus demonstrating its efficacy for handling complex joint player and stick pose estimation from video.","PeriodicalId":281779,"journal":{"name":"2018 15th Conference on Computer and Robot Vision (CRV)","volume":"123 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115447911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a real-time 3D shape fusion system that faithfully integrates very high resolution 3D scans with the goal of maximizing detail preservation. The system fully maps complex shapes while allowing free movement, similarly to dense SLAM systems in robotics where sensor fusion techniques map large environments. We propose a novel framework that integrates shapes into a volume while preserving the fine details of the reconstructed shape, an important aspect in many applications, especially industrial inspection. The truncated signed distance function is generalized with a global variational scheme that controls edge preservation and leads to cumulative update rules suited to GPU implementation. The framework also embeds a map deformation method to deform the shape online and correct the system's trajectory drift with an accuracy of a few microns. Results from the integrated system are presented on two mechanical objects and illustrate the benefits of the proposed approach.
{"title":"Real-Time Large-Scale Fusion of High Resolution 3D Scans with Details Preservation","authors":"H. Sekkati, Jonathan Boisvert, G. Godin, L. Borgeat","doi":"10.1109/CRV.2018.00019","DOIUrl":"https://doi.org/10.1109/CRV.2018.00019","url":null,"abstract":"This paper presents a real-time 3D shape fusion system that faithfully integrates very high resolution 3D scans with the goal of maximizing details preservation. The system fully maps complex shapes while allowing free movement similarly to dense SLAM systems in robotics where sensor fusion techniques map large environments. We propose a novel framework to integrate shapes into a volume with fine details preservation of the reconstructed shape which is an important aspect in many applications, especially for industrial inspection. The truncated signed distance function is generalized with a global variational scheme that controls edge preservation and leads to updating cumulative rules adapted for GPU implementation. The framework also embeds a map deformation method to online deform the shape and correct the system trajectory drift at few microns accuracy. Results are presented from the integrated system on two mechanical objects which illustrate the benefits of the proposed approach.","PeriodicalId":281779,"journal":{"name":"2018 15th Conference on Computer and Robot Vision (CRV)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134037556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Visual homing enables an autonomous robot to move to a target (home) position using only visual information. While 2D visual homing has been widely studied, homing in 3D space still requires much attention. This paper presents a novel 3D visual homing method that can be applied to commodity Unmanned Aerial Vehicles (UAVs). First, relative camera poses are estimated from feature correspondences between current views and the reference home image. Homing vectors are then computed and used to guide the UAV toward the 3D home location. All computations can be performed in real time on mobile devices through a mobile app. To validate our approach, we conducted quantitative evaluations on popular image sequence datasets and performed real-world experiments on a quadcopter (a DJI Mavic Pro). Experimental results demonstrate the effectiveness of the proposed method.
{"title":"3D Visual Homing for Commodity UAVs","authors":"Hao Cai, Sipan Ye, A. Vardy, Minglun Gong","doi":"10.1109/CRV.2018.00045","DOIUrl":"https://doi.org/10.1109/CRV.2018.00045","url":null,"abstract":"Visual homing enables an autonomous robot to move to a target (home) position using only visual information. While 2D visual homing has been widely studied, homing in 3D space still requires much attention. This paper presents a novel 3D visual homing method which can be applied to commodity Unmanned Aerial Vehicles (UAVs). Firstly, relative camera poses are estimated through feature correspondences between current views and the reference home image. Then homing vectors are computed and utilized to guide the UAV toward the 3D home location. All computations can be performed in real-time on mobile devices through a mobile app. To validate our approach, we conducted quantitative evaluations on the most popular image sequence datasets and performed real experiments on a quadcopter (i.e., DJI Mavic Pro). Experimental results demonstrate the effectiveness of the proposed method.","PeriodicalId":281779,"journal":{"name":"2018 15th Conference on Computer and Robot Vision (CRV)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116051406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a new method for learning filters for the 2D discrete wavelet transform, extending our previous work on the 1D wavelet transform to images. We show that the 2D wavelet transform can be represented as a modified convolutional neural network (CNN), which allows us to learn wavelet filters from data by gradient descent. Our learned wavelets are similar to traditional wavelets, which are typically derived using Fourier methods. For filter comparison, we use a cosine measure under all filter rotations. The learned wavelets capture the structure of the training data, and we can generate images from our model in order to evaluate the filters. The main finding of this work is that wavelet functions can arise naturally from data, without the need for Fourier methods. Our model requires relatively few parameters compared to traditional CNNs and is easily incorporated into neural network frameworks.
{"title":"Learning Filters for the 2D Wavelet Transform","authors":"D. Recoskie, Richard Mann","doi":"10.1109/CRV.2018.00036","DOIUrl":"https://doi.org/10.1109/CRV.2018.00036","url":null,"abstract":"We propose a new method for learning filters for the 2D discrete wavelet transform. We extend our previous work on the 1D wavelet transform in order to process images. We show that the 2D wavelet transform can be represented as a modified convolutional neural network (CNN). Doing so allows us to learn wavelet filters from data by gradient descent. Our learned wavelets are similar to traditional wavelets which are typically derived using Fourier methods. For filter comparison, we make use of a cosine measure under all filter rotations. The learned wavelets are able to capture the structure of the training data. Furthermore, we can generate images from our model in order to evaluate the filters. The main findings of this work is that wavelet functions can arise naturally from data, without the need for Fourier methods. Our model requires relatively few parameters compared to traditional CNNs, and is easily incorporated into neural network frameworks.","PeriodicalId":281779,"journal":{"name":"2018 15th Conference on Computer and Robot Vision (CRV)","volume":"309 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114808345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a new algorithm for single-camera 3D reconstruction, or 3D input for human-computer interfaces, based on precise tracking of an elongated object, such as a pen, marked with a pattern of colored bands. To configure the system, the user provides no more than one labelled image of a handmade pointer, measurements of its colored bands, and the camera's pinhole projection matrix. Other systems are of much higher cost and complexity, requiring combinations of multiple cameras, stereo cameras, and pointers with sensors and lights. Instead of relying on information from multiple devices, we examine our single view more closely, integrating geometric and appearance constraints to robustly track the pointer in the presence of occlusion and distractor objects. By probing objects of known geometry with the pointer, we demonstrate acceptable accuracy of 3D localization.
{"title":"Do-It-Yourself Single Camera 3D Pointer Input Device","authors":"Bernard Llanos, Herbert Yang","doi":"10.1109/CRV.2018.00038","DOIUrl":"https://doi.org/10.1109/CRV.2018.00038","url":null,"abstract":"We present a new algorithm for single camera 3D reconstruction, or 3D input for human-computer interfaces, based on precise tracking of an elongated object, such as a pen, having a pattern of colored bands. To configure the system, the user provides no more than one labelled image of a handmade pointer, measurements of its colored bands, and the camera's pinhole projection matrix. Other systems are of much higher cost and complexity, requiring combinations of multiple cameras, stereocameras, and pointers with sensors and lights. Instead of relying on information from multiple devices, we examine our single view more closely, integrating geometric and appearance constraints to robustly track the pointer in the presence of occlusion and distractor objects. By probing objects of known geometry with the pointer, we demonstrate acceptable accuracy of 3D localization.","PeriodicalId":281779,"journal":{"name":"2018 15th Conference on Computer and Robot Vision (CRV)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125024051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}