Pub Date: 2020-11-01 | DOI: 10.1109/3DV50981.2020.00023
Differential Photometric Consistency
Hongyi Fan, B. Kunsberg, B. Kimia
A key bottleneck in the use of Multiview Stereo (MVS) to produce high-quality reconstructions is the gaps arising from textureless, shaded areas and from the lack of fine-scale detail. Shape-from-Shading (SfS) has been used in conjunction with MVS to obtain fine-scale detail and veridical reconstruction in the gap areas. The similarity metric that gauges candidate correspondences is critical to this process, typically a combination of photometric consistency and brightness gradient constancy. Two observations motivate this paper. First, brightness gradient constancy can be erroneous due to foreshortening. Second, the standard ZSSD/NCC patchwise photometric consistency measures, when applied to shaded areas, are, to a first-order approximation, calculations of brightness gradient differences, which can likewise be subject to foreshortening. The paper proposes a novel trinocular differential photometric consistency that constrains the brightness gradients in three views so that the image gradient in one view is completely determined by the image gradients at corresponding points in the other two views. The theoretical developments here advocate the integration of this new measure, whose viability in practice is demonstrated in a set of illustrative numerical experiments.
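To make the underlying differential constraint concrete, the following is a hedged restatement via the chain rule under brightness constancy, where $w_{12}$ and $w_{13}$ denote the warps from view 1 to views 2 and 3 induced by a candidate correspondence (an illustration of the principle, not the paper's exact trinocular measure):

```latex
% Brightness constancy across the three views under the candidate warps:
%   I_1(x) = I_2(w_12(x)) = I_3(w_13(x)).
% Differentiating both identities with the chain rule gives
\nabla I_1(\mathbf{x})
  = J_{w_{12}}(\mathbf{x})^{\top}\,\nabla I_2\big(w_{12}(\mathbf{x})\big)
  = J_{w_{13}}(\mathbf{x})^{\top}\,\nabla I_3\big(w_{13}(\mathbf{x})\big).
```

Once the warps, and hence their Jacobians $J$, are fixed by the candidate correspondence, the gradient in one view is fully determined by the gradients at the corresponding points in the other two, and any discrepancy between the two transferred gradients can serve as a differential consistency residual.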
{"title":"Differential Photometric Consistency","authors":"Hongyi Fan, B. Kunsberg, B. Kimia","doi":"10.1109/3DV50981.2020.00023","DOIUrl":"https://doi.org/10.1109/3DV50981.2020.00023","url":null,"abstract":"A key bottleneck in the use of Multiview Stereo (MVS) to produce high quality reconstructions is the gaps arising from textureless, shaded areas and lack of fine-scale detail. Shape-from-Shading (SfS) has been used in conjunction with MVS to obtain fine-scale detail and veridical reconstruction in the gap areas. The similarity metric that gauges candidate correspondences is critical to this process, typically a combination of photometric consistency and brightness gradient constancy. Two observations motivate this paper. First, brightness gradient constancy can be erroneous due to foreshortening. Second, the standard ZSSD/NCC patchwise photometric consistency measures when applied to shaded areas is, to a first-order approximation, a calculation of brightness gradient differences, which can be subject to foreshortening. The paper proposes a novel trinocular differential photometric consistency that constrains the brightness gradients in three views so that the image gradient in one view is completely determined by the image gradients at corresponding points in the the other two views. The theoretical developments here advocate the integration of this new measure, whose viability in practice has been demonstrated in a set of illustrative numerical experiments.","PeriodicalId":293399,"journal":{"name":"2020 International Conference on 3D Vision (3DV)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116861183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-11-01 | DOI: 10.1109/3DV50981.2020.00075
RidgeSfM: Structure from Motion via Robust Pairwise Matching Under Depth Uncertainty
Benjamin Graham, David Novotný
We consider the problem of simultaneously estimating dense depth maps and camera poses for a large set of images of an indoor scene. While classical SfM pipelines rely on a two-step approach, where cameras are first estimated using bundle adjustment in order to ground the ensuing multi-view stereo stage, both our poses and our dense reconstructions are a direct output of an altered bundle adjuster. To this end, we parametrize each depth map with a linear combination of a limited number of basis “depth-planes” predicted in a monocular fashion by a deep net. Using a set of high-quality sparse keypoint matches, we optimize over the per-frame linear combinations of depth planes and camera poses to form a geometrically consistent cloud of keypoints. Although our bundle adjustment only considers sparse keypoints, the inferred linear coefficients of the basis planes immediately give us dense depth maps. RidgeSfM is able to collectively align hundreds of frames, which is its main advantage over recent memory-heavy deep alternatives that are typically capable of aligning no more than 10 frames. Quantitative comparisons reveal performance superior to a state-of-the-art large-scale SfM pipeline.
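The depth parametrisation lends itself to a compact implementation; below is a minimal numpy sketch under stated assumptions (function names, the pinhole projection, and the residual layout are illustrative, not the authors' code): a dense depth map is a per-frame linear combination of predicted basis depth planes, and sparse keypoint matches produce reprojection residuals that a bundle adjuster can minimise over coefficients and poses.

```python
import numpy as np

def dense_depth(coeffs, basis_planes):
    """Dense depth as a linear combination of K basis depth planes.

    coeffs:       (K,) per-frame linear coefficients (bundle-adjusted)
    basis_planes: (K, H, W) monocular network predictions for this frame
    """
    return np.tensordot(coeffs, basis_planes, axes=1)             # (H, W)

def keypoint_residuals(coeffs, basis_planes, kps_a, kps_b, K, R, t):
    """Reprojection residuals for sparse matches between frames a and b.

    kps_a, kps_b: (N, 2) matched pixel coordinates; K: (3, 3) intrinsics;
    R, t: relative pose of frame b with respect to frame a.
    """
    depth = dense_depth(coeffs, basis_planes)
    z = depth[kps_a[:, 1].astype(int), kps_a[:, 0].astype(int)]   # depth at keypoints
    pts_h = np.c_[kps_a, np.ones(len(kps_a))]                     # homogeneous pixels (N, 3)
    X = (np.linalg.inv(K) @ pts_h.T) * z                          # back-projected 3D points (3, N)
    proj = K @ (R @ X + t[:, None])                               # transform into frame b and project
    proj = (proj[:2] / proj[2]).T                                 # (N, 2)
    return (proj - kps_b).ravel()                                 # feed to a least-squares solver
```

In the full pipeline the coefficients and poses of all frames would be optimised jointly (e.g. with a sparse least-squares solver); the dense depth maps then follow directly from the recovered coefficients.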
{"title":"RidgeSfM: Structure from Motion via Robust Pairwise Matching Under Depth Uncertainty","authors":"Benjamin Graham, David Novotný","doi":"10.1109/3DV50981.2020.00075","DOIUrl":"https://doi.org/10.1109/3DV50981.2020.00075","url":null,"abstract":"We consider the problem of simultaneously estimating a dense depth map and camera pose for a large set of images of an indoor scene. While classical SfM pipelines rely on a two-step approach where cameras are first estimated using a bundle adjustment in order to ground the ensuing multi-view stereo stage, both our poses and dense reconstructions are a direct output of an altered bundle adjuster. To this end, we parametrize each depth map with a linear combination of a limited number of basis “depth-planes” predicted in a monocular fashion by a deep net. Using a set of high-quality sparse keypoint matches, we optimize over the per-frame linear combinations of depth planes and camera poses to form a geometrically consistent cloud of keypoints. Although our bundle adjustment only considers sparse keypoints, the inferred linear coefficients of the basis planes immediately give us dense depth maps. RidgeSfM is able to collectively align hundreds of frames, which is its main advantage over recent memory-heavy deep alternatives that are typically capable of aligning no more than 10 frames. Quantitative comparisons reveal performance superior to a state-of-the-art large-scale SfM pipeline.","PeriodicalId":293399,"journal":{"name":"2020 International Conference on 3D Vision (3DV)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121033629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-11-01 | DOI: 10.1109/3DV50981.2020.00106
FC-vSLAM: Integrating Feature Credibility in Visual SLAM
Shuai Xie, Wei Ma, Qiuyuan Wang, Ruchang Xu, H. Zha
Feature-based visual SLAM (vSLAM) systems compute camera poses and scene maps by detecting and matching 2D features, typically points and line segments, in image sequences. These systems often suffer from unreliable detections. In this paper, we define feature credibility (FC) for both points and line segments, integrate it into vSLAM, and develop an FC-vSLAM system based on the widely used ORB-SLAM framework. Compared with existing credibility definitions, the proposed one is more comprehensive, as it considers both temporal observation stability and perspective triangulation reliability. We incorporate the credibility into our SLAM system to suppress the influence of unreliable features on pose and map optimization. We also present a way to refine line endpoint observations via their multi-view correspondences, improving the integrity of the 3D maps. Experiments on both the TUM and 7-Scenes datasets demonstrate that our feature credibility and the multi-view line optimization are effective; the resulting FC-vSLAM system outperforms existing popular feature-based systems in both localization and mapping.
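As a minimal sketch of how a credibility score can enter the back-end, the snippet below down-weights residuals of unreliable features; the linear fusion of the two cues and the square-root weighting are illustrative assumptions, not the exact formulation in the paper.

```python
import numpy as np

def feature_credibility(obs_stability, tri_reliability, alpha=0.5):
    """Hypothetical fusion of temporal observation stability and perspective
    triangulation reliability (both assumed in [0, 1]) into one score."""
    return alpha * obs_stability + (1.0 - alpha) * tri_reliability

def credibility_weighted_residuals(residuals, credibility):
    """Scale per-feature reprojection residuals so unreliable detections have
    less influence on pose and map optimisation.

    residuals:   (N, 2) reprojection residuals
    credibility: (N,) credibility scores in [0, 1]
    """
    w = np.sqrt(credibility)                     # weight on each residual block
    return (w[:, None] * residuals).ravel()
```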
{"title":"FC-vSLAM: Integrating Feature Credibility in Visual SLAM","authors":"Shuai Xie, Wei Ma, Qiuyuan Wang, Ruchang Xu, H. Zha","doi":"10.1109/3DV50981.2020.00106","DOIUrl":"https://doi.org/10.1109/3DV50981.2020.00106","url":null,"abstract":"Feature-based visual SLAM (vSLAM) systems compute camera poses and scene maps by detecting and matching 2D features, mostly being points and line segments, from image sequences. These systems often suffer from unreliable detections. In this paper, we define feature credibility (FC) for both points and line segments, formulate it into vSLAMs and develop an FC-vSLAM system based on the widely used ORB-SLAM framework. Compared with existing credibility definitions, the proposed one, considering both temporal observation stability and perspective triangulation reliability, is more comprehensive. We formulate the credibility in our SLAM system to suppress the influences from unreliable features on the pose and map optimization. We also present a way to improve the line end observations by their multi-view correspondences, to improve the integrity of the 3D maps. Experiments on both the TUM and 7-Scenes datasets demonstrate that our feature credibility and the multi-view line optimization are effective; the developed FC-vSLAM system outperforms existing popular feature-based systems in both localization and mapping.","PeriodicalId":293399,"journal":{"name":"2020 International Conference on 3D Vision (3DV)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126094256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-11-01 | DOI: 10.1109/3DV50981.2020.00094
Deep LiDAR localization using optical flow sensor-map correspondences
Anders Sunegård, L. Svensson, Torsten Sattler
In this paper we propose a method for accurate localization of a multi-layer LiDAR sensor in a pre-recorded map, given a coarse initialization pose. The foundation of the algorithm is the use of neural-network optical flow predictions. We train a network to encode representations of the sensor measurement and the map, and then regress flow vectors at each spatial position in the sensor feature map. The flow regression network is straightforward to train, and the resulting flow field can be used with standard techniques for computing the sensor pose from sensor-to-map correspondences. Additionally, the network can regress flow at different spatial scales, which means that it is able to handle both position recovery and high-accuracy localization. We demonstrate an average localization accuracy of $<0.04\,\mathrm{m}$ in position and $<0.1^{\circ}$ in heading angle for a vehicle driving application with simulated LiDAR measurements, which is similar to point-to-point iterative closest point (ICP). The algorithm typically manages to recover the position from prior errors of more than 20 m and is significantly more robust to scenes with non-salient or repetitive structure than the baselines used for comparison.
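Since the abstract points out that the regressed flow field can be combined with standard pose-from-correspondences techniques, here is one such standard technique as a sketch: a least-squares 2D rigid alignment (Kabsch/Procrustes via SVD) of sensor points onto their flow-displaced map positions; the weighting and outlier handling a real system would add are omitted.

```python
import numpy as np

def rigid_transform_2d(src, dst):
    """Least-squares 2D rigid transform (R, t) mapping src onto dst.

    src, dst: (N, 2) corresponding sensor / map coordinates, e.g. sensor grid
    positions and the same positions displaced by the regressed flow vectors.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)                   # 2x2 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, np.linalg.det(Vt.T @ U.T)])       # guard against reflections
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t
```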
{"title":"Deep LiDAR localization using optical flow sensor-map correspondences","authors":"Anders Sunegård, L. Svensson, Torsten Sattler","doi":"10.1109/3DV50981.2020.00094","DOIUrl":"https://doi.org/10.1109/3DV50981.2020.00094","url":null,"abstract":"In this paper we propose a method for accurate localization of a multi-layer LiDAR sensor in a pre-recorded map, given a coarse initialization pose. The foundation of the algorithm is the usage of neural network optical flow predictions. We train a network to encode representations of the sensor measurement and the map, and then regress flow vectors at each spatial position in the sensor feature map. The flow regression network is straight-forward to train, and the resulting flow field can be used with standard techniques for computing sensor pose from sensor-to-map correspondences. Additionally, the network can regress flow at different spatial scales, which means that it is able to handle both position recovery and high accuracy localization. We demonstrate average localization accuracy of $lt 0.04{mathrm {m}}$ position and $lt 0.1^{circ}$ heading angle for a vehicle driving application with simulated LiDAR measurements, which is similar to point-to-point iterative closest point (ICP). The algorithm typically manages to recover position with prior error of more than 20m and is significantly more robust to scenes with non-salient or repetitive structure than the baselines used for comparison.","PeriodicalId":293399,"journal":{"name":"2020 International Conference on 3D Vision (3DV)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123565908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-11-01 | DOI: 10.1109/3DV50981.2020.00057
Learning Wasserstein Isometric Embedding for Point Clouds
Keisuke Kawano, Satoshi Koide, Takuro Kutsuna
The Wasserstein distance has been employed for determining the distance between point clouds, which have variable numbers of points and are invariant to point order. However, the high computational cost associated with the Wasserstein distance hinders its practical application to large-scale datasets. We propose a new embedding method for point clouds, which aims to embed point clouds into a Euclidean space that is isometric to the Wasserstein space defined on point clouds. In numerical experiments, we demonstrate that the point clouds decoded from the Euclidean averages and the interpolations in the embedding space accurately mimic the Wasserstein barycenters and interpolations of the point clouds. Furthermore, we show that the embedding vectors can be utilized as inputs for machine learning models (e.g., principal component analysis and neural networks).
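For context, the sketch below computes the quantity being approximated (the exact 2-Wasserstein distance between two equal-size, uniformly weighted point clouds via optimal assignment) and states, in a comment, one plausible training target for the embedding; the target is a reading of the abstract, not the paper's exact loss.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def wasserstein2(pc_a, pc_b):
    """Exact 2-Wasserstein distance between two point clouds of equal size with
    uniform weights, via optimal assignment. Its cost (roughly cubic in N) is
    the motivation for replacing it with a Euclidean distance in an embedding."""
    cost = cdist(pc_a, pc_b, "sqeuclidean")             # (N, N) pairwise squared distances
    rows, cols = linear_sum_assignment(cost)            # optimal one-to-one matching
    return np.sqrt(cost[rows, cols].mean())

# Plausible embedding objective for an encoder f: over pairs of training clouds,
# minimise | ||f(pc_a) - f(pc_b)||_2 - wasserstein2(pc_a, pc_b) |, so that
# Euclidean distances in the embedding space approximate Wasserstein distances.
```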
{"title":"Learning Wasserstein Isometric Embedding for Point Clouds","authors":"Keisuke Kawano, Satoshi Koide, Takuro Kutsuna","doi":"10.1109/3DV50981.2020.00057","DOIUrl":"https://doi.org/10.1109/3DV50981.2020.00057","url":null,"abstract":"The Wasserstein distance has been employed for determining the distance between point clouds, which have variable numbers of points and invariance of point order. However, the high computational cost associated with the Wasserstein distance hinders its practical applications for large-scale datasets. We propose a new embedding method for point clouds, which aims to embed point clouds into a Euclidean space, isometric to the Wasserstein space defined on the point clouds. In numerical experiments, we demonstrate that the point clouds decoded from the Euclidean averages and the interpolations in the embedding space accurately mimic the Wasserstein barycenters and interpolations of the point clouds. Furthermore, we show that the embedding vectors can be utilized as inputs for machine learning models (e.g., principal component analysis and neural networks).","PeriodicalId":293399,"journal":{"name":"2020 International Conference on 3D Vision (3DV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129806098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-11-01 | DOI: 10.1109/3DV50981.2020.00116
Localising In Complex Scenes Using Balanced Adversarial Adaptation
Gil Avraham, Yan Zuo, T. Drummond
Domain adaptation and generative modelling have collectively mitigated the expensive nature of data collection and labelling by leveraging the rich abundance of accurate, labelled data in simulation environments. In this work, we study the performance gap between representations optimised for localisation in simulation environments and the use of such representations in a real-world setting. Our method exploits the shared geometric similarities between simulation and real-world environments whilst maintaining invariance to visual discrepancies. This is achieved by optimising a representation extractor to project both simulated and real representations into a shared representation space. Our method uses a symmetrical adversarial approach which encourages the representation extractor to conceal the domain that features are extracted from, and simultaneously preserves robust attributes between source and target domains that are beneficial for localisation. We evaluate our method by adapting representations optimised for the indoor Habitat simulated environments (Matterport3D and Replica) to a real-world indoor environment (Active Vision Dataset), showing that it compares favourably against fully-supervised approaches.
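A hedged PyTorch sketch of the adversarial ingredient, written here in the familiar gradient-reversal (DANN-style) form; the paper's symmetrical variant, network heads, and losses may differ, and the class names are assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

class DomainDiscriminator(nn.Module):
    """Predicts whether a representation comes from simulation or the real world."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, z):
        return self.net(GradReverse.apply(z))    # extractor receives reversed gradients

# Training sketch: the discriminator minimises a binary cross-entropy loss for
# telling the domains apart, while the reversed gradient pushes the representation
# extractor to make simulated and real features indistinguishable, alongside the
# localisation objective that preserves pose-relevant attributes.
```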
{"title":"Localising In Complex Scenes Using Balanced Adversarial Adaptation","authors":"Gil Avraham, Yan Zuo, T. Drummond","doi":"10.1109/3DV50981.2020.00116","DOIUrl":"https://doi.org/10.1109/3DV50981.2020.00116","url":null,"abstract":"Domain adaptation and generative modelling have collectively mitigated the expensive nature of data collection and labelling by leveraging the rich abundance of accurate, labelled data in simulation environments. In this work, we study the performance gap that exists between representations optimised for localisation on simulation environments and the application of such representations in a real-world setting. Our method exploits the shared geometric similarities between simulation and real-world environments whilst maintaining invariance towards visual discrepancies. This is achieved by optimising a representation extractor to project both simulated and real representations into a shared representation space. Our method uses a symmetrical adversarial approach which encourages the representation extractor to conceal the domain that features are extracted from and simultaneously preserves robust attributes between source and target domains that are beneficial for localisation. We evaluate our method by adapting representations optimised for indoor Habitat simulated environments (Matterport3D and Replica) to a real-world indoor environment (Active Vision Dataset), showing that it compares favourably against fully-supervised approaches.","PeriodicalId":293399,"journal":{"name":"2020 International Conference on 3D Vision (3DV)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129243314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-11-01 | DOI: 10.1109/3DV50981.2020.00130
Deep Learning Based Single-Photon 3D Imaging with Multiple Returns
Hao Tan, Jiayong Peng, Zhiwei Xiong, Dong Liu, Xin Huang, Zheng-Ping Li, Yu Hong, Feihu Xu
The single-photon avalanche diode (SPAD) has been widely used in active 3D imaging due to its extremely high photon sensitivity and picosecond time resolution. However, long-range active 3D imaging is still a great challenge: because of the divergence of the light beam and the receiver's field of view (FoV), only a few signal photons, mixed with strong background noise, return from multiple reflectors of the scene, which introduces considerable distortion and blur into the recovered depth map. In this paper, we propose a deep-learning-based depth reconstruction method for long-range single-photon 3D imaging in which this “multiple returns” issue exists. Specifically, we model the problem as a deblurring task and design a multi-scale convolutional neural network combined with elaborate loss functions, which promote the reconstruction of an accurate depth map with fine details and clear object boundaries. On a synthetic dataset, the proposed method achieves superior performance over existing state-of-the-art methods across several receiver FoV sizes, and a model trained under a specific FoV generalizes well to different FoV sizes, which is essential for practical applications. Moreover, we conduct outdoor experiments and demonstrate the effectiveness of our method in a real-world long-range imaging system.
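As a small illustration of a loss that favours “fine details and clear object boundaries”, the sketch below combines an L1 data term with a gradient-difference term; this is a hypothetical stand-in, since the paper's actual loss functions are not reproduced here.

```python
import torch
import torch.nn.functional as F

def edge_aware_depth_loss(pred, target, lam=1.0):
    """Hypothetical reconstruction loss: L1 on depth plus an L1 penalty on the
    difference of horizontal/vertical depth gradients, encouraging sharp object
    boundaries in the recovered depth map.

    pred, target: (B, 1, H, W) depth maps; lam weights the gradient term.
    """
    l1 = F.l1_loss(pred, target)
    dx_p = pred[..., :, 1:] - pred[..., :, :-1]         # horizontal depth differences
    dy_p = pred[..., 1:, :] - pred[..., :-1, :]         # vertical depth differences
    dx_t = target[..., :, 1:] - target[..., :, :-1]
    dy_t = target[..., 1:, :] - target[..., :-1, :]
    grad = F.l1_loss(dx_p, dx_t) + F.l1_loss(dy_p, dy_t)
    return l1 + lam * grad
```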
{"title":"Deep Learning Based Single-Photon 3D Imaging with Multiple Returns","authors":"Hao Tan, Jiayong Peng, Zhiwei Xiong, Dong Liu, Xin Huang, Zheng-Ping Li, Yu Hong, Feihu Xu","doi":"10.1109/3DV50981.2020.00130","DOIUrl":"https://doi.org/10.1109/3DV50981.2020.00130","url":null,"abstract":"photon avalanche diode (SPAD) has been widely used in active 3D imaging due to its extremely high photon sensitivity and picosecond time resolution. However, long-range active 3D imaging is still a great challenge, since only a few signal photons mixed with strong background noise can return from multiple reflectors of the scene due to the divergence of the light beam and the receiver’s field of view (FoV), which would bring considerable distortion and blur to the recovered depth map. In this paper, we propose a deep learning based depth reconstruction method for long range single-photon 3D imaging where the “multiple-returns” issue exists. Specifically, we model this problem as a deblurring task and design a multi-scale convolutional neural network combined with elaborate loss functions, which promote the reconstruction of an accurate depth map with fine details and clear boundaries of objects. The proposed method achieves superior performance over several different sizes of receiver’s FoV on a synthetic dataset compared with existing state-of-the-art methods and the trained model under a specific FoV has a strong generalization capability across different sizes of FoV, which is essential for practical applications. Moreover, we conduct outdoor experiments and demonstrate the effectiveness of our method in a real-world long range imaging system.","PeriodicalId":293399,"journal":{"name":"2020 International Conference on 3D Vision (3DV)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121771878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-11-01 | DOI: 10.1109/3DV50981.2020.00058
Benchmarking Image Retrieval for Visual Localization
Noé Pion, M. Humenberger, G. Csurka, Yohann Cabon, Torsten Sattler
Visual localization, i.e., camera pose estimation in a known scene, is a core component of technologies such as autonomous driving and augmented reality. State-of-the-art localization approaches often rely on image retrieval techniques for one of two tasks: (1) provide an approximate pose estimate or (2) determine which parts of the scene are potentially visible in a given query image. It is common practice to use state-of-the-art image retrieval algorithms for these tasks. These algorithms are often trained for the goal of retrieving the same landmark under a large range of viewpoint changes. However, robustness to viewpoint changes is not necessarily desirable in the context of visual localization. This paper focuses on understanding the role of image retrieval for multiple visual localization tasks. We introduce a benchmark setup and compare state-of-the-art retrieval representations on multiple datasets. We show that retrieval performance on classical landmark retrieval/recognition tasks correlates with localization performance for some, but not all, localization tasks. This indicates a need for retrieval approaches specifically designed for localization tasks. Our benchmark and evaluation protocols are available at https://github.com/naver/kapture-localization.
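For task (1), a minimal sketch of how a retrieval step yields a coarse pose prior: transfer the pose of the best-matching database image to the query (descriptor normalisation and how the neighbour shortlist feeds later stages are assumptions; this is not the benchmark's own code).

```python
import numpy as np

def coarse_pose_from_retrieval(query_desc, db_descs, db_poses, k=5):
    """Approximate the query pose with the pose of the top-retrieved image.

    query_desc: (D,) L2-normalised global descriptor of the query image
    db_descs:   (N, D) L2-normalised descriptors of the mapped database images
    db_poses:   (N, 4, 4) known camera poses of the database images
    """
    sims = db_descs @ query_desc                # cosine similarities
    top = np.argsort(-sims)[:k]                 # indices of the k nearest neighbours
    return db_poses[top[0]], top                # coarse pose + shortlist for later stages
```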
{"title":"Benchmarking Image Retrieval for Visual Localization","authors":"No'e Pion, M. Humenberger, G. Csurka, Yohann Cabon, Torsten Sattler","doi":"10.1109/3DV50981.2020.00058","DOIUrl":"https://doi.org/10.1109/3DV50981.2020.00058","url":null,"abstract":"Visual localization, i.e., camera pose estimation in a known scene, is a core component of technologies such as autonomous driving and augmented reality. State-of-the-art localization approaches often rely on image retrieval techniques for one of two tasks: (1) provide an approximate pose estimate or (2) determine which parts of the scene are potentially visible in a given query image. It is common practice to use state-of-the-art image retrieval algorithms for these tasks. These algorithms are often trained for the goal of retrieving the same landmark under a large range of viewpoint changes. However, robustness to viewpoint changes is not necessarily desirable in the context of visual localization. This paper focuses on understanding the role of image retrieval for multiple visual localization tasks. We introduce a benchmark setup and compare state-of-the-art retrieval representations on multiple datasets. We show that retrieval performance on classical landmark retrieval/recognition tasks correlates only for some but not all tasks to localization performance. This indicates a need for retrieval approaches specifically designed for localization tasks. Our benchmark and evaluation protocols are available at https://github.com/naver/kapture-localization.","PeriodicalId":293399,"journal":{"name":"2020 International Conference on 3D Vision (3DV)","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132479216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-11-01 | DOI: 10.1109/3DV50981.2020.00019
Fast Simultaneous Gravitational Alignment of Multiple Point Sets
Vladislav Golyanik, Soshi Shimada, C. Theobalt
The problem of simultaneous rigid alignment of multiple unordered point sets which is unbiased towards any of the inputs has recently attracted increasing interest, and several reliable methods have been newly proposed. While being remarkably robust towards noise and clustered outliers, current approaches require sophisticated initialisation schemes and do not scale well to large point sets. This paper proposes a new resilient technique for simultaneous registration of multiple point sets by interpreting the latter as particle swarms rigidly moving in the mutually induced force fields. Thanks to the improved simulation with altered physical laws and the acceleration of globally multiply-linked point interactions with a $2^D$-tree (where $D$ is the space dimensionality), our Multi-Body Gravitational Approach (MBGA) is robust to noise and missing data while supporting more massive point sets than previous methods (with $10^5$ points and more). In various experimental settings, MBGA is shown to outperform several baseline point set alignment approaches in terms of accuracy and runtime. We make our source code available for the community to facilitate the reproducibility of the results (http://gvv.mpi-inf.mpg.de/projects/MBGA/).
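For reference, a brute-force version of the induced force field between two point sets under a gravitation-like inverse-square law (unit masses and a softening constant are assumptions); the paper accelerates such globally multiply-linked interactions with a $2^D$-tree and altered physical laws.

```python
import numpy as np

def gravitational_forces(X, Y, eps=1e-3):
    """O(N*M) reference computation of the force that point set Y exerts on
    each point of X under an inverse-square attraction with unit masses.

    X: (N, D), Y: (M, D) point coordinates; eps softens near-zero distances.
    Returns (N, D) force vectors acting on X.
    """
    diff = Y[None, :, :] - X[:, None, :]              # (N, M, D) pairwise displacements
    dist2 = (diff ** 2).sum(axis=-1) + eps ** 2       # softened squared distances
    forces = diff / dist2[..., None] ** 1.5           # unit vectors scaled by 1/dist^2
    return forces.sum(axis=1)
```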
{"title":"Fast Simultaneous Gravitational Alignment of Multiple Point Sets","authors":"Vladislav Golyanik, Soshi Shimada, C. Theobalt","doi":"10.1109/3DV50981.2020.00019","DOIUrl":"https://doi.org/10.1109/3DV50981.2020.00019","url":null,"abstract":"The problem of simultaneous rigid alignment of multiple unordered point sets which is unbiased towards any of the inputs has recently attracted increasing interest, and several reliable methods have been newly proposed. While being remarkably robust towards noise and clustered outliers, current approaches require sophisticated initialisation schemes and do not scale well to large point sets. This paper proposes a new resilient technique for simultaneous registration of multiple point sets by interpreting the latter as particle swarms rigidly moving in the mutually induced force fields. Thanks to the improved simulation with altered physical laws and acceleration of globally multiply-linked point interactions with a 2D-tree (D is the space dimensionality), our Multi-Body Gravitational Approach (MBGA) is robust to noise and missing data while supporting more massive point sets than previous methods (with 105 points and more). In various experimental settings, MBGA is shown to outperform several baseline point set alignment approaches in terms of accuracy and runtime. We make our source code available for the community to facilitate the reproducibility of the results1.1http://gvv.mpi-inf.mpg.de/projects/MBGA/","PeriodicalId":293399,"journal":{"name":"2020 International Conference on 3D Vision (3DV)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132904503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-11-01 | DOI: 10.1109/3DV50981.2020.00072
A Transformer-Based Network for Dynamic Hand Gesture Recognition
Andrea D'Eusanio, A. Simoni, S. Pini, G. Borghi, R. Vezzani, R. Cucchiara
Transformer-based neural networks represent a successful self-attention mechanism that achieves state-of-the-art results in language understanding and sequence modeling. However, their application to visual data and, in particular, to the dynamic hand gesture recognition task has not yet been deeply investigated. In this paper, we propose a transformer-based architecture for the dynamic hand gesture recognition task. We show that using a single active depth sensor, specifically depth maps and the surface normals estimated from them, achieves state-of-the-art results, outperforming all methods available in the literature on two automotive datasets, namely NVidia Dynamic Hand Gesture and Briareo. Moreover, we test the method with other data types available from common RGB-D devices, such as infrared and color data. We also assess performance in terms of inference time and number of parameters, showing that the proposed framework is suitable for an online in-car infotainment system.
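As an example of the second input modality, the sketch below estimates surface normals from a depth map with finite differences; this is a generic construction, not necessarily the exact procedure used in the paper.

```python
import numpy as np

def normals_from_depth(depth):
    """Per-pixel surface normals of the surface z = depth(x, y), estimated with
    finite differences and normalised to unit length.

    depth: (H, W) depth values; returns (H, W, 3) unit normals.
    """
    dzdx = np.gradient(depth, axis=1)                 # d(depth)/dx (columns)
    dzdy = np.gradient(depth, axis=0)                 # d(depth)/dy (rows)
    n = np.dstack((-dzdx, -dzdy, np.ones_like(depth)))
    n /= np.linalg.norm(n, axis=2, keepdims=True)
    return n
```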
{"title":"A Transformer-Based Network for Dynamic Hand Gesture Recognition","authors":"Andrea D'Eusanio, A. Simoni, S. Pini, G. Borghi, R. Vezzani, R. Cucchiara","doi":"10.1109/3DV50981.2020.00072","DOIUrl":"https://doi.org/10.1109/3DV50981.2020.00072","url":null,"abstract":"Transformer-based neural networks represent a successful self-attention mechanism that achieves state-of-the-art results in language understanding and sequence modeling. However, their application to visual data and, in particular, to the dynamic hand gesture recognition task has not yet been deeply investigated. In this paper, we propose a transformer-based architecture for the dynamic hand gesture recognition task. We show that the employment of a single active depth sensor, specifically the usage of depth maps and the surface normals estimated from them, achieves state-of-the-art results, overcoming all the methods available in the literature on two automotive datasets, namely NVidia Dynamic Hand Gesture and Briareo. Moreover, we test the method with other data types available with common RGB-D devices, such as infrared and color data. We also assess the performance in terms of inference time and number of parameters, showing that the proposed framework is suitable for an online in-car infotainment system.","PeriodicalId":293399,"journal":{"name":"2020 International Conference on 3D Vision (3DV)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133279379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}