Smart Time-Multiplexing of Quads Solves the Multicamera Interference Problem
Pub Date: 2020-11-01 | DOI: 10.1109/3DV50981.2020.00091
T. Pribanić, T. Petković, David Bojanić, Kristijan Bartol
Time-of-flight (ToF) cameras are becoming increasingly popular for 3D imaging. Their optimal usage has been studied from several aspects. One of the open research problems is multicamera interference, which can arise when two or more ToF cameras operate simultaneously. In this work we present an efficient method to synchronize multiple operating ToF cameras. Our method is based on time-division multiplexing but, unlike traditional time multiplexing, it does not decrease the effective camera frame rate. Additionally, for unsynchronized cameras, we provide a robust method to extract, from their corresponding video streams, frames that are not subject to the multicamera interference problem. We demonstrate our approach through a series of experiments and with different levels of support available for triggering, ranging from hardware triggering to purely random software triggering.
{"title":"Smart Time-Multiplexing of Quads Solves the Multicamera Interference Problem","authors":"T. Pribanić, T. Petković, David Bojanić, Kristijan Bartol","doi":"10.1109/3DV50981.2020.00091","DOIUrl":"https://doi.org/10.1109/3DV50981.2020.00091","url":null,"abstract":"Time-of-flight (ToF) cameras are becoming increasingly popular for 3D imaging. Their optimal usage has been studied from the several aspects. One of the open research problems is the possibility of a multicamera interference problem when two or more ToF cameras are operating simultaneously. In this work we present an efficient method to synchronize multiple operating ToF cameras. Our method is based on the time-division multiplexing, but unlike traditional time multiplexing, it does not decrease the effective camera frame rate. Additionally, for unsynchronized cameras, we provide a robust method to extract from their corresponding video streams, frames which are not subject to multicamera interference problem. We demonstrate our approach through a series of experiments and with a different level of support available for triggering, ranging from a hardware triggering to purely random software triggering.","PeriodicalId":293399,"journal":{"name":"2020 International Conference on 3D Vision (3DV)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130317912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SMPLy Benchmarking 3D Human Pose Estimation in the Wild
Pub Date: 2020-11-01 | DOI: 10.1109/3DV50981.2020.00040
Vincent Leroy, Philippe Weinzaepfel, Romain Brégier, Hadrien Combaluzier, Grégory Rogez
Predicting 3D human pose from images has seen great recent improvements. Novel approaches that can even predict both pose and shape from a single input image have been introduced, often relying on a parametric model of the human body such as SMPL. While qualitative results for such methods are often shown for images captured in the wild, a proper benchmark in such conditions is still missing, as it is cumbersome to obtain ground-truth 3D poses outside of a motion capture room. This paper presents a pipeline to easily produce and validate such a dataset with accurate ground truth, with which we benchmark recent 3D human pose estimation methods in the wild. We make use of the recently introduced Mannequin Challenge dataset, which contains in-the-wild videos of people frozen in action like statues, and leverage the fact that the people are static and the camera is moving to accurately fit the SMPL model to the sequences. A total of 24,428 frames with registered body models are then selected from 567 scenes at almost no cost, using only online RGB videos. We benchmark state-of-the-art SMPL-based human pose estimation methods on this dataset. Our results highlight that challenges remain, in particular for difficult poses or for scenes where the persons are partially truncated or occluded.
{"title":"SMPLy Benchmarking 3D Human Pose Estimation in the Wild","authors":"Vincent Leroy, Philippe Weinzaepfel, Romain Br'egier, Hadrien Combaluzier, Grégory Rogez","doi":"10.1109/3DV50981.2020.00040","DOIUrl":"https://doi.org/10.1109/3DV50981.2020.00040","url":null,"abstract":"Predicting 3D human pose from images has seen great recent improvements. Novel approaches that can even predict both pose and shape from a single input image have been introduced, often relying on a parametric model of the human body such as SMPL. While qualitative results for such methods are often shown for images captured in-the-wild, a proper benchmark in such conditions is still missing, as it is cumbersome to obtain ground-truth 3D poses elsewhere than in a motion capture room. This paper presents a pipeline to easily produce and validate such a dataset with accurate ground-truth, with which we benchmark recent 3D human pose estimation methods in-the-wild. We make use of the recently introduced Mannequin Challenge dataset which contains in-the-wild videos of people frozen in action like statues and leverage the fact that people are static and the camera moving to accurately fit the SMPL model on the sequences. A total of 24,428 frames with registered body models are then selected from 567 scenes at almost no cost, using only online RGB videos. We benchmark state-of-the-art SMPL-based human pose estimation methods on this dataset. Our results highlight that challenges remain, in particular for difficult poses or for scenes where the persons are partially truncated or occluded.","PeriodicalId":293399,"journal":{"name":"2020 International Conference on 3D Vision (3DV)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130740396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scalable Point Cloud-based Reconstruction with Local Implicit Functions
Pub Date: 2020-11-01 | DOI: 10.1109/3DV50981.2020.00110
Sandro Lombardi, Martin R. Oswald, M. Pollefeys
Surface reconstruction from point clouds has been a well-studied research topic with applications in computer vision and computer graphics. Recently, several learning-based methods were proposed for 3D shape representation through implicit functions, which among other uses can serve for point cloud-based reconstruction. Although delivering compelling results for synthetic object datasets of manageable size, they fail to represent larger scenes accurately, presumably due to the use of only one global latent code for encoding an entire scene or object. We propose to encode only parts of objects with features attached to unstructured point clouds. To this end we use a hierarchical feature map in 3D space, extracted from the input point clouds, with which local latent shape encodings can be queried at arbitrary positions. We use a permutohedral lattice to process the hierarchical feature maps sparsely and efficiently. This enables accurate and detailed point cloud-based reconstructions for large numbers of points in a time-efficient manner, with good generalization across different datasets. Experiments on synthetic and real-world datasets demonstrate the reconstruction capability of our method, which compares favorably to state-of-the-art methods.
{"title":"Scalable Point Cloud-based Reconstruction with Local Implicit Functions","authors":"Sandro Lombardi, Martin R. Oswald, M. Pollefeys","doi":"10.1109/3DV50981.2020.00110","DOIUrl":"https://doi.org/10.1109/3DV50981.2020.00110","url":null,"abstract":"Surface reconstruction from point clouds has been a well-studied research topic with applications in computer vision and computer graphics. Recently, several learningbased methods were proposed for 3D shape representation through implicit functions which among others can be used for point cloud-based reconstruction. Although delivering compelling results for synthetic object datasets of overseeable size, they fail to represent larger scenes accurately, presumably due to the use of only one global latent code for encoding an entire scene or object. We propose to encode only parts of objects with features attached to unstructured point clouds. To this end we use a hierarchical feature map in 3D space, extracted from the input point clouds, with which local latent shape encodings can be queried at arbitrary positions. We use a permutohedral lattice to process the hierarchical feature maps sparsely and efficiently. This enables accurate and detailed point cloud-based reconstructions for large amounts of points in a time-efficient manner, showing good generalization capabilities across different datasets. Experiments on synthetic and real world datasets demonstrate the reconstruction capability of our method and compare favorably to state-of-the-art methods.","PeriodicalId":293399,"journal":{"name":"2020 International Conference on 3D Vision (3DV)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132014237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rotation-Invariant Point Convolution With Multiple Equivariant Alignments
Pub Date: 2020-11-01 | DOI: 10.1109/3DV50981.2020.00060
Hugues Thomas
Recent attempts at introducing rotation invariance or equivariance in 3D deep learning approaches have shown promising results, but these methods still struggle to reach the performance of standard 3D neural networks. In this work we study the relation between equivariance and invariance in 3D point convolutions. We show that, using rotation-equivariant alignments, it is possible to make any convolutional layer rotation-invariant. Furthermore, we improve this simple alignment procedure by using the alignments themselves as features in the convolution, and by combining multiple alignments together. With this core layer, we design rotation-invariant architectures that improve state-of-the-art results in both object classification and semantic segmentation and reduce the gap between rotation-invariant and standard 3D deep learning approaches.
{"title":"Rotation-Invariant Point Convolution With Multiple Equivariant Alignments.","authors":"Hugues Thomas","doi":"10.1109/3DV50981.2020.00060","DOIUrl":"https://doi.org/10.1109/3DV50981.2020.00060","url":null,"abstract":"Recent attempts at introducing rotation invariance or equivariance in 3D deep learning approaches have shown promising results, but these methods still struggle to reach the performances of standard 3D neural networks. In this work we study the relation between equivariance and invariance in 3D point convolutions. We show that using rotation-equivariant alignments, it is possible to make any convolutional layer rotation-invariant. Furthermore, we improve this simple alignment procedure by using the alignment themselves as features in the convolution, and by combining multiple alignments together. With this core layer, we design rotation-invariant architectures which improve state-of-the-art results in both object classification and semantic segmentation and reduces the gap between rotation-invariant and standard 3D deep learning approaches.","PeriodicalId":293399,"journal":{"name":"2020 International Conference on 3D Vision (3DV)","volume":"253 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132942833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using Image Sequences for Long-Term Visual Localization
Pub Date: 2020-11-01 | DOI: 10.1109/3DV50981.2020.00104
Erik Stenborg, Torsten Sattler, Lars Hammarstrand
Estimating the pose of a camera in a known scene, i.e., visual localization, is a core task for applications such as self-driving cars. In many scenarios, image sequences are available, and existing work on combining single-image localization with odometry offers a way to unlock their potential for improving localization performance. Still, most of the literature focuses on single-image localization and ignores the availability of sequence data. The goal of this paper is to demonstrate the potential of image sequences in challenging scenarios, e.g., under day-night or seasonal changes. Combining ideas from the literature, we describe a sequence-based localization pipeline that combines odometry with both a coarse and a fine localization module. Experiments on long-term localization datasets show that combining single-image global localization against a prebuilt map with a visual odometry / SLAM pipeline improves performance to a level where the extended CMU Seasons dataset can be considered solved. We show that SIFT features can perform on par with modern state-of-the-art features in our framework, despite being much weaker and an order of magnitude faster to compute. Our code is publicly available at github.com/rulllars.
{"title":"Using Image Sequences for Long-Term Visual Localization","authors":"Erik Stenborg, Torsten Sattler, Lars Hammarstrand","doi":"10.1109/3DV50981.2020.00104","DOIUrl":"https://doi.org/10.1109/3DV50981.2020.00104","url":null,"abstract":"Estimating the pose of a camera in a known scene, i.e., visual localization, is a core task for applications such as self-driving cars. In many scenarios, image sequences are available and existing work on combining single-image localization with odometry offers to unlock their potential for improving localization performance. Still, the largest part of the literature focuses on single-image localization and ignores the availability of sequence data. The goal of this paper is to demonstrate the potential of image sequences in challenging scenarios, e.g., under day-night or seasonal changes. Combining ideas from the literature, we describe a sequence-based localization pipeline that combines odometry with both a coarse and a fine localization module. Experiments on long-term localization datasets show that combining single-image global localization against a prebuilt map with a visual odometry / SLAM pipeline improves performance to a level where the extended CMU Seasons dataset can be considered solved. We show that SIFT features can perform on par with modern state-of-the-art features in our framework, despite being much weaker and a magnitude faster to compute. Our code is publicly available at github.com/rulllars.","PeriodicalId":293399,"journal":{"name":"2020 International Conference on 3D Vision (3DV)","volume":"227 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131460905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recalibration of Neural Networks for Point Cloud Analysis
Pub Date: 2020-11-01 | DOI: 10.1109/3DV50981.2020.00054
Ignacio Sarasua, Sebastian Pölsterl, C. Wachinger
Spatial and channel re-calibration have become powerful concepts in computer vision. Their ability to capture long-range dependencies is especially useful for networks that extract local features, such as CNNs. While re-calibration has been widely studied for image analysis, it has not yet been used on shape representations. In this work, we introduce re-calibration modules into deep neural networks for 3D point clouds. We propose a set of re-calibration blocks that extend Squeeze-and-Excitation blocks [11] and that can be added to any network for 3D point cloud analysis that builds a global descriptor by hierarchically combining features from multiple local neighborhoods. We run two sets of experiments to validate our approach. First, we demonstrate the benefit and versatility of our proposed modules by incorporating them into three state-of-the-art networks for 3D point cloud analysis: PointNet++ [22], DGCNN [29], and RSCNN [18]. We evaluate each network on two tasks: object classification on ModelNet40, and object part segmentation on ShapeNet. Our results show an improvement of up to 1% in accuracy on ModelNet40 compared to the baseline method. In the second set of experiments, we investigate the benefits of re-calibration blocks for Alzheimer's Disease (AD) diagnosis. Our results demonstrate that our proposed methods yield a 2% increase in accuracy for diagnosing AD and a 2.3% increase in concordance index for predicting AD onset with time-to-event analysis. Concluding, re-calibration improves the accuracy of point cloud architectures while only minimally increasing the number of parameters.
{"title":"Recalibration of Neural Networks for Point Cloud Analysis","authors":"Ignacio Sarasua, Sebastian Pölsterl, C. Wachinger","doi":"10.1109/3DV50981.2020.00054","DOIUrl":"https://doi.org/10.1109/3DV50981.2020.00054","url":null,"abstract":"Spatial and channel re-calibration have become powerful concepts in computer vision. Their ability to capture long-range dependencies is especially useful for those networks that extract local features, such as CNNs. While recalibration has been widely studied for image analysis, it has not yet been used on shape representations. In this work, we introduce re-calibration modules on deep neural networks for 3D point clouds. We propose a set of re-calibration blocks that extend Squeeze and Excitation blocks [11] and that can be added to any network for 3D point cloud analysis that builds a global descriptor by hierarchically combining features from multiple local neighborhoods. We run two sets of experiments to validate our approach. First, we demonstrate the benefit and versatility of our proposed modules by incorporating them into three state-of-the-art networks for 3D point cloud analysis: PointNet++ [22], DGCNN [29], and RSCNN [18]. We evaluate each network on two tasks: object classification on ModelNet40, and object part segmentation on ShapeNet. Our results show an improvement of up to 1% in accuracy for ModelNet40 compared to the baseline method. In the second set of experiments, we investigate the benefits of re-calibration blocks on Alzheimer’s Disease (AD) diagnosis. Our results demonstrate that our proposed methods yield a 2% increase in accuracy for diagnosing AD and a 2.3% increase in concordance index for predicting AD onset with time-to-event analysis. Concluding, re-calibration improves the accuracy of point cloud architectures, while only minimally increasing the number of parameters.","PeriodicalId":293399,"journal":{"name":"2020 International Conference on 3D Vision (3DV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131328905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semantic Implicit Neural Scene Representations With Semi-Supervised Training
Pub Date: 2020-11-01 | DOI: 10.1109/3DV50981.2020.00052
Amit Kohli, V. Sitzmann, Gordon Wetzstein
The recent success of implicit neural scene representations has presented a viable new way to capture and store 3D scenes. Unlike conventional 3D representations, such as point clouds, which explicitly store scene properties in discrete, localized units, these implicit representations encode a scene in the weights of a neural network that can be queried at any coordinate to produce those same scene properties. Thus far, implicit representations have primarily been optimized to estimate only the appearance and/or 3D geometry of a scene. We take the next step and demonstrate that an existing implicit representation (SRNs) [67] is actually multi-modal; it can be further leveraged to perform per-point semantic segmentation while retaining its ability to represent appearance and geometry. To achieve this multi-modal behavior, we utilize a semi-supervised learning strategy atop the existing pre-trained scene representation. Our method is simple, general, and only requires a few tens of labeled 2D segmentation masks in order to achieve dense 3D semantic segmentation. We explore two novel applications for this semantically aware implicit neural scene representation: 3D novel view and semantic label synthesis given only a single input RGB image or 2D label mask, as well as 3D interpolation of appearance and semantics.
{"title":"Semantic Implicit Neural Scene Representations With Semi-Supervised Training","authors":"Amit Kohli, V. Sitzmann, Gordon Wetzstein","doi":"10.1109/3DV50981.2020.00052","DOIUrl":"https://doi.org/10.1109/3DV50981.2020.00052","url":null,"abstract":"The recent success of implicit neural scene representations has presented a viable new method for how we capture and store 3D scenes. Unlike conventional 3D representations, such as point clouds, which explicitly store scene properties in discrete, localized units, these implicit representations encode a scene in the weights of a neural network which can be queried at any coordinate to produce these same scene properties. Thus far, implicit representations have primarily been optimized to estimate only the appearance and/or 3D geometry information in a scene. We take the next step and demonstrate that an existing implicit representation (SRNs) [67] is actually multi-modal; it can be further leveraged to perform per-point semantic segmentation while retaining its ability to represent appearance and geometry. To achieve this multi-modal behavior, we utilize a semi-supervised learning strategy atop the existing pre-trained scene representation. Our method is simple, general, and only requires a few tens of labeled 2D segmentation masks in order to achieve dense 3D semantic segmentation. We explore two novel applications for this semantically aware implicit neural scene representation: 3D novel view and semantic label synthesis given only a single input RGB image or 2D label mask, as well as 3D interpolation of appearance and semantics.","PeriodicalId":293399,"journal":{"name":"2020 International Conference on 3D Vision (3DV)","volume":"131 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116267631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Depthwise Separable Temporal Convolutional Network for Action Segmentation
Pub Date: 2020-11-01 | DOI: 10.1109/3DV50981.2020.00073
Basavaraj Hampiholi, Christian Jarvers, W. Mader, H. Neumann
Fine-grained temporal action segmentation in long, untrimmed RGB videos is a key topic in visual human-machine interaction. Recent temporal convolution based approaches either use an encoder-decoder (ED) architecture or dilations with a doubling factor in consecutive convolution layers to segment actions in videos. However, ED networks operate at a low temporal resolution, and the dilations in successive layers cause gridding artifacts. We propose a depthwise separable temporal convolutional network (DS-TCN) that operates at full temporal resolution and with reduced gridding effects. The basic component of DS-TCN is the residual depthwise dilated block (RDDB). We explore the trade-off between large kernels and small dilation rates using the RDDB. We show that our DS-TCN is capable of capturing long-term dependencies as well as local temporal cues efficiently. Our evaluation on three benchmark datasets, GTEA, 50Salads, and Breakfast, demonstrates that DS-TCN outperforms the existing ED-TCN and dilation-based TCN baselines even with comparatively fewer parameters.
{"title":"Depthwise Separable Temporal Convolutional Network for Action Segmentation","authors":"Basavaraj Hampiholi, Christian Jarvers, W. Mader, H. Neumann","doi":"10.1109/3DV50981.2020.00073","DOIUrl":"https://doi.org/10.1109/3DV50981.2020.00073","url":null,"abstract":"Fine-grained temporal action segmentation in long, untrimmed RGB videos is a key topic in visual human-machine interaction. Recent temporal convolution based approaches either use encoder-decoder(ED) architecture or dilations with doubling factor in consecutive convolution layers to segment actions in videos. However ED networks operate on low temporal resolution and the dilations in successive layers cause gridding artifacts problem. We propose depthwise separable temporal convolution network (DS-TCN) that operates on full temporal resolution and with reduced gridding effects. The basic component of DS-TCN is residual depthwise dilated block (RDDB). We explore the trade-off between large kernels and small dilation rates using RDDB. We show that our DS-TCN is capable of capturing long-term dependencies as well as local temporal cues efficiently. Our evaluation on three benchmark datasets, GTEA, 50Salads, and Breakfast demonstrates that DS-TCN outperforms the existing ED-TCN and dilation based TCN baselines even with comparatively fewer parameters.","PeriodicalId":293399,"journal":{"name":"2020 International Conference on 3D Vision (3DV)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122776509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simulated Annealing for 3D Shape Correspondence
Pub Date: 2020-11-01 | DOI: 10.1109/3DV50981.2020.00035
Benjamin Holzschuh, Zorah Lähner, D. Cremers
We propose to use Simulated Annealing to solve the correspondence problem between near-isometric 3D shapes. Our method gains efficiency by quickly upsampling a sparse correspondence, minimizing the embedding error of new samples on the surfaces and applying simulated annealing to refine the result. The algorithm alternates between sampling additional points on the surface and swapping points within the current solution according to Simulated Annealing theory. Simulated Annealing is a probabilistic method that is less prone to getting stuck in local extrema, which allows us to obtain good results on the NP-hard quadratic assignment problem (QAP). Our method can be used as a stand-alone correspondence pipeline through an initial seed generator as well as to densify a set of sparse input matches. Furthermore, the use of locality-sensitive hashing to approximate geodesic distances reduces the computational complexity and memory consumption significantly. This allows our algorithm to run on meshes with over 100k points, an accomplishment that few approaches tackling the QAP directly achieve. We show convincing results on datasets like TOSCA and SHREC'19 Connectivity.
{"title":"Simulated Annealing for 3D Shape Correspondence","authors":"Benjamin Holzschuh, Zorah Lähner, D. Cremers","doi":"10.1109/3DV50981.2020.00035","DOIUrl":"https://doi.org/10.1109/3DV50981.2020.00035","url":null,"abstract":"We propose to use Simulated Annealing to solve the correspondence problem between near-isometric 3D shapes. Our method gains efficiency through quickly upsampling a sparse correspondence by minimizing the embedding error of new samples on the surfaces and applying simulated annealing to refine the result. The algorithm alternates between sampling additional points on the surface and swapping points within the current solution according to Simulated Annealing theory. Simulated Annealing is a probabilistic method and less prone to get stuck in local extrema which allows us to obtain good results on the NPhard quadratic assignment problem} (QAP). Our method can be used as a stand-alone correspondence pipeline through an initial seed generator as well as to densify a set of sparse input matches. Furthermore, the use of locality sensitive hashing to approximate geodesic distances reduces the computational complexity and memory consumption significantly. This allows our algorithm to run on meshes with over 100k points, an accomplishment that few approaches tackling the QAP directly achieve. We show convincing results on datasets like TOSCA and SHREC’19 Connecitvity.","PeriodicalId":293399,"journal":{"name":"2020 International Conference on 3D Vision (3DV)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123475521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adversarial Self-Supervised Scene Flow Estimation
Pub Date: 2020-11-01 | DOI: 10.1109/3DV50981.2020.00115
Victor Zuanazzi, Joris van Vugt, O. Booij, P. Mettes
This work proposes a metric learning approach for self-supervised scene flow estimation. Scene flow estimation is the task of estimating 3D flow vectors for consecutive 3D point clouds. Such flow vectors are useful, e.g., for recognizing actions or avoiding collisions. Training a neural network via supervised learning for scene flow is impractical, as this requires manual annotations for each 3D point at each new timestamp for each scene. To that end, we seek a self-supervised approach, where a network learns a latent metric to distinguish between points translated by flow estimations and the target point cloud. Our adversarial metric learning includes a multi-scale triplet loss on sequences of two point clouds as well as a cycle consistency loss. Furthermore, we outline a benchmark for self-supervised scene flow estimation: the Scene Flow Sandbox. The benchmark consists of five datasets designed to study individual aspects of flow estimation in progressive order of complexity, from a moving object to real-world scenes. Experimental evaluation on the benchmark shows that our approach obtains state-of-the-art self-supervised scene flow results, outperforming recent neighbor-based approaches. We use our proposed benchmark to expose shortcomings and draw insights on various training setups. We find that our setup captures motion coherence and preserves local geometries. Dealing with occlusions, on the other hand, is still an open challenge.
{"title":"Adversarial Self-Supervised Scene Flow Estimation","authors":"Victor Zuanazzi, Joris van Vugt, O. Booij, P. Mettes","doi":"10.1109/3DV50981.2020.00115","DOIUrl":"https://doi.org/10.1109/3DV50981.2020.00115","url":null,"abstract":"This work proposes a metric learning approach for self-supervised scene flow estimation. Scene flow estimation is the task of estimating 3D flow vectors for consecutive 3D point clouds. Such flow vectors are fruitful, e.g. for recognizing actions, or avoiding collisions. Training a neural network via supervised learning for scene flow is impractical, as this requires manual annotations for each 3D point at each new timestamp for each scene. To that end, we seek for a self-supervised approach, where a network learns a latent metric to distinguish between points translated by flow estimations and the target point cloud. Our adversarial metric learning includes a multi-scale triplet loss on sequences of two-point clouds as well as a cycle consistency loss. Furthermore, we outline a benchmark for self-supervised scene flow estimation: the Scene Flow Sandbox. The benchmark consists of five datasets designed to study individual aspects of flow estimation in progressive order of complexity, from a moving object to real-world scenes. Experimental evaluation on the benchmark shows that our approach obtains state-of-the-art self-supervised scene flow results, outperforming recent neighbor-based approaches. We use our proposed benchmark to expose shortcomings and draw insights on various training setups. We find that our setup captures motion coherence and preserves local geometries. Dealing with occlusions, on the other hand, is still an open challenge.","PeriodicalId":293399,"journal":{"name":"2020 International Conference on 3D Vision (3DV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129000386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}