Weakly Supervised Monocular 3D Object Detection by Spatial-Temporal View Consistency

Wencheng Han, Runzhou Tao, Haibin Ling, Jianbing Shen
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 1, pp. 84-98
DOI: 10.1109/TPAMI.2024.3466915
Published: 2024-09-24
URL: https://ieeexplore.ieee.org/document/10689672/

Abstract

Monocular 3D object detection plays a crucial role in the field of self-driving cars, estimating the size and location of objects solely from input images. However, a notable disparity exists between the training and inference of 3D object detectors. This discrepancy arises because, during inference, monocular 3D detectors depend solely on images captured by cameras, whereas during training these methods require 3D ground truths labeled on point cloud data, which is obtained using specialized devices such as LiDAR. This discrepancy creates a break in the data loop, preventing feedback data from production cars from being used to enhance the robustness of the detectors. To address this issue and close the data loop, we present a weakly supervised solution that trains monocular 3D object detectors using only 2D labels, eliminating the requirement for 3D ground truths. Our approach considers two types of view consistency, spatial and temporal, which play a crucial role in regulating the prediction of 3D bounding boxes. Spatial view consistency is achieved by employing projection and multi-view consistency techniques to guide the optimization of the target's location and size. We leverage temporal viewpoint consistency to provide temporal multi-view image pairs, and we further introduce temporal movement consistency to tackle the challenge of dynamic scenes. With only 2D ground truths, our method achieves performance comparable to fully supervised methods. Additionally, our method can be employed as a pre-training method and achieves significant improvements when fine-tuned with a small proportion of fully supervised labels.
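The core idea behind the projection-based spatial consistency described above is that a predicted 3D box, when projected through the camera intrinsics, should enclose a 2D region that matches the 2D ground-truth label. The sketch below (our own illustration, not the authors' released code; the corner convention, KITTI-like intrinsics, and example values are assumptions) projects the eight corners of a 3D box into the image and scores the result against a 2D box with IoU:

```python
import numpy as np

def project_box_to_2d(center, size, yaw, K):
    """Project the 8 corners of a 3D box (camera coordinates, y pointing
    down) through intrinsics K; return the enclosing 2D box (u1, v1, u2, v2)."""
    l, h, w = size
    # Corner offsets in the object frame.
    dx = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * l / 2
    dy = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * h / 2
    dz = np.array([1, -1, 1, -1, 1, -1, 1, -1]) * w / 2
    # Rotate around the vertical (y) axis by yaw, then translate.
    c, s = np.cos(yaw), np.sin(yaw)
    x = c * dx + s * dz + center[0]
    y = dy + center[1]
    z = -s * dx + c * dz + center[2]
    pts = K @ np.vstack([x, y, z])          # pinhole projection
    u, v = pts[0] / pts[2], pts[1] / pts[2]
    return np.array([u.min(), v.min(), u.max(), v.max()])

def iou_2d(a, b):
    """Intersection-over-union of two axis-aligned 2D boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

# Toy example: KITTI-like intrinsics, a car-sized box 10 m ahead of the camera.
K = np.array([[721.5, 0.0, 609.5],
              [0.0, 721.5, 172.9],
              [0.0, 0.0, 1.0]])
box2d = project_box_to_2d(center=(0.0, 1.0, 10.0),
                          size=(4.0, 1.5, 1.8), yaw=0.0, K=K)
```

In a weakly supervised setting, `1 - iou_2d(box2d, label2d)` (or a differentiable variant) could serve as the projection-consistency loss that lets 2D labels supervise the 3D location and size, with the multi-view and temporal terms resolving the remaining depth ambiguity.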