CMAE-3D: Contrastive Masked AutoEncoders for Self-Supervised 3D Object Detection

IF 9.3 | Zone 2, Computer Science | JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | International Journal of Computer Vision | Pub Date: 2024-12-11 | DOI: 10.1007/s11263-024-02313-2

Yanan Zhang, Jiaxin Chen, Di Huang

Citations: 0

Abstract

LiDAR-based 3D object detection is a crucial task for autonomous driving, owing to its accurate object recognition and localization capabilities in the 3D real-world space. However, existing methods heavily rely on time-consuming and laborious large-scale labeled LiDAR data, posing a bottleneck for both performance improvement and practical applications. In this paper, we propose Contrastive Masked AutoEncoders for self-supervised 3D object detection, dubbed as CMAE-3D, which is a promising solution to effectively alleviate label dependency in 3D perception. Specifically, we integrate Contrastive Learning (CL) and Masked AutoEncoders (MAE) into one unified framework to fully utilize the complementary characteristics of global semantic representation and local spatial perception. Furthermore, from the perspective of MAE, we develop the Geometric-Semantic Hybrid Masking (GSHM) to selectively mask representative regions in point clouds with imbalanced foreground-background and uneven density distribution, and design the Multi-scale Latent Feature Reconstruction (MLFR) to capture high-level semantic features while mitigating the redundant reconstruction of low-level details. From the perspective of CL, we present Hierarchical Relational Contrastive Learning (HRCL) to mine rich semantic similarity information while alleviating the issue of negative sample mismatch from both the voxel-level and frame-level. Extensive experiments demonstrate the effectiveness of our pre-training method when applied to multiple mainstream 3D object detectors (SECOND, CenterPoint and PV-RCNN) on three popular datasets (KITTI, Waymo and nuScenes).
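To make the idea of a unified objective concrete, here is a minimal sketch of how a masked-reconstruction loss and a contrastive loss might be summed into one pre-training objective. It is an illustration under stated assumptions, not the authors' implementation: plain random masking stands in for GSHM, a feature-space mean squared error stands in for MLFR, a standard InfoNCE loss stands in for HRCL, and the function names and the weight `lam` are hypothetical.

```python
import math
import random


def random_mask(num_voxels, mask_ratio, rng):
    """Pick voxel indices to hide. Plain random masking (assumption), not GSHM."""
    num_masked = int(num_voxels * mask_ratio)
    return set(rng.sample(range(num_voxels), num_masked))


def reconstruction_loss(predicted, target, masked):
    """Mean squared error between feature vectors, averaged over masked voxels only."""
    if not masked:
        return 0.0
    total = 0.0
    for i in masked:
        dim = len(target[i])
        total += sum((p - t) ** 2 for p, t in zip(predicted[i], target[i])) / dim
    return total / len(masked)


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.1):
    """Standard InfoNCE: queries[i] should match keys[i]; other keys are negatives."""
    q = [_normalize(v) for v in queries]
    k = [_normalize(v) for v in keys]
    loss = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(q)


def joint_loss(predicted, target, masked, view_a, view_b, lam=1.0):
    """Weighted sum of the two terms; `lam` is a hypothetical balancing weight."""
    return reconstruction_loss(predicted, target, masked) + lam * info_nce(view_a, view_b)


# Toy usage: 8 voxels with 4-dimensional features and two slightly different views.
rng = random.Random(0)
target = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
masked = random_mask(len(target), 0.5, rng)
predicted = [[x + 0.1 for x in row] for row in target]  # imperfect reconstruction
view_b = [[x + rng.gauss(0, 0.05) for x in row] for row in target]
print(joint_loss(predicted, target, masked, target, view_b))
```

The reconstruction term only looks at masked voxels, so it rewards local spatial inference, while the contrastive term compares whole feature vectors across views, which is the complementary pairing the abstract describes.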

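Source journal

International Journal of Computer Vision (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 29.80

Self-citation rate: 2.10%

Articles published: 163

Review time: 6 months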

About the journal: The International Journal of Computer Vision (IJCV) publishes 12 issues a year of high-quality, original contributions to the science and engineering of computer vision. Regular articles (up to 25 journal pages) present significant technical advances of broad interest to the field. Short articles (up to 10 pages) offer a faster publication path for novel results. Survey articles (up to 30 pages) critically evaluate the current state of the art or give tutorial presentations of relevant topics. The journal also publishes book reviews, position papers, and editorials by prominent scientific figures, and it encourages authors to provide supplementary material online, such as images, video sequences, data sets, and software, to aid understanding and reproducibility.

Latest articles from the journal

Sample-efficient Audio-Visual Learning of Scene Acoustics

CoP: Chain of Perception for Referring 3D Instance Segmentation

FreeTraj: Tuning-Free Trajectory Control via Noise Guided Video Diffusion

Video Shadow Detection with Intra- and Inter-video Cooperation

TARGO and TARGO-Net: Benchmarking Target-Driven Object Grasping Under Occlusions