Efficient and robust multi-camera 3D object detection in bird-eye-view

IF 4.2 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Image and Vision Computing Pub Date : 2025-02-01 DOI:10.1016/j.imavis.2025.105428

Yuanlong Wang, Hengtao Jiang, Guanying Chen, Tong Zhang, Jiaqing Zhou, Zezheng Qing, Chunyan Wang, Wanzhong Zhao

{"title":"Efficient and robust multi-camera 3D object detection in bird-eye-view","authors":"Yuanlong Wang, Hengtao Jiang, Guanying Chen, Tong Zhang, Jiaqing Zhou, Zezheng Qing, Chunyan Wang, Wanzhong Zhao","doi":"10.1016/j.imavis.2025.105428","DOIUrl":null,"url":null,"abstract":"<div><div>Bird's-eye view (BEV) representations are increasingly used in autonomous driving perception due to their comprehensive, unobstructed vehicle surroundings. Compared to transformer or depth based methods, ray transformation based methods are more suitable for vehicle deployment and more efficient. However, these methods typically depend on accurate extrinsic camera parameters, making them vulnerable to performance degradation when calibration errors or installation changes occur. In this work, we follow ray transformation based methods and propose an extrinsic parameters free approach, which reduces reliance on accurate offline camera extrinsic calibration by using a neural network to predict extrinsic parameters online and can effectively improve the robustness of the model. In addition, we propose a multi-level and multi-scale image encoder to better encode image features and adopt a more intensive temporal fusion strategy. Our framework further mainly contains four important designs: (1) a multi-level and multi-scale image encoder, which can leverage multi-scale information on the inter-layer and the intra-layer for better performance, (2) ray-transformation with extrinsic parameters free approach, which can transfers image features to BEV space and lighten the impact of extrinsic disturbance on m-odel's detection performance, (3) an intensive temporal fusion strategy using motion information from five historical frames. (4) a high-performance BEV encoder that efficiently reduces the spatial dimensions of a voxel-based feature map and fuse the multi-scale and the multi-frame BEV features. Experiments on nuScenes show that our best model (R101@900 × 1600) realized competitive 41.7% mAP and 53.8% NDS on the validation set, which outperforming several state-of-the-art visual BEV models in 3D object detection.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105428"},"PeriodicalIF":4.2000,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625000162","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Bird's-eye view (BEV) representations are increasingly used in autonomous driving perception due to their comprehensive, unobstructed vehicle surroundings. Compared to transformer or depth based methods, ray transformation based methods are more suitable for vehicle deployment and more efficient. However, these methods typically depend on accurate extrinsic camera parameters, making them vulnerable to performance degradation when calibration errors or installation changes occur. In this work, we follow ray transformation based methods and propose an extrinsic parameters free approach, which reduces reliance on accurate offline camera extrinsic calibration by using a neural network to predict extrinsic parameters online and can effectively improve the robustness of the model. In addition, we propose a multi-level and multi-scale image encoder to better encode image features and adopt a more intensive temporal fusion strategy. Our framework further mainly contains four important designs: (1) a multi-level and multi-scale image encoder, which can leverage multi-scale information on the inter-layer and the intra-layer for better performance, (2) ray-transformation with extrinsic parameters free approach, which can transfers image features to BEV space and lighten the impact of extrinsic disturbance on m-odel's detection performance, (3) an intensive temporal fusion strategy using motion information from five historical frames. (4) a high-performance BEV encoder that efficiently reduces the spatial dimensions of a voxel-based feature map and fuse the multi-scale and the multi-frame BEV features. Experiments on nuScenes show that our best model (R101@900 × 1600) realized competitive 41.7% mAP and 53.8% NDS on the validation set, which outperforming several state-of-the-art visual BEV models in 3D object detection.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

高效鲁棒的鸟瞰图多摄像头三维目标检测

由于鸟瞰（BEV）表示具有全面、无障碍的车辆环境，因此越来越多地用于自动驾驶感知。与基于变压器或深度的方法相比，基于射线变换的方法更适合车辆部署，效率更高。然而，这些方法通常依赖于精确的外部相机参数，当校准错误或安装变化发生时，它们很容易受到性能下降的影响。在本文中，我们遵循基于射线变换的方法，提出了一种无外在参数的方法，该方法通过使用神经网络在线预测外在参数，减少了对精确的离线相机外在定标的依赖，有效地提高了模型的鲁棒性。此外，我们提出了一种多层次、多尺度的图像编码器，以更好地编码图像特征，并采用更密集的时间融合策略。我们的框架进一步主要包含四个重要的设计：(1)多层多尺度图像编码器，利用层间和层内的多尺度信息获得更好的性能；(2)无外在参数的射线变换方法，将图像特征转移到BEV空间，减轻外在干扰对m- model检测性能的影响；(3)利用5个历史帧的运动信息进行密集时间融合策略。(4)高效降低体素特征图空间维度，融合多尺度、多帧BEV特征的高性能BEV编码器。在nuScenes上的实验表明，我们的最佳模型（R101@900 × 1600）在验证集上实现了41.7%的mAP和53.8%的NDS，在3D目标检测中优于几种最先进的视觉BEV模型。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Image and Vision Computing 工程技术-工程：电子与电气

CiteScore

8.50

自引率

8.50%

发文量

143

审稿时长

7.8 months

期刊介绍： Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.