Efficient and robust multi-camera 3D object detection in bird-eye-view

IF 4.2 3区 计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Image and Vision Computing Pub Date : 2025-02-01 DOI:10.1016/j.imavis.2025.105428
Yuanlong Wang, Hengtao Jiang, Guanying Chen, Tong Zhang, Jiaqing Zhou, Zezheng Qing, Chunyan Wang, Wanzhong Zhao
{"title":"Efficient and robust multi-camera 3D object detection in bird-eye-view","authors":"Yuanlong Wang,&nbsp;Hengtao Jiang,&nbsp;Guanying Chen,&nbsp;Tong Zhang,&nbsp;Jiaqing Zhou,&nbsp;Zezheng Qing,&nbsp;Chunyan Wang,&nbsp;Wanzhong Zhao","doi":"10.1016/j.imavis.2025.105428","DOIUrl":null,"url":null,"abstract":"<div><div>Bird's-eye view (BEV) representations are increasingly used in autonomous driving perception due to their comprehensive, unobstructed vehicle surroundings. Compared to transformer or depth based methods, ray transformation based methods are more suitable for vehicle deployment and more efficient. However, these methods typically depend on accurate extrinsic camera parameters, making them vulnerable to performance degradation when calibration errors or installation changes occur. In this work, we follow ray transformation based methods and propose an extrinsic parameters free approach, which reduces reliance on accurate offline camera extrinsic calibration by using a neural network to predict extrinsic parameters online and can effectively improve the robustness of the model. In addition, we propose a multi-level and multi-scale image encoder to better encode image features and adopt a more intensive temporal fusion strategy. Our framework further mainly contains four important designs: (1) a multi-level and multi-scale image encoder, which can leverage multi-scale information on the inter-layer and the intra-layer for better performance, (2) ray-transformation with extrinsic parameters free approach, which can transfers image features to BEV space and lighten the impact of extrinsic disturbance on m-odel's detection performance, (3) an intensive temporal fusion strategy using motion information from five historical frames. (4) a high-performance BEV encoder that efficiently reduces the spatial dimensions of a voxel-based feature map and fuse the multi-scale and the multi-frame BEV features. Experiments on nuScenes show that our best model (R101@900 × 1600) realized competitive 41.7% mAP and 53.8% NDS on the validation set, which outperforming several state-of-the-art visual BEV models in 3D object detection.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105428"},"PeriodicalIF":4.2000,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625000162","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Bird's-eye view (BEV) representations are increasingly used in autonomous driving perception due to their comprehensive, unobstructed vehicle surroundings. Compared to transformer or depth based methods, ray transformation based methods are more suitable for vehicle deployment and more efficient. However, these methods typically depend on accurate extrinsic camera parameters, making them vulnerable to performance degradation when calibration errors or installation changes occur. In this work, we follow ray transformation based methods and propose an extrinsic parameters free approach, which reduces reliance on accurate offline camera extrinsic calibration by using a neural network to predict extrinsic parameters online and can effectively improve the robustness of the model. In addition, we propose a multi-level and multi-scale image encoder to better encode image features and adopt a more intensive temporal fusion strategy. Our framework further mainly contains four important designs: (1) a multi-level and multi-scale image encoder, which can leverage multi-scale information on the inter-layer and the intra-layer for better performance, (2) ray-transformation with extrinsic parameters free approach, which can transfers image features to BEV space and lighten the impact of extrinsic disturbance on m-odel's detection performance, (3) an intensive temporal fusion strategy using motion information from five historical frames. (4) a high-performance BEV encoder that efficiently reduces the spatial dimensions of a voxel-based feature map and fuse the multi-scale and the multi-frame BEV features. Experiments on nuScenes show that our best model (R101@900 × 1600) realized competitive 41.7% mAP and 53.8% NDS on the validation set, which outperforming several state-of-the-art visual BEV models in 3D object detection.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
高效鲁棒的鸟瞰图多摄像头三维目标检测
由于鸟瞰(BEV)表示具有全面、无障碍的车辆环境,因此越来越多地用于自动驾驶感知。与基于变压器或深度的方法相比,基于射线变换的方法更适合车辆部署,效率更高。然而,这些方法通常依赖于精确的外部相机参数,当校准错误或安装变化发生时,它们很容易受到性能下降的影响。在本文中,我们遵循基于射线变换的方法,提出了一种无外在参数的方法,该方法通过使用神经网络在线预测外在参数,减少了对精确的离线相机外在定标的依赖,有效地提高了模型的鲁棒性。此外,我们提出了一种多层次、多尺度的图像编码器,以更好地编码图像特征,并采用更密集的时间融合策略。我们的框架进一步主要包含四个重要的设计:(1)多层多尺度图像编码器,利用层间和层内的多尺度信息获得更好的性能;(2)无外在参数的射线变换方法,将图像特征转移到BEV空间,减轻外在干扰对m- model检测性能的影响;(3)利用5个历史帧的运动信息进行密集时间融合策略。(4)高效降低体素特征图空间维度,融合多尺度、多帧BEV特征的高性能BEV编码器。在nuScenes上的实验表明,我们的最佳模型(R101@900 × 1600)在验证集上实现了41.7%的mAP和53.8%的NDS,在3D目标检测中优于几种最先进的视觉BEV模型。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Image and Vision Computing
Image and Vision Computing 工程技术-工程:电子与电气
CiteScore
8.50
自引率
8.50%
发文量
143
审稿时长
7.8 months
期刊介绍: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.
期刊最新文献
STIFormer: RGB-T tracking via Spatial–Temporal Interaction Transformer Efficient ultra-lightweight convolutional attention network for embedded identity document recognition system DSAC-Hash: Distribution-Similarity-Aware Cross-modal Hashing Non-target information also matters: InverseFormer tracker for single object tracking RelPose-TTA: Energy-based relative pose correction for test-time adaptation of category-level object pose estimation
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1