Qinglin Tong, Junjie Zhang, Chenggang Yan, Dan Zeng
DOI: 10.1016/j.imavis.2024.105229
Journal: Image and Vision Computing, Volume 150, Article 105229
Published: 2024-08-15 (Journal Article)
Impact factor: 4.2; JCR Q2, Computer Science, Artificial Intelligence
URL: https://www.sciencedirect.com/science/article/pii/S0262885624003342
Citations: 0
A streamlined framework for BEV-based 3D object detection with prior masking
In the field of autonomous driving, perception tasks based on Bird's-Eye-View (BEV) have attracted considerable research attention due to their numerous benefits. Despite recent advancements in performance, efficiency remains a challenge for real-world deployment. In this study, we propose an efficient and effective framework that constructs a spatio-temporal BEV feature from multi-camera inputs and leverages it for 3D object detection. Specifically, the success of our network is primarily attributed to the design of the lifting strategy and a tailored BEV encoder. The lifting strategy converts 2D features into 3D representations. Since the images carry no depth information, we introduce a prior mask for the BEV feature, which assesses the significance of the feature along each camera ray at low cost. Moreover, we design a lightweight BEV encoder that significantly boosts the capacity of this physically interpretable representation. In the encoder, we investigate the spatial relationships of the BEV feature and retain rich residual information from upstream. To further enhance performance, we establish a 2D object detection auxiliary head to exploit the cues offered by 2D object detection, and leverage the 4D information to mine the cues within the sequence. Benefiting from all these designs, our network captures abundant semantic information from 3D scenes and strikes a balance between efficiency and performance.
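The lifting step described in the abstract — smearing 2D image features along camera rays into a BEV grid, with a cheap prior mask weighting cells along each ray instead of a predicted depth distribution — can be illustrated with a minimal conceptual sketch. This is not the authors' implementation: the pinhole geometry, the exponential distance decay used as the prior, and all shapes are illustrative assumptions.

```python
# Conceptual sketch (not the paper's code): lift one camera's 2D features
# into a BEV grid without depth prediction. Each BEV cell is projected back
# to an image column, and a hand-crafted "prior mask" (a distance decay here)
# down-weights cells far along the camera ray.
import numpy as np

def lift_with_prior_mask(feat_2d, bev_size=32, max_range=50.0, decay=0.05):
    """feat_2d: (C, W) column-pooled image features for one forward camera.
    Returns a (C, bev_size, bev_size) BEV feature."""
    C, W = feat_2d.shape
    bev = np.zeros((C, bev_size, bev_size))
    # BEV cell centres in ego coordinates: x forward (depth), y lateral.
    xs = (np.arange(bev_size) + 0.5) / bev_size * max_range
    ys = (np.arange(bev_size) + 0.5) / bev_size * max_range - max_range / 2
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            # Project the cell onto the image plane (pinhole, focal length = W).
            col = int(W / 2 + (y / x) * W)
            if 0 <= col < W:
                prior = np.exp(-decay * x)  # prior mask: nearer cells weigh more
                bev[:, i, j] = prior * feat_2d[:, col]
    return bev

rng = np.random.default_rng(0)
feats = rng.random((8, 64))            # 8 channels, 64 image columns
bev = lift_with_prior_mask(feats)
print(bev.shape)                       # (8, 32, 32)
```

A learned prior mask, as in the paper, would replace the fixed exponential decay with weights estimated from the BEV feature itself, but the weighting role along the ray is the same.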
About the journal:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.