Applying auxiliary supervised depth-assisted transformer and cross modal attention fusion in monocular 3D object detection
Zhijian Wang, Jie Liu, Yixiao Sun, Xiang Zhou, Boyan Sun, Dehong Kong, Jay Xu, Xiaoping Yue, Wenyu Zhang
PeerJ Computer Science, vol. 11, e2656 (published 2025-01-28). DOI: 10.7717/peerj-cs.2656. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11784746/pdf/
Abstract
Monocular 3D object detection is the most widely applied, and the most challenging, solution for autonomous driving, because 2D images lack 3D information. Existing methods are limited by inaccurate depth estimation arising from inequivalent supervision targets, and the joint use of depth and visual features faces the problem of heterogeneous fusion. In this article, we propose the Depth Detection Transformer (Depth-DETR), which applies an auxiliary-supervised, depth-assisted transformer and cross-modal attention fusion to monocular 3D object detection. Depth-DETR introduces two depth encoders in addition to the visual encoder. The two depth encoders are supervised by ground-truth depth and bounding boxes respectively; working independently, they complement each other's limitations and predict more accurate target distances. Furthermore, Depth-DETR employs cross-modal attention mechanisms to effectively fuse the three feature streams: a parallel structure of two cross-modal transformers fuses the two depth features with the visual features. Avoiding early fusion between the two depth features strengthens the final fused feature and yields better representations. Across multiple experimental validations, Depth-DETR achieved highly competitive results on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset, with an AP score of 17.49, demonstrating its strong performance in 3D object detection.
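To make the fusion scheme concrete, below is a minimal PyTorch sketch of the parallel cross-modal attention structure the abstract describes: two cross-attention branches, one per depth encoder output, each fusing a depth stream into the visual stream, with the two branch outputs merged only at the end so the depth features are never fused with each other early. All module names, dimensions, and the late-merge step are illustrative assumptions; the paper's exact Depth-DETR architecture may differ.

```python
# Hedged sketch of parallel cross-modal attention fusion (not the paper's code).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuses one depth feature stream into the visual stream via cross-attention."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # Visual tokens act as queries; depth tokens supply keys and values.
        fused, _ = self.attn(query=visual, key=depth, value=depth)
        return self.norm(visual + fused)  # residual connection + normalization

class ParallelDepthFusion(nn.Module):
    """Two cross-modal branches run in parallel, one per depth encoder,
    so the two depth streams are merged only after each is fused with vision."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.fuse_depth_gt = CrossModalFusion(dim)   # branch for the depth-supervised encoder
        self.fuse_depth_box = CrossModalFusion(dim)  # branch for the box-supervised encoder
        self.merge = nn.Linear(2 * dim, dim)         # assumed late-merge projection

    def forward(self, visual, depth_gt_feat, depth_box_feat):
        a = self.fuse_depth_gt(visual, depth_gt_feat)
        b = self.fuse_depth_box(visual, depth_box_feat)
        return self.merge(torch.cat([a, b], dim=-1))

# Toy usage: batch of 2, 100 tokens, 256-dim features per stream.
if __name__ == "__main__":
    v = torch.randn(2, 100, 256)
    d1 = torch.randn(2, 100, 256)  # features from the ground-truth-depth-supervised encoder
    d2 = torch.randn(2, 100, 256)  # features from the bounding-box-supervised encoder
    out = ParallelDepthFusion()(v, d1, d2)
    print(out.shape)  # torch.Size([2, 100, 256])
```

The design point the abstract emphasizes is visible in the structure: `d1` and `d2` never attend to each other; each is combined with the visual features independently before the single late merge.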
Journal Introduction
PeerJ Computer Science is an open-access journal covering all subject areas in computer science, backed by a prestigious advisory board and more than 300 academic editors.