Attention-based Depth Distillation with 3D-Aware Positional Encoding for Monocular 3D Object Detection

arXiv publication date: 2022-11-30 · DOI: 10.1609/aaai.v37i3.25391
Zizhang Wu, Yunzhe Wu, Jian Pu, Xianzhi Li, Xiaoquan Wang
{"title":"Attention-based Depth Distillation with 3D-Aware Positional Encoding for Monocular 3D Object Detection","authors":"Zizhang Wu, Yunzhe Wu, Jian Pu, Xianzhi Li, Xiaoquan Wang","doi":"10.1609/aaai.v37i3.25391","DOIUrl":null,"url":null,"abstract":"Monocular 3D object detection is a low-cost but challenging task, as it requires generating accurate 3D localization solely from a single image input. Recent developed depth-assisted methods show promising results by using explicit depth maps as intermediate features, which are either precomputed by monocular depth estimation networks or jointly evaluated with 3D object detection. However, inevitable errors from estimated depth priors may lead to misaligned semantic information and 3D localization, hence resulting in feature smearing and suboptimal predictions. To mitigate this issue, we propose ADD, an Attention-based Depth knowledge Distillation framework with 3D-aware positional encoding. Unlike previous knowledge distillation frameworks that adopt stereo- or LiDAR-based teachers, we build up our teacher with identical architecture as the student but with extra ground-truth depth as input. Credit to our teacher design, our framework is seamless, domain-gap free, easily implementable, and is compatible with object-wise ground-truth depth. Specifically, we leverage intermediate features and responses for knowledge distillation. Considering long-range 3D dependencies, we propose 3D-aware self-attention and target-aware cross-attention modules for student adaptation. Extensive experiments are performed to verify the effectiveness of our framework on the challenging KITTI 3D object detection benchmark. We implement our framework on three representative monocular detectors, and we achieve state-of-the-art performance with no additional inference computational cost relative to baseline models. Our code is available at https://github.com/rockywind/ADD.","PeriodicalId":93888,"journal":{"name":"ArXiv","volume":"55 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2022-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ArXiv","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1609/aaai.v37i3.25391","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3

Abstract

Monocular 3D object detection is a low-cost but challenging task, as it requires generating accurate 3D localization solely from a single image input. Recently developed depth-assisted methods show promising results by using explicit depth maps as intermediate features, which are either precomputed by monocular depth estimation networks or jointly evaluated with 3D object detection. However, inevitable errors from estimated depth priors may lead to misaligned semantic information and 3D localization, hence resulting in feature smearing and suboptimal predictions. To mitigate this issue, we propose ADD, an Attention-based Depth knowledge Distillation framework with 3D-aware positional encoding. Unlike previous knowledge distillation frameworks that adopt stereo- or LiDAR-based teachers, we build our teacher with the same architecture as the student but with extra ground-truth depth as input. Thanks to our teacher design, our framework is seamless, domain-gap free, easily implementable, and compatible with object-wise ground-truth depth. Specifically, we leverage intermediate features and responses for knowledge distillation. Considering long-range 3D dependencies, we propose 3D-aware self-attention and target-aware cross-attention modules for student adaptation. Extensive experiments are performed to verify the effectiveness of our framework on the challenging KITTI 3D object detection benchmark. We implement our framework on three representative monocular detectors, and we achieve state-of-the-art performance with no additional inference computational cost relative to baseline models. Our code is available at https://github.com/rockywind/ADD.
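The abstract only sketches the mechanism at a high level. The snippet below is a minimal, illustrative PyTorch-style rendering of the general idea of attention-based feature distillation: a frozen teacher (same architecture as the student, but fed ground-truth depth) supplies intermediate features, the student's features attend to them via cross-attention, and an imitation loss is applied. All names here (CrossAttentionAdapter, feature_distillation_loss, the loss weighting) are assumptions for illustration, and the sketch omits the paper's 3D-aware positional encoding and response-level distillation; consult https://github.com/rockywind/ADD for the actual implementation.

```python
# Illustrative sketch only -- NOT the authors' released code.
# Student features (image-only branch) attend to teacher features
# (ground-truth-depth-informed branch) before a feature imitation loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionAdapter(nn.Module):
    """Hypothetical adaptation module: student tokens query teacher tokens."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
        # Both inputs: (B, C, H, W) feature maps -> (B, H*W, C) token sequences.
        b, c, h, w = student_feat.shape
        q = student_feat.flatten(2).transpose(1, 2)
        kv = teacher_feat.flatten(2).transpose(1, 2)
        adapted, _ = self.attn(query=q, key=kv, value=kv)
        adapted = self.norm(adapted + q)  # residual connection + layer norm
        return adapted.transpose(1, 2).reshape(b, c, h, w)


def feature_distillation_loss(student_feat: torch.Tensor,
                              teacher_feat: torch.Tensor,
                              adapter: CrossAttentionAdapter,
                              weight: float = 1.0) -> torch.Tensor:
    """Imitation loss between adapted student features and detached teacher features."""
    adapted = adapter(student_feat, teacher_feat.detach())
    return weight * F.smooth_l1_loss(adapted, teacher_feat.detach())


if __name__ == "__main__":
    adapter = CrossAttentionAdapter(channels=256)
    s = torch.randn(2, 256, 24, 80)   # student backbone features
    t = torch.randn(2, 256, 24, 80)   # teacher backbone features (GT depth as extra input)
    print(feature_distillation_loss(s, t, adapter).item())
```

In this framing the teacher adds no inference-time cost: it is used only to produce distillation targets during training, which is consistent with the abstract's claim of no additional inference computation relative to the baseline detectors.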