Deep Height Decoupling for Precise Vision-based 3D Occupancy Prediction

Yuan Wu, Zhiqiang Yan, Zhengxue Wang, Xiang Li, Le Hui, Jian Yang
arXiv:2409.07972 · 2024-09-12 · arXiv - CS - Computer Vision and Pattern Recognition · Citations: 0

Abstract

The task of vision-based 3D occupancy prediction aims to reconstruct 3D geometry and estimate its semantic classes from 2D color images, where the 2D-to-3D view transformation is an indispensable step. Most previous methods conduct forward projection, such as BEVPooling and VoxelPooling, both of which map the 2D image features into 3D grids. However, a grid that represents features within a certain height range usually introduces many confusing features belonging to other height ranges. To address this challenge, we present Deep Height Decoupling (DHD), a novel framework that incorporates an explicit height prior to filter out the confusing features. Specifically, DHD first predicts height maps via explicit supervision. Based on the height distribution statistics, DHD designs Mask Guided Height Sampling (MGHS) to adaptively decouple the height map into multiple binary masks. MGHS projects the 2D image features into multiple subspaces, where each grid contains features within a reasonable height range. Finally, a Synergistic Feature Aggregation (SFA) module is deployed to enhance the feature representation through channel and spatial affinities, enabling further occupancy refinement. On the popular Occ3D-nuScenes benchmark, our method achieves state-of-the-art performance even with minimal input frames. Code is available at https://github.com/yanzq95/DHD.
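The core idea of MGHS described above, decoupling a predicted height map into binary masks and using each mask to gate the image features, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); the function name, the fixed height bins, and the tensor shapes here are assumptions for illustration only.

```python
import numpy as np

def mask_guided_height_sampling(height_map, features, height_bins):
    """Decouple a per-pixel height map into one binary mask per height
    interval, then gate the image features with each mask so that every
    subspace only keeps features from a plausible height range.

    height_map : (H, W) predicted height per pixel (metres)
    features   : (C, H, W) 2D image features
    height_bins: list of (lo, hi) half-open intervals [lo, hi)
    Returns a list of (C, H, W) masked feature maps, one per interval.
    """
    subspaces = []
    for lo, hi in height_bins:
        mask = ((height_map >= lo) & (height_map < hi)).astype(features.dtype)
        # Broadcast the (H, W) mask over the channel dimension.
        subspaces.append(features * mask[None, :, :])
    return subspaces

# Toy example: a 2x2 height map and 3-channel features of ones.
h = np.array([[0.2, 1.5],
              [3.0, 0.8]])
f = np.ones((3, 2, 2))
bins = [(0.0, 1.0), (1.0, 2.5), (2.5, 4.0)]
subspaces = mask_guided_height_sampling(h, f, bins)
# Each pixel's features survive in exactly one subspace:
# pixels (0,0) and (1,1) in bin 0, (0,1) in bin 1, (1,0) in bin 2.
```

Because the intervals partition the height range, each feature vector lands in exactly one subspace; in the paper each subspace is then projected into its own set of 3D grids rather than one height-ambiguous volume.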