Camera-Based 3D Semantic Scene Completion With Sparse Guidance Network

Jianbiao Mei;Yu Yang;Mengmeng Wang;Junyu Zhu;Jongwon Ra;Yukai Ma;Laijian Li;Yong Liu
{"title":"Camera-Based 3D Semantic Scene Completion With Sparse Guidance Network","authors":"Jianbiao Mei;Yu Yang;Mengmeng Wang;Junyu Zhu;Jongwon Ra;Yukai Ma;Laijian Li;Yong Liu","doi":"10.1109/TIP.2024.3461989","DOIUrl":null,"url":null,"abstract":"Semantic scene completion (SSC) aims to predict the semantic occupancy of each voxel in the entire 3D scene from limited observations, which is an emerging and critical task for autonomous driving. Recently, many studies have turned to camera-based SSC solutions due to the richer visual cues and cost-effectiveness of cameras. However, existing methods usually rely on sophisticated and heavy 3D models to process the lifted 3D features directly, which are not discriminative enough for clear segmentation boundaries. In this paper, we adopt the dense-sparse-dense design and propose a one-stage camera-based SSC framework, termed SGN, to propagate semantics from the semantic-aware seed voxels to the whole scene based on spatial geometry cues. Firstly, to exploit depth-aware context and dynamically select sparse seed voxels, we redesign the sparse voxel proposal network to process points generated by depth prediction directly with the coarse-to-fine paradigm. Furthermore, by designing hybrid guidance (sparse semantic and geometry guidance) and effective voxel aggregation for spatial geometry cues, we enhance the feature separation between different categories and expedite the convergence of semantic propagation. Finally, we devise the multi-scale semantic propagation module for flexible receptive fields while reducing the computation resources. Extensive experimental results on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate the superiority of our SGN over existing state-of-the-art methods. And even our lightweight version SGN-L achieves notable scores of 14.80% mIoU and 45.45% IoU on SeamnticKITTI validation with only 12.5 M parameters and 7.16 G training memory. Code is available at \n<uri>https://github.com/Jieqianyu/SGN</uri>\n.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5468-5481"},"PeriodicalIF":0.0000,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10694710/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Semantic scene completion (SSC) aims to predict the semantic occupancy of each voxel in an entire 3D scene from limited observations, an emerging and critical task for autonomous driving. Recently, many studies have turned to camera-based SSC solutions because cameras provide richer visual cues and are more cost-effective. However, existing methods usually rely on sophisticated and heavy 3D models to directly process the lifted 3D features, which are not discriminative enough to yield clear segmentation boundaries. In this paper, we adopt a dense-sparse-dense design and propose a one-stage camera-based SSC framework, termed SGN, which propagates semantics from semantic-aware seed voxels to the whole scene based on spatial geometry cues. First, to exploit depth-aware context and dynamically select sparse seed voxels, we redesign the sparse voxel proposal network to directly process points generated by depth prediction in a coarse-to-fine paradigm. Furthermore, by designing hybrid guidance (sparse semantic and geometry guidance) and effective voxel aggregation for spatial geometry cues, we enhance the feature separation between different categories and expedite the convergence of semantic propagation. Finally, we devise a multi-scale semantic propagation module that provides flexible receptive fields while reducing computation. Extensive experimental results on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate the superiority of SGN over existing state-of-the-art methods. Even our lightweight version, SGN-L, achieves notable scores of 14.80% mIoU and 45.45% IoU on the SemanticKITTI validation set with only 12.5M parameters and 7.16G of training memory. Code is available at https://github.com/Jieqianyu/SGN.
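To make the dense-sparse-dense idea concrete, below is a minimal PyTorch-style sketch of the two stages the abstract describes: a sparse voxel proposal step that keeps only the top-k occupancy-confident seed voxels from depth-lifted features, and a dilated-convolution propagation stage that spreads seed semantics to the full volume. All module and variable names (`SparseVoxelProposal`, `SemanticPropagation`, `occ_head`, etc.) are hypothetical illustrations, not the authors' implementation; the actual architecture is in the linked repository.

```python
# Hypothetical sketch of SGN's dense-sparse-dense flow -- illustrative only,
# not the authors' code (see https://github.com/Jieqianyu/SGN for the real model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseVoxelProposal(nn.Module):
    """Sparse stage: score every voxel for occupancy and keep the
    top-k most confident ones as semantic-aware seed voxels."""
    def __init__(self, feat_dim: int, k: int = 4096):
        super().__init__()
        self.k = k
        self.occ_head = nn.Linear(feat_dim, 1)  # per-voxel occupancy logit

    def forward(self, voxel_feats: torch.Tensor):     # (N, C) voxelized depth points
        occ = self.occ_head(voxel_feats).squeeze(-1)  # (N,) occupancy scores
        seed_idx = occ.topk(min(self.k, occ.numel())).indices
        return seed_idx, occ

class SemanticPropagation(nn.Module):
    """Dense stage: residual 3D convolutions with increasing dilation
    give multi-scale receptive fields at modest compute cost."""
    def __init__(self, c: int, n_cls: int, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv3d(c, c, 3, padding=d, dilation=d) for d in dilations)
        self.cls_head = nn.Conv3d(c, n_cls, 1)

    def forward(self, vol: torch.Tensor):   # (B, C, X, Y, Z)
        for conv in self.branches:
            vol = vol + F.relu(conv(vol))   # propagate seed semantics outward
        return self.cls_head(vol)           # (B, n_cls, X, Y, Z) voxel logits

# Usage sketch: scatter seed features into an empty volume, then densify.
B, C, X, Y, Z, n_cls = 1, 32, 64, 64, 8, 20
feats = torch.randn(X * Y * Z, C)                # stand-in for lifted 2D features
seed_idx, _ = SparseVoxelProposal(C, k=1024)(feats)

vol = torch.zeros(B, C, X * Y * Z)
vol[:, :, seed_idx] = feats[seed_idx].t()        # keep seeds, zero elsewhere
logits = SemanticPropagation(C, n_cls)(vol.view(B, C, X, Y, Z))
```

The paper's full pipeline additionally injects hybrid (sparse semantic and geometry) guidance and derives features from learned depth prediction rather than random tensors; the sketch only shows where the sparse seeds sit between the two dense stages.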