Tianheng Cheng, Haoyi Jiang, Shaoyu Chen, Bencheng Liao, Qian Zhang, Wenyu Liu, Xinggang Wang
{"title":"通过双边体素变换器学习精确的单目三维体素表征","authors":"Tianheng Cheng , Haoyi Jiang , Shaoyu Chen , Bencheng Liao , Qian Zhang , Wenyu Liu , Xinggang Wang","doi":"10.1016/j.imavis.2024.105237","DOIUrl":null,"url":null,"abstract":"<div><p>Vision-based methods for 3D scene perception have been widely explored for autonomous vehicles. However, inferring complete 3D semantic scenes from monocular 2D images is still challenging owing to the 2D-to-3D transformation. Specifically, existing methods that use Inverse Perspective Mapping (IPM) to project image features to dense 3D voxels severely suffer from the ambiguous projection problem. In this research, we present <strong>Bilateral Voxel Transformer</strong> (BVT), a novel and effective Transformer-based approach for monocular 3D semantic scene completion. BVT exploits a bilateral architecture composed of two branches for preserving the high-resolution 3D voxel representation while aggregating contexts through the proposed Tri-Axial Transformer simultaneously. To alleviate the ill-posed 2D-to-3D transformation, we adopt position-aware voxel queries and dynamically update the voxels with image features through weighted geometry-aware sampling. BVT achieves 11.8 mIoU on the challenging Semantic KITTI dataset, considerably outperforming previous works for semantic scene completion with monocular images. The code and models of BVT will be available on <span><span>GitHub</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105237"},"PeriodicalIF":4.2000,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Learning accurate monocular 3D voxel representation via bilateral voxel transformer\",\"authors\":\"Tianheng Cheng , Haoyi Jiang , Shaoyu Chen , Bencheng Liao , Qian Zhang , Wenyu Liu , Xinggang Wang\",\"doi\":\"10.1016/j.imavis.2024.105237\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Vision-based methods for 3D scene perception have been widely explored for autonomous vehicles. However, inferring complete 3D semantic scenes from monocular 2D images is still challenging owing to the 2D-to-3D transformation. Specifically, existing methods that use Inverse Perspective Mapping (IPM) to project image features to dense 3D voxels severely suffer from the ambiguous projection problem. In this research, we present <strong>Bilateral Voxel Transformer</strong> (BVT), a novel and effective Transformer-based approach for monocular 3D semantic scene completion. BVT exploits a bilateral architecture composed of two branches for preserving the high-resolution 3D voxel representation while aggregating contexts through the proposed Tri-Axial Transformer simultaneously. To alleviate the ill-posed 2D-to-3D transformation, we adopt position-aware voxel queries and dynamically update the voxels with image features through weighted geometry-aware sampling. BVT achieves 11.8 mIoU on the challenging Semantic KITTI dataset, considerably outperforming previous works for semantic scene completion with monocular images. 
The code and models of BVT will be available on <span><span>GitHub</span><svg><path></path></svg></span>.</p></div>\",\"PeriodicalId\":50374,\"journal\":{\"name\":\"Image and Vision Computing\",\"volume\":\"150 \",\"pages\":\"Article 105237\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2024-08-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Image and Vision Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0262885624003421\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885624003421","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Learning accurate monocular 3D voxel representation via bilateral voxel transformer
Vision-based methods for 3D scene perception have been widely explored for autonomous vehicles. However, inferring complete 3D semantic scenes from monocular 2D images remains challenging because of the underlying 2D-to-3D transformation. In particular, existing methods that use Inverse Perspective Mapping (IPM) to project image features onto dense 3D voxels suffer severely from the ambiguous projection problem. In this work, we present Bilateral Voxel Transformer (BVT), a novel and effective Transformer-based approach for monocular 3D semantic scene completion. BVT adopts a bilateral architecture composed of two branches: one preserves the high-resolution 3D voxel representation while the other simultaneously aggregates context through the proposed Tri-Axial Transformer. To alleviate the ill-posed 2D-to-3D transformation, we adopt position-aware voxel queries and dynamically update the voxels with image features through weighted geometry-aware sampling. BVT achieves 11.8 mIoU on the challenging SemanticKITTI dataset, considerably outperforming previous works for semantic scene completion with monocular images. The code and models of BVT will be available on GitHub.
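To make the two mechanisms named in the abstract concrete, below is a minimal, illustrative PyTorch sketch of position-aware voxel queries and weighted geometry-aware sampling. It is not the authors' implementation (the paper states code will be released on GitHub); all module names, shapes, the offset/weight heads, and the camera projection that produces `ref_uv` are assumptions for exposition.

```python
# Illustrative sketch only -- NOT the authors' released code. Shapes, names,
# and the toy projection are assumptions; they mirror the abstract's wording:
# position-aware voxel queries + weighted geometry-aware sampling.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeometryAwareSampling(nn.Module):
    """Update voxel queries with image features sampled around their
    projected 2D locations, weighted by learned per-sample scores."""

    def __init__(self, dim: int, n_points: int = 4):
        super().__init__()
        self.n_points = n_points
        self.offset = nn.Linear(dim, n_points * 2)   # 2D offsets per sample
        self.weight = nn.Linear(dim, n_points)       # per-sample weights
        self.proj = nn.Linear(dim, dim)

    def forward(self, queries, img_feat, ref_uv):
        # queries:  (B, Nq, C)  position-aware voxel queries
        # img_feat: (B, C, H, W) 2D backbone feature map
        # ref_uv:   (B, Nq, 2)  voxel centers projected into the image,
        #           normalized to [-1, 1] (projection assumed done upstream)
        B, Nq, _ = queries.shape
        offsets = self.offset(queries).view(B, Nq, self.n_points, 2)
        weights = self.weight(queries).softmax(dim=-1)        # (B, Nq, P)
        grid = ref_uv.unsqueeze(2) + 0.05 * offsets.tanh()    # (B, Nq, P, 2)
        sampled = F.grid_sample(img_feat, grid, mode="bilinear",
                                align_corners=False)          # (B, C, Nq, P)
        sampled = sampled.permute(0, 2, 3, 1)                 # (B, Nq, P, C)
        agg = (weights.unsqueeze(-1) * sampled).sum(dim=2)    # (B, Nq, C)
        return queries + self.proj(agg)                       # residual update


# Position-aware voxel queries: a learnable embedding per voxel plus an
# encoding of each voxel's 3D center (an MLP encoding is assumed here).
n_voxels, dim = 8, 64                                  # tiny toy sizes
query_embed = nn.Embedding(n_voxels, dim)
pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
centers = torch.rand(n_voxels, 3)                      # toy voxel centers
queries = query_embed.weight + pos_mlp(centers)        # (Nv, C)

# Toy forward pass with random image features and fake projections.
module = GeometryAwareSampling(dim)
img_feat = torch.randn(1, dim, 24, 80)
ref_uv = torch.rand(1, n_voxels, 2) * 2 - 1
updated = module(queries.unsqueeze(0), img_feat, ref_uv)  # (1, Nv, C)
```

Under this reading, the softmax weights let the model down-weight geometrically implausible samples instead of averaging them in uniformly, which is one plausible way weighted sampling mitigates the ambiguous projection that IPM suffers from; the Tri-Axial Transformer branch would then run attention along the three voxel axes to aggregate context at tractable cost.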
Journal Introduction:
Image and Vision Computing has as its primary aim the provision of an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to foster a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.