MD-Mamba: Feature extractor on 3D representation with multi-view depth

Image and Vision Computing. IF 4.2; Zone 3, Computer Science; JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2025-02-01. Epub Date: 2024-12-19. DOI: 10.1016/j.imavis.2024.105396

Qihui Li, Zongtan Li, Lianfang Tian, Qiliang Du, Guoyu Lu

Image and Vision Computing, volume 154, Article 105396. Journal Article. https://www.sciencedirect.com/science/article/pii/S0262885624005018


Abstract

3D sensors provide rich depth information and are widely used across many fields, making 3D vision an active research topic. Point clouds, a key type of 3D data, offer precise three-dimensional coordinates and are used extensively in many domains, especially robotics. However, the unordered and unstructured nature of point cloud data poses a significant challenge for feature extraction. Traditional methods rely on complex, carefully designed local feature extractors, but these approaches have reached a performance bottleneck. To address these challenges, this paper introduces MD-Mamba, a novel network that enhances point cloud feature extraction by integrating multi-view depth maps. Our approach leverages multi-modal learning, treating the multi-view depth maps as an additional global feature modality. Fusing these global features with locally extracted point cloud features yields richer and more distinctive representations. The feature extraction strategy performs real projections of point clouds and treats the resulting multi-view projections as a video stream, so that a specially designed Mamba network can capture dynamic features across viewpoints. In addition, a Siamese Cluster module optimizes feature spacing, improving class differentiation. Extensive evaluations on the ModelNet40, ShapeNetPart, and ScanObjectNN datasets validate the effectiveness of MD-Mamba, setting a new benchmark for multi-modal feature extraction in point cloud analysis.
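The abstract does not specify how the multi-view depth maps are rendered, so the following is only a minimal sketch of the general idea of turning a point cloud into a "video" of depth frames. It assumes orthographic projection, viewpoints evenly spaced on a circle around the vertical axis, and a simple z-buffer; the function name `multiview_depth` and its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np


def multiview_depth(points, n_views=6, size=32):
    """Render orthographic depth maps of a point cloud from n_views angles.

    points: (N, 3) array. Returns a (n_views, size, size) float32 array in
    which empty pixels are 0 and nearer points have larger values (up to 1).
    The orbit, projection model and resolution are illustrative assumptions.
    """
    # Centre the cloud and scale it into the unit sphere so every rotated
    # view fits inside the [-1, 1] x [-1, 1] image plane.
    pts = points - points.mean(axis=0)
    pts = pts / (np.linalg.norm(pts, axis=1).max() + 1e-9)

    frames = np.zeros((n_views, size, size), dtype=np.float32)
    for v in range(n_views):
        theta = 2.0 * np.pi * v / n_views
        c, s = np.cos(theta), np.sin(theta)
        # Rotation about the y (up) axis; the camera looks along -z.
        rot = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
        p = pts @ rot.T

        # Map x to columns and y to rows (image rows grow downwards).
        cols = np.rint((p[:, 0] + 1.0) * 0.5 * (size - 1)).astype(int)
        rows = np.rint((1.0 - p[:, 1]) * 0.5 * (size - 1)).astype(int)
        cols = np.clip(cols, 0, size - 1)
        rows = np.clip(rows, 0, size - 1)

        # Closeness in (0, 1]: points with larger z are nearer the camera.
        closeness = np.clip((p[:, 2] + 1.0) * 0.5, 1e-3, 1.0)

        # z-buffer: keep the nearest point that lands on each pixel.
        np.maximum.at(frames[v], (rows, cols), closeness.astype(np.float32))
    return frames
```

The returned array has a leading view axis, so it can be fed to any sequence model in place of video frames; the Mamba network and the fusion with local point features described in the abstract are not reproduced here.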


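Source journal: Image and Vision Computing (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 8.50

Self-citation rate: 8.50%

Articles published: 143

Review time: 7.8 months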

About the journal: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.