{"title":"RSBEV-Mamba: 3-D BEV Sequence Modeling for Multiview Remote Sensing Scene Segmentation","authors":"Baihong Lin;Zhengxia Zou;Zhenwei Shi","doi":"10.1109/TGRS.2025.3543200","DOIUrl":null,"url":null,"abstract":"Multiview collaborative perception has been demonstrated to be highly effective in extracting 3-D information from remote sensing scenes by remote sensing bird’s-eye-view (RSBEV). However, inherent depth uncertainty in purely visual methods limits view fusion accuracy, and high computational complexity makes it challenging to model long sequences efficiently. To address these issues, we reformulate the BEV segmentation problem as a 3-D sequence modeling task and propose RSBEV-Mamba, a novel framework comprising a 3-D BEV module, a 3-D VMamba module, and a dense BEV contrastive learning module. The 3-D BEV module projects multiview 2-D image features into 3-D world coordinates, thus establishing a foundation for accurate spatial representation. The 3-D VMamba module, based on state-space models (SSMs), optimizes the processing of densely projected features with linear computational complexity in global 3-D spatial modeling. It incorporates a 3-D selective scanning strategy (SS3D) block with 16 scanning strategies, transforming previously ignored projections at different heights into valid 3-D sequences and enriching the contextual depth and precision of BEV encoding. By employing a contrastive learning strategy with the CLIP model, we align BEV and ground truth (GT) features within the same dimensional framework, ensuring spatial integrity after side-view projection. Our approach achieves a 4% improvement mIoU, thus reaching a score of 0.7368 on LEVIR-MDS and surpassing previous state-of-the-art methods. This establishes the 3-D VMamba module as a general model for 3-D perception tasks and sets a new benchmark in remote sensing technology.","PeriodicalId":13213,"journal":{"name":"IEEE Transactions on Geoscience and Remote Sensing","volume":"63 ","pages":"1-13"},"PeriodicalIF":8.6000,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Geoscience and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10891840/","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Abstract
Multiview collaborative perception has been demonstrated to be highly effective in extracting 3-D information from remote sensing scenes via the remote sensing bird's-eye-view (RSBEV) representation. However, the inherent depth uncertainty of purely visual methods limits view-fusion accuracy, and high computational complexity makes it challenging to model long sequences efficiently. To address these issues, we reformulate the BEV segmentation problem as a 3-D sequence modeling task and propose RSBEV-Mamba, a novel framework comprising a 3-D BEV module, a 3-D VMamba module, and a dense BEV contrastive learning module. The 3-D BEV module projects multiview 2-D image features into 3-D world coordinates, establishing a foundation for accurate spatial representation. The 3-D VMamba module, based on state-space models (SSMs), processes the densely projected features with linear computational complexity for global 3-D spatial modeling. It incorporates a 3-D selective scanning strategy (SS3D) block with 16 scanning strategies, transforming previously ignored projections at different heights into valid 3-D sequences and enriching the contextual depth and precision of BEV encoding. By employing a contrastive learning strategy with the CLIP model, we align BEV and ground truth (GT) features within the same dimensional framework, ensuring spatial integrity after side-view projection. Our approach achieves a 4% improvement in mIoU, reaching a score of 0.7368 on LEVIR-MDS and surpassing previous state-of-the-art methods. This establishes the 3-D VMamba module as a general model for 3-D perception tasks and sets a new benchmark in remote sensing technology.
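To make the core idea of the SS3D block more concrete, the sketch below illustrates (under stated assumptions, not the authors' implementation) how a 3-D BEV feature volume can be flattened into multiple 1-D sequences for Mamba-style selective scanning. The function name `scan_volume` and the particular enumeration of scan orderings are illustrative assumptions; the paper's SS3D block defines 16 scanning strategies whose exact construction is not reproduced here.

```python
# Minimal sketch: turning a 3-D feature volume into 1-D sequences that a
# 1-D state-space model (e.g., a Mamba block) could consume. This is an
# illustration of multi-direction 3-D scanning, not the paper's SS3D code.
import itertools
import torch


def scan_volume(feats: torch.Tensor) -> list[torch.Tensor]:
    """Enumerate scan orderings of a (C, X, Y, Z) feature volume.

    Each ordering permutes the three spatial axes and optionally reverses
    the traversal, yielding a (C, X*Y*Z) sequence.
    """
    c = feats.shape[0]
    sequences = []
    for perm in itertools.permutations((1, 2, 3)):        # 6 axis orderings
        vol = feats.permute(0, *perm).reshape(c, -1)       # flatten spatial dims
        sequences.append(vol)                              # forward traversal
        sequences.append(torch.flip(vol, dims=[-1]))       # reversed traversal
    return sequences                                       # 12 sequences in this sketch


if __name__ == "__main__":
    bev = torch.randn(64, 8, 32, 32)    # toy (C, Z, H, W) BEV volume
    seqs = scan_volume(bev)
    print(len(seqs), seqs[0].shape)     # 12 sequences of shape (64, 8192)
```

In a full model, each generated sequence would be fed through a selective-scan (SSM) layer and the per-direction outputs merged back into the 3-D grid; this is how linear-complexity sequence models can capture global 3-D context without attention's quadratic cost.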
Journal Introduction:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.