Title: A Lightweight Semantic Segmentation Network Based on Self-Attention Mechanism and State Space Model for Efficient Urban Scene Segmentation
Authors: Langping Li; Jizheng Yi; Hui Fan; Hui Lin
Journal: IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1-15
Published: 2025-04-18
DOI: 10.1109/TGRS.2025.3562185
URL: https://ieeexplore.ieee.org/document/10969832/
Impact factor: 8.6 (JCR Q1, Engineering, Electrical & Electronic)
Region category: Earth Science (region 1)
Open access: no
Citations: 0
Abstract
In the semantic segmentation of remote sensing images, methods based on convolutional neural networks (CNNs) and Transformers have been extensively studied. However, CNNs struggle to capture global context because their feature extraction is local, while Transformers are constrained by the quadratic computational complexity of self-attention. Recently, Mamba-based state space models have attracted a great deal of interest, but existing Mamba-based methods do not adequately account for the importance of local information in remote sensing image segmentation. In this article, an encoder-decoder network, UMFormer, is constructed for the semantic segmentation of remote sensing images. Specifically, UMFormer employs ResNet18 as the encoder to perform preliminary image feature extraction. A self-attention mechanism is then optimized to extract global information about objects of different sizes under multiscale conditions. To fuse the encoder and decoder feature maps, another attention structure is built to reconstruct spatial information and capture relative position relationships. Finally, a Mamba-based decoder is designed to model both global and local information effectively, and a feature fusion mechanism based on feature similarity is devised to embed local information into the global information. Experiments on the UAV Imagery Dataset (UAVid), Vaihingen, and Potsdam datasets demonstrate that the proposed UMFormer improves accuracy while maintaining an efficient running speed. The code will be freely available at: https://github.com/takeyoutime/UMFormer
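The abstract contrasts the quadratic cost of self-attention with state space models. The toy sketch below illustrates only that scaling difference on a 1-D scalar sequence. It is not UMFormer's code and not Mamba's selective scan. The function names `attention_1d` and `ssm_scan_1d` and the parameters `a`, `b`, `c` are hypothetical, chosen for illustration.

```python
import math


def attention_1d(x):
    """Toy single-head self-attention on a scalar sequence (query = key = value = x).

    Every position scores every other position, so the number of
    pairwise scores is len(x) ** 2.
    """
    n = len(x)
    out = []
    pair_count = 0
    for i in range(n):
        scores = [x[i] * x[j] for j in range(n)]
        pair_count += n
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, x)) / total)
    return out, pair_count


def ssm_scan_1d(x, a=0.9, b=1.0, c=1.0):
    """Toy discrete linear state space recurrence (assumed fixed scalars a, b, c):

        h_t = a * h_{t-1} + b * x_t
        y_t = c * h_t

    One state update per step, so the work is linear in len(x), yet
    each output still depends on all earlier inputs through h.
    """
    h = 0.0
    ys = []
    steps = 0
    for x_t in x:
        h = a * h + b * x_t
        ys.append(c * h)
        steps += 1
    return ys, steps


if __name__ == "__main__":
    seq = [0.1 * i for i in range(8)]
    _, pairs = attention_1d(seq)
    _, steps = ssm_scan_1d(seq)
    print(pairs, steps)  # prints "64 8"
```

Doubling the sequence length quadruples the attention score count but only doubles the number of recurrence steps. That difference is the efficiency argument the abstract makes for a state-space decoder.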
About the journal:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.