UM2Former: U-Shaped Multimixed Transformer Network for Large-Scale Hyperspectral Image Semantic Segmentation
Authors: Aijun Xu; Zhaohui Xue; Ziyu Li; Shun Cheng; Hongjun Su; Junshi Xia
Journal: IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1-21
Publication date: 2025-02-19
Publication type: Journal Article
DOI: 10.1109/TGRS.2025.3543821
URL: https://ieeexplore.ieee.org/document/10892222/
Journal impact factor: 8.6 (JCR Q1, Engineering, Electrical & Electronic)
Open access: no
Citations: 0
Abstract
Transformer-based deep learning (DL) methods have increasingly been adopted for remote sensing (RS) image semantic segmentation owing to their strong global modeling capability. Nevertheless, Transformer-based DL methods have not yet been sufficiently explored for large-scale hyperspectral image (HSI) semantic segmentation. Current algorithms do not comprehensively consider the impact of positional encoding (PE) interpolation when constructing Transformer-based decoders. Moreover, existing segmentation heads usually concatenate multiscale features directly to produce the segmentation, which ignores the inherent semantic differences between those features. To address these issues, a U-shaped multimixed Transformer network (UM2Former) is proposed for large-scale HSI semantic segmentation. First, a weight encoder consisting of two modules, the overlap-down and the channel-weight, is built to extract hierarchical discriminative spectral-spatial features and reduce spectral redundancy. Second, the proposed multimixed Transformer block (MMTB) develops a PE-free module, the spatial-feature-retention attention (SFRA) mechanism, in which “multimixed” refers to modeling the global dependency of each pixel on the retained average spatial characteristics of different locations in the input feature maps. Finally, a linear fuse segmentation head (LFSH) is designed to align semantic information among multiscale feature maps and achieve accurate segmentation. Experiments were conducted on individual cities and on the entire large-scale WHU-OHS HSI dataset. The segmentation results indicated that the proposed method achieved higher accuracy than existing semantic segmentation methods, with performance improvements of 17.80% and 4.16% in terms of mean intersection over union (mIoU) and overall accuracy (OA), respectively. The source code will be available at https://github.com/ZhaohuiXue/UM2Former.
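The two metrics reported above, mIoU and OA, can both be derived from a confusion matrix. The following is a minimal sketch using the standard textbook definitions. It is not taken from the authors' code, and the paper's exact evaluation protocol (for example, how unlabeled pixels are handled) is not stated in the abstract. The function names and the toy labels are illustrative assumptions.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Build a confusion matrix: rows are reference classes, columns are predictions."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def overall_accuracy(cm: List[List[int]]) -> float:
    """OA: correctly labeled pixels divided by all labeled pixels."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0


def mean_iou(cm: List[List[int]]) -> float:
    """mIoU: per-class TP / (TP + FP + FN), averaged over the classes that occur."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # reference pixels of class c that were missed
        fp = sum(cm[r][c] for r in range(n)) - tp   # pixels wrongly predicted as class c
        denom = tp + fp + fn
        if denom:  # assumption: skip classes absent from both reference and prediction
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical toy labels for six pixels and three classes (not from WHU-OHS).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(overall_accuracy(cm), mean_iou(cm))
```

OA weights every pixel equally, so frequent classes dominate it, whereas mIoU averages over classes and penalizes errors on rare classes more heavily. That difference is one plausible reason the two reported improvements differ so much in size.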
About the journal:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.