SoftFormer: SAR-optical fusion transformer for urban land use and land cover classification

IF 12.2 | Zone 1 (Earth Sciences) | Q1 GEOGRAPHY, PHYSICAL | ISPRS Journal of Photogrammetry and Remote Sensing | Pub Date: 2024-12-01 | Epub Date: 2024-09-20 | DOI: 10.1016/j.isprsjprs.2024.09.012

Rui Liu, Jing Ling, Hongsheng Zhang

ISPRS Journal of Photogrammetry and Remote Sensing, Volume 218, Pages 277-293. Article page: https://www.sciencedirect.com/science/article/pii/S0924271624003502

Citations: 0

Abstract

Classification of urban land use and land cover is vital to many applications and has naturally become a popular topic in remote sensing. The limited information carried by unimodal data, compound land use types, and the poor signal-to-noise ratio caused by restrictive weather conditions inevitably lead to relatively poor classification performance. Multimodal data fusion with deep learning has recently gained a great deal of attention in the remote sensing community. Existing research integrates multimodal data at a single level and has not explored the potential of popular transformer and CNN structures for leveraging multimodal data effectively, which can leave the information fusion inadequate. We introduce SoftFormer, a novel network that combines the strengths of CNNs and transformers and achieves multi-level fusion. To extract local features from images, we propose a mechanism called Interior Self-Attention, which is integrated into the backbone network. To fully exploit the global semantic information from both modalities, we introduce a joint key–value learning fusion approach at the feature level that integrates multimodal data within a unified semantic space. Decision-level and feature-level information are integrated simultaneously, resulting in a multi-level fusion transformer network.
Results on four remote sensing datasets show that SoftFormer achieves improvements of at least 1.32%, 0.7%, and 0.99% in overall accuracy, kappa index, and mIoU, respectively, compared with other state-of-the-art methods. The ablation studies show that multimodal fusion outperforms unimodal data on urban land use and land cover classification, with the highest improvements in overall accuracy, kappa index, and mIoU reaching 5.71%, 10.32%, and 7.91%, respectively, and that the proposed modules boost performance to some extent, even with cloud cover. Code will be publicly available at https://github.com/rl1024/SoftFormer.
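
The three metrics reported above (overall accuracy, kappa index, and mIoU) can all be derived from a single confusion matrix. The sketch below uses the standard textbook definitions and is not taken from the SoftFormer code base; the function name `classification_metrics` and the toy two-class matrix are assumptions made for illustration, with rows taken as reference labels and columns as predictions.

```python
import numpy as np


def classification_metrics(conf):
    """Return (overall accuracy, Cohen's kappa, mean IoU) for a confusion matrix.

    Hypothetical helper: rows are assumed to be reference classes and
    columns predicted classes. Kappa is undefined when chance agreement is 1.
    """
    conf = np.asarray(conf, dtype=np.float64)
    total = conf.sum()
    diag = np.diag(conf)        # correctly classified samples per class
    row = conf.sum(axis=1)      # reference totals per class
    col = conf.sum(axis=0)      # predicted totals per class

    # Overall accuracy: fraction of samples on the diagonal.
    oa = diag.sum() / total

    # Cohen's kappa: agreement corrected for chance agreement pe.
    pe = (row * col).sum() / total**2
    kappa = (oa - pe) / (1.0 - pe)

    # Mean IoU: per-class intersection over union, averaged over
    # classes that occur in either the reference or the prediction.
    union = row + col - diag
    present = union > 0
    miou = (diag[present] / union[present]).mean()
    return oa, kappa, miou


# Toy two-class example (made-up counts, not results from the paper).
oa, kappa, miou = classification_metrics([[50, 10], [5, 35]])
print(f"OA={oa:.4f}  kappa={kappa:.4f}  mIoU={miou:.4f}")
```

Because kappa subtracts the agreement expected by chance, it is always at or below overall accuracy for the same matrix, which makes it the more conservative figure on class-imbalanced urban scenes.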

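Source journal: ISPRS Journal of Photogrammetry and Remote Sensing (Engineering and Technology: Imaging Science and Photographic Technology)
CiteScore: 21.00
Self-citation rate: 6.30%
Articles published: 273
Review time: 40 days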

Journal description: The ISPRS Journal of Photogrammetry and Remote Sensing (P&RS) serves as the official journal of the International Society for Photogrammetry and Remote Sensing (ISPRS). It acts as a platform for scientists and professionals worldwide who are involved in various disciplines that utilize photogrammetry, remote sensing, spatial information systems, computer vision, and related fields. The journal aims to facilitate communication and dissemination of advancements in these disciplines, while also acting as a comprehensive source of reference and archive. P&RS endeavors to publish high-quality, peer-reviewed research papers that are preferably original and have not been published before. These papers can cover scientific/research, technological development, or application/practical aspects. Additionally, the journal welcomes papers that are based on presentations from ISPRS meetings, as long as they are considered significant contributions to the aforementioned fields. In particular, P&RS encourages the submission of papers that are of broad scientific interest, showcase innovative applications (especially in emerging fields), have an interdisciplinary focus, discuss topics that have received limited attention in P&RS or related journals, or explore new directions in scientific or professional realms. It is preferred that theoretical papers include practical applications, while papers focusing on systems and applications should include a theoretical background.