DF2RQ: Dynamic Feature Fusion via Region-Wise Queries for Semantic Segmentation of Multimodal Remote Sensing Data
Authors: Shiyang Feng, Zhaowei Li, Bo Zhang, Bin Wang
DOI: 10.1109/TGRS.2025.3526247
Journal: IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1-15
Published: 2025-01-06
Article page: https://ieeexplore.ieee.org/document/10829634/
Citations: 0
Abstract
Although remote sensing (RS) data with multiple modalities can significantly improve the accuracy of semantic segmentation (SS), effectively extracting multimodal information through multimodal feature fusion remains challenging. Specifically, existing multimodal feature fusion methods face two major challenges: 1) owing to the diverse imaging mechanisms of multimodal RS data, the boundaries of the same foreground may vary across modalities, so unwanted background semantics are included in the fused foreground features; and 2) RS data from different modalities exhibit varying discriminative ability for different foregrounds, making it difficult to determine the proportion of semantic information each modality should contribute to the fusion result. To address these issues, we propose DF2RQ, a dynamic feature fusion method based on region-wise queries, for SS of multimodal RS data. The method comprises two components: a spatial reconstruction (SR) module and a dynamic fusion (DF) module. The SR module samples foreground features from each modality and reconstructs each unimodal feature independently, thereby alleviating the semantic mixing between foreground and background across modalities. The DF module uses a fusion scheme based on unimodal feature reference positions to obtain fusion weights for each modality, enabling dynamic fusion of complementary features from multiple modalities. The proposed method has been extensively evaluated on multiple multimodal RS datasets for SS, and the experimental results consistently show that it achieves state-of-the-art (SOTA) accuracy on several commonly used metrics. Our code is available at https://github.com/I3ab/DF2RQ.
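The abstract does not specify the internals of the DF module; for readers unfamiliar with dynamic fusion, the following is a minimal, purely illustrative sketch of the general idea — computing spatially varying, per-modality fusion weights and taking a weighted sum of the modality features. The scoring projection here is a hypothetical stand-in for whatever learned layer the paper actually uses.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_fusion(feats, score_weights):
    """Fuse per-modality feature maps with spatially varying weights.

    feats: list of M arrays, each (C, H, W) -- one per modality.
    score_weights: (M, C) projection that scores each modality at
        every spatial position (stand-in for a learned layer; the
        actual DF module conditions on reference positions).
    Returns a fused (C, H, W) feature map.
    """
    stacked = np.stack(feats)                       # (M, C, H, W)
    # One scalar score per modality per position: (M, H, W).
    scores = np.einsum('mc,mchw->mhw', score_weights, stacked)
    # Weights sum to 1 over the modality axis at each position.
    weights = softmax(scores, axis=0)               # (M, H, W)
    return (weights[:, None] * stacked).sum(axis=0)

rng = np.random.default_rng(0)
# Two modalities (e.g. optical + SAR), 8 channels, 4x4 spatial grid.
feats = [rng.normal(size=(8, 4, 4)) for _ in range(2)]
fused = dynamic_fusion(feats, rng.normal(size=(2, 8)))
print(fused.shape)  # (8, 4, 4)
```

Because the weights form a convex combination at each position, the fused value always lies between the corresponding modality features, so no single modality is discarded outright; its contribution is merely down-weighted where the other modality is more discriminative.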
Journal description:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.