{"title":"A Multimodal Unified Representation Learning Framework With Masked Image Modeling for Remote Sensing Images","authors":"Dakuan Du;Tianzhu Liu;Yanfeng Gu","doi":"10.1109/TGRS.2024.3494244","DOIUrl":null,"url":null,"abstract":"The coordinated utilization of diverse types of satellite sensors provides a more comprehensive view of the Earth’s surface. However, due to the significant heterogeneity across modalities and the scarcity of high-quality labels, most existing methods face bottlenecks in the underutilization of massive unlabeled multimodal satellite data, making it challenging to understand the scene comprehensively. To this end, we propose a multimodal unified representation learning framework (MURLF) based on masked image modeling (MIM) for remote sensing (RS) images, aiming to make better use of massive unlabeled multimodal RS data. MURLF leverages the consistency and complementarity relationships among modalities to extract both common and distinctive features, mitigating the challenges faced by encoders due to significant heterogeneity across various data types. In addition, MURLF uses multilevel masking independently across different modalities, using visual tokens both within the same modality and across modalities to jointly recover masked pixels as the pretext task, facilitating comprehensive cross-modal information interaction. Furthermore, we design a preselected sensor-specific feature extractor (PSFE) to exploit the heterogeneous characteristics of various data sources, thereby extracting discriminative features. By integrating the multistage PSFE with the ViT backbone, MURLF can naturally extract multimodal hierarchical representations for downstream tasks, fully preserving valuable information from each modality. The proposed MURLF is not restricted to multimodal inputs but also supports single-modal inputs during the fine-tuning stage, significantly broadening the framework’s application. Extensive experiments across multiple tasks demonstrate the superiority of the proposed MURLF compared with several advanced multimodal models. The code will be released soon.","PeriodicalId":13213,"journal":{"name":"IEEE Transactions on Geoscience and Remote Sensing","volume":"62 ","pages":"1-16"},"PeriodicalIF":7.5000,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Geoscience and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10756791/","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
The coordinated utilization of diverse types of satellite sensors provides a more comprehensive view of the Earth’s surface. However, owing to the significant heterogeneity across modalities and the scarcity of high-quality labels, most existing methods underutilize the massive volume of unlabeled multimodal satellite data, making comprehensive scene understanding challenging. To this end, we propose a multimodal unified representation learning framework (MURLF) based on masked image modeling (MIM) for remote sensing (RS) images, aiming to make better use of massive unlabeled multimodal RS data. MURLF leverages the consistency and complementarity among modalities to extract both common and modality-distinctive features, mitigating the difficulty encoders face with highly heterogeneous data types. In addition, MURLF applies multilevel masking independently to each modality and, as its pretext task, uses visual tokens both within and across modalities to jointly recover the masked pixels, facilitating comprehensive cross-modal information interaction. Furthermore, we design a preselected sensor-specific feature extractor (PSFE) that exploits the heterogeneous characteristics of the various data sources to extract discriminative features. By integrating the multistage PSFE with the ViT backbone, MURLF naturally extracts multimodal hierarchical representations for downstream tasks, fully preserving the valuable information in each modality. MURLF is not restricted to multimodal inputs; it also supports single-modal inputs during fine-tuning, significantly broadening the framework’s applicability. Extensive experiments across multiple tasks demonstrate the superiority of MURLF over several advanced multimodal models. The code will be released soon.
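Since the authors' code has not yet been released, the following is a minimal, hypothetical PyTorch sketch of the pretraining idea as the abstract describes it: per-modality sensor-specific embedders (standing in for the PSFE), masking applied independently to each modality, a shared ViT-style encoder attending jointly over the tokens of all modalities, and per-modality heads that reconstruct the masked pixels. All class names, dimensions, the single-level random mask (the paper's masking is multilevel), and the single-stage embedder (the paper's PSFE is multistage) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of cross-modal masked image modeling (MIM) pretraining,
# reconstructed from the abstract only; not the official MURLF code.
# Positional/modality embeddings are omitted for brevity.
import torch
import torch.nn as nn


class SensorSpecificEmbed(nn.Module):
    """Per-modality patch embedding (a stand-in for the paper's PSFE)."""

    def __init__(self, in_chans, patch=16, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch, stride=patch)

    def forward(self, x):  # x: (B, C, H, W)
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, N, dim)


class MultimodalMIM(nn.Module):
    def __init__(self, modal_chans=(4, 2), patch=16, dim=256, depth=4):
        super().__init__()
        self.patch = patch
        self.embeds = nn.ModuleList(
            [SensorSpecificEmbed(c, patch, dim) for c in modal_chans])
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)  # shared ViT-style trunk
        # one lightweight pixel-reconstruction head per modality
        self.heads = nn.ModuleList(
            [nn.Linear(dim, patch * patch * c) for c in modal_chans])
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, xs, mask_ratio=0.6):
        tokens, masks = [], []
        for x, embed in zip(xs, self.embeds):
            t = embed(x)  # (B, N, dim)
            m = torch.rand(t.shape[:2], device=t.device) < mask_ratio
            # replace masked patches with a learnable mask token
            t = torch.where(m.unsqueeze(-1), self.mask_token, t)
            tokens.append(t)
            masks.append(m)
        # joint attention over the concatenated sequences lets masked patches
        # in one modality be recovered from visible patches in another
        z = self.encoder(torch.cat(tokens, dim=1))
        loss, start = 0.0, 0
        for x, m, head in zip(xs, masks, self.heads):
            n = m.shape[1]
            pred = head(z[:, start:start + n])  # (B, N, patch*patch*C)
            # ground-truth pixels of each patch, flattened to match pred
            target = nn.functional.unfold(
                x, self.patch, stride=self.patch).transpose(1, 2)
            loss = loss + ((pred - target) ** 2).mean(-1)[m].mean()
            start += n
        return loss


# Toy usage: a 4-band optical patch paired with a 2-band SAR patch.
model = MultimodalMIM(modal_chans=(4, 2))
optical = torch.rand(2, 4, 64, 64)
sar = torch.rand(2, 2, 64, 64)
loss = model([optical, sar])
loss.backward()
```

The design choice to attend over the concatenated token sequence of all modalities is what enables the cross-modal recovery the abstract emphasizes: a masked patch in one modality can borrow context from visible patches in another, rather than relying on intra-modal context alone.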
About the Journal:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.