Lilu Zhu;Xiaolu Su;Jiaxuan Tang;Yanfeng Hu;Yang Wang
Title: View-Based Knowledge-Augmented Multimodal Semantic Understanding for Optical Remote Sensing Images
Journal: IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1-33
Published: 2025-01-21
DOI: 10.1109/TGRS.2025.3532349
URL: https://ieeexplore.ieee.org/document/10848141/
Impact factor: 8.6 (JCR Q1, Engineering, Electrical & Electronic)
Citation count: 0
Abstract
Optical remote sensing (RS) images are a pivotal source of geographic information. With the continuous development of deep learning, public demand for multisource optical RS has shifted from recognizing and acquiring explicit features to comprehending and applying the fine-grained semantics and relationships implied in images. To address this challenge, we propose a semantic augmentation approach that integrates a multiview knowledge graph for comprehensive understanding of optical RS images (RSMVKF). RSMVKF builds structured representations of external knowledge from different human-like cognitive views and exploits multiple modalities and granularities to discover high-level features. Specifically, RSMVKF consists of two stages. First, we guide a large language model (LLM) to condense relevant knowledge from lengthy external knowledge passages and generate a view-level knowledge graph (RS-VKG). Then, an asymmetric multimodal contrastive network model (RS-M2CL) is designed to perform efficient semantic augmentation, using two types of contrastive loss functions, cross-modal and cross-granularity, to improve the understanding of implicit knowledge. Experimental results show that RSMVKF improves several feature-rich perception and reasoning tasks on optical RS imagery. In perception tasks such as fine-grained object detection and k-nearest neighbor (KNN) retrieval, RSMVKF yields improvements of 6.7% and 8.1%, respectively. In knowledge-driven reasoning tasks such as RS image captioning (RSCP), RS visual grounding (RSVG), and RS visual question answering (RSVQA), RSMVKF achieves superior performance, with margins of 8.9%, 5.3%, and 11.4%, respectively.
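The abstract does not give the form of the cross-modal and cross-granularity contrastive losses. As a rough illustration only, the sketch below uses a generic InfoNCE-style loss, which is a common choice for contrastive learning and is an assumption here, not the RS-M2CL formulation. The function names, the temperature value, the weighting `alpha`, and the idea of pairing global with region-level embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.07):
    """Generic InfoNCE loss (assumed form, not taken from the paper).

    anchors[i] is expected to match positives[i]; every other positives[j]
    in the batch acts as a negative for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def combined_loss(image_emb, text_emb, global_emb, region_emb, alpha=0.5):
    """Hypothetical weighted sum of a cross-modal and a cross-granularity term."""
    cross_modal = 0.5 * (info_nce(image_emb, text_emb) + info_nce(text_emb, image_emb))
    cross_granularity = info_nce(global_emb, region_emb)
    return alpha * cross_modal + (1.0 - alpha) * cross_granularity


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
    texts_swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned:", info_nce(images, texts_aligned))
    print("swapped:", info_nce(images, texts_swapped))
```

In this toy setup, correctly paired embeddings give a loss near zero and swapped pairs give a large loss. That is the behaviour a contrastive objective relies on to pull matched image and knowledge representations together.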
About the journal:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.