Peirong Zhang;Yidan Zhang;Hui Wu;Xiaoxuan Liu;Yingyan Hou;Lei Wang
{"title":"Language-Guided Object Localization via Refined Spotting Enhancement in Remote Sensing Imagery","authors":"Peirong Zhang;Yidan Zhang;Hui Wu;Xiaoxuan Liu;Yingyan Hou;Lei Wang","doi":"10.1109/TGRS.2025.3562439","DOIUrl":null,"url":null,"abstract":"Language-guided remote sensing image object localization uses intuitive natural language interactions to locate objects of interest within satellite or drone imagery, and has a wide range of practical applications. Early research on this task typically used discriminative models that relied on predefined task heads, limited to locating a single object and lacking flexibility. In recent years, multimodal large language models (MLLMs)-based generative models leverage their language understanding ability to comprehend more complex language references and provide flexible outputs. However, these models tend to be too large and have limited accuracy in spotting dense, small objects in remote sensing scenarios. The reasons are: 1) generative models treat continuous coordinate prediction as a token classification problem, which fails to reflect the actual localization gap and 2) current methods typically use CLIP pre-trained encoders that align text-image pairs at a global level, and may not capture the fine-grained semantic object information necessary for accurate localization. To address these shortcomings, we introduce a lightweight generative model named the localization model with refined spotting enhancement (LM-RSE). Refined spotting enhancement includes the refinement of bounding box (BBox) outputs and the integration of fine-grained semantic features. Specifically, we design a BBox refinement (BBR) approach that includes a special token, a BBox decoder, and a custom regression loss function to refine the spotting precision and optimize the training process. 
Additionally, we propose a fine-grained semantic integration (FGSI) strategy, integrates a fine-grained image encoder, a vision-language semantic processing (VLSP) layer, and a two-stage, full-parameter training strategy, all working together to effectively enhance the granularity of spotting. Building on this, we further refine the language-guided object localization task into two types: expression-guided and class-guided. For the former, we utilize the RSVGD dataset for evaluation and achieved state-of-the-art performance; for the latter, our evaluation results surpassed those of GeoChat. Our code and checkpoints will be released at <uri>https://github.com/Zhang-Peirong/LM-RSE</uri>.","PeriodicalId":13213,"journal":{"name":"IEEE Transactions on Geoscience and Remote Sensing","volume":"63 ","pages":"1-15"},"PeriodicalIF":8.6000,"publicationDate":"2025-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Geoscience and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10970013/","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Language-guided remote sensing image object localization uses intuitive natural-language interaction to locate objects of interest within satellite or drone imagery, and has a wide range of practical applications. Early research on this task typically used discriminative models that relied on predefined task heads, which were limited to locating a single object and lacked flexibility. In recent years, generative models based on multimodal large language models (MLLMs) have leveraged their language understanding to comprehend more complex language references and provide flexible outputs. However, these models tend to be very large and have limited accuracy in spotting dense, small objects in remote sensing scenarios. The reasons are: 1) generative models treat continuous coordinate prediction as a token classification problem, which fails to reflect the actual localization gap, and 2) current methods typically use CLIP pre-trained encoders that align text-image pairs at a global level and may not capture the fine-grained semantic object information necessary for accurate localization. To address these shortcomings, we introduce a lightweight generative model named the localization model with refined spotting enhancement (LM-RSE). Refined spotting enhancement comprises the refinement of bounding box (BBox) outputs and the integration of fine-grained semantic features. Specifically, we design a BBox refinement (BBR) approach that includes a special token, a BBox decoder, and a custom regression loss function to refine spotting precision and optimize the training process. Additionally, we propose a fine-grained semantic integration (FGSI) strategy that integrates a fine-grained image encoder, a vision-language semantic processing (VLSP) layer, and a two-stage, full-parameter training strategy, all working together to effectively enhance the granularity of spotting.
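The abstract argues that regressing coordinates reflects the localization gap, whereas token classification is all-or-nothing per coordinate token. The paper's BBR loss is not specified here, so as a rough illustration only, the sketch below implements a common box-regression loss of the same family (L1 + GIoU); the function names and weights are illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch of a box-regression loss (L1 + GIoU), NOT the
# paper's BBR loss: it shows how a near-miss prediction incurs a small
# loss, unlike token classification, which scores any wrong token equally.

def iou_and_giou(b1, b2):
    """Boxes as (x1, y1, x2, y2), normalized to [0, 1]."""
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    union = a1 + a2 - inter
    iou = inter / union if union > 0 else 0.0
    # smallest box enclosing both; penalizes disjoint predictions
    cx1, cy1 = min(b1[0], b2[0]), min(b1[1], b2[1])
    cx2, cy2 = max(b1[2], b2[2]), max(b1[3], b2[3])
    c = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (c - union) / c if c > 0 else iou
    return iou, giou

def bbox_regression_loss(pred, target, l1_weight=1.0, giou_weight=1.0):
    """Loss shrinks smoothly as the predicted box approaches the target."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target)) / 4.0
    _, giou = iou_and_giou(pred, target)
    return l1_weight * l1 + giou_weight * (1.0 - giou)
```

A perfect prediction yields zero loss, and a slightly offset box yields a strictly smaller loss than a distant one, which is exactly the gradient signal token classification cannot provide.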
Building on this, we further refine the language-guided object localization task into two types: expression-guided and class-guided. For the former, we evaluate on the RSVGD dataset and achieve state-of-the-art performance; for the latter, our results surpass those of GeoChat. Our code and checkpoints will be released at https://github.com/Zhang-Peirong/LM-RSE.
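The abstract does not describe the VLSP layer's internals, but a common way to integrate fine-grained image-patch features with language features is cross-attention. As a purely illustrative assumption (not the paper's design), the sketch below shows single-head dot-product cross-attention with no learned projections, where each text feature gathers information from the patch features it best matches:

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def cross_attend(text_feats, patch_feats):
    """Each text feature attends over all patch features (single head,
    no projections; a toy stand-in for a fine-grained fusion layer)."""
    out = []
    d = len(patch_feats[0])
    for q in text_feats:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in patch_feats]
        w = softmax(scores)
        out.append([sum(wj * k[i] for wj, k in zip(w, patch_feats))
                    for i in range(d)])
    return out
```

In this toy form, a query aligned with a particular patch pulls its output toward that patch, which is the behavior a fine-grained fusion layer needs for dense, small objects that global CLIP-style alignment can miss.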
Journal introduction:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.