Peirong Zhang;Yidan Zhang;Hui Wu;Xiaoxuan Liu;Yingyan Hou;Lei Wang
{"title":"Language-Guided Object Localization via Refined Spotting Enhancement in Remote Sensing Imagery","authors":"Peirong Zhang;Yidan Zhang;Hui Wu;Xiaoxuan Liu;Yingyan Hou;Lei Wang","doi":"10.1109/TGRS.2025.3562439","DOIUrl":null,"url":null,"abstract":"Language-guided remote sensing image object localization uses intuitive natural language interactions to locate objects of interest within satellite or drone imagery, and has a wide range of practical applications. Early research on this task typically used discriminative models that relied on predefined task heads, limited to locating a single object and lacking flexibility. In recent years, multimodal large language models (MLLMs)-based generative models leverage their language understanding ability to comprehend more complex language references and provide flexible outputs. However, these models tend to be too large and have limited accuracy in spotting dense, small objects in remote sensing scenarios. The reasons are: 1) generative models treat continuous coordinate prediction as a token classification problem, which fails to reflect the actual localization gap and 2) current methods typically use CLIP pre-trained encoders that align text-image pairs at a global level, and may not capture the fine-grained semantic object information necessary for accurate localization. To address these shortcomings, we introduce a lightweight generative model named the localization model with refined spotting enhancement (LM-RSE). Refined spotting enhancement includes the refinement of bounding box (BBox) outputs and the integration of fine-grained semantic features. Specifically, we design a BBox refinement (BBR) approach that includes a special token, a BBox decoder, and a custom regression loss function to refine the spotting precision and optimize the training process. 
Additionally, we propose a fine-grained semantic integration (FGSI) strategy, integrates a fine-grained image encoder, a vision-language semantic processing (VLSP) layer, and a two-stage, full-parameter training strategy, all working together to effectively enhance the granularity of spotting. Building on this, we further refine the language-guided object localization task into two types: expression-guided and class-guided. For the former, we utilize the RSVGD dataset for evaluation and achieved state-of-the-art performance; for the latter, our evaluation results surpassed those of GeoChat. Our code and checkpoints will be released at <uri>https://github.com/Zhang-Peirong/LM-RSE</uri>.","PeriodicalId":13213,"journal":{"name":"IEEE Transactions on Geoscience and Remote Sensing","volume":"63 ","pages":"1-15"},"PeriodicalIF":8.6000,"publicationDate":"2025-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Geoscience and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10970013/","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Language-guided remote sensing image object localization uses intuitive natural-language interaction to locate objects of interest within satellite or drone imagery, and has a wide range of practical applications. Early research on this task typically used discriminative models that relied on predefined task heads, which were limited to locating a single object and lacked flexibility. In recent years, generative models based on multimodal large language models (MLLMs) have leveraged their language understanding to comprehend more complex language references and provide flexible outputs. However, these models tend to be very large and have limited accuracy in spotting dense, small objects in remote sensing scenarios. The reasons are: 1) generative models treat continuous coordinate prediction as a token classification problem, which fails to reflect the actual localization gap, and 2) current methods typically use CLIP pre-trained encoders that align text-image pairs at a global level and may not capture the fine-grained semantic object information necessary for accurate localization. To address these shortcomings, we introduce a lightweight generative model named the localization model with refined spotting enhancement (LM-RSE). Refined spotting enhancement comprises the refinement of bounding box (BBox) outputs and the integration of fine-grained semantic features. Specifically, we design a BBox refinement (BBR) approach that includes a special token, a BBox decoder, and a custom regression loss function to refine spotting precision and optimize the training process. Additionally, we propose a fine-grained semantic integration (FGSI) strategy that integrates a fine-grained image encoder, a vision-language semantic processing (VLSP) layer, and a two-stage, full-parameter training strategy, all working together to effectively enhance the granularity of spotting.
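The abstract argues that regressing coordinates reflects the localization gap, whereas token classification is all-or-nothing per coordinate token. The paper's BBR loss is not specified here, so as a rough illustration only, the sketch below implements a common box-regression loss of the same family (L1 + GIoU); the function names and weights are illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch of a box-regression loss (L1 + GIoU), NOT the
# paper's BBR loss: it shows how a near-miss prediction incurs a small
# loss, unlike token classification, which scores any wrong token equally.

def iou_and_giou(b1, b2):
    """Boxes as (x1, y1, x2, y2), normalized to [0, 1]."""
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    union = a1 + a2 - inter
    iou = inter / union if union > 0 else 0.0
    # smallest box enclosing both; penalizes disjoint predictions
    cx1, cy1 = min(b1[0], b2[0]), min(b1[1], b2[1])
    cx2, cy2 = max(b1[2], b2[2]), max(b1[3], b2[3])
    c = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (c - union) / c if c > 0 else iou
    return iou, giou

def bbox_regression_loss(pred, target, l1_weight=1.0, giou_weight=1.0):
    """Loss shrinks smoothly as the predicted box approaches the target."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target)) / 4.0
    _, giou = iou_and_giou(pred, target)
    return l1_weight * l1 + giou_weight * (1.0 - giou)
```

A perfect prediction yields zero loss, and a slightly offset box yields a strictly smaller loss than a distant one, which is exactly the gradient signal token classification cannot provide.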
Building on this, we further refine the language-guided object localization task into two types: expression-guided and class-guided. For the former, we evaluate on the RSVGD dataset and achieve state-of-the-art performance; for the latter, our results surpass those of GeoChat. Our code and checkpoints will be released at https://github.com/Zhang-Peirong/LM-RSE.
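The abstract does not describe the VLSP layer's internals, but a common way to integrate fine-grained image-patch features with language features is cross-attention. As a purely illustrative assumption (not the paper's design), the sketch below shows single-head dot-product cross-attention with no learned projections, where each text feature gathers information from the patch features it best matches:

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def cross_attend(text_feats, patch_feats):
    """Each text feature attends over all patch features (single head,
    no projections; a toy stand-in for a fine-grained fusion layer)."""
    out = []
    d = len(patch_feats[0])
    for q in text_feats:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in patch_feats]
        w = softmax(scores)
        out.append([sum(wj * k[i] for wj, k in zip(w, patch_feats))
                    for i in range(d)])
    return out
```

In this toy form, a query aligned with a particular patch pulls its output toward that patch, which is the behavior a fine-grained fusion layer needs for dense, small objects that global CLIP-style alignment can miss.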
Journal introduction:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.