Language-Guided Object Localization via Refined Spotting Enhancement in Remote Sensing Imagery

IF 8.6 · CAS Tier 1 (Earth Science) · JCR Q1, Engineering, Electrical & Electronic · IEEE Transactions on Geoscience and Remote Sensing · Pub Date: 2025-04-18 · DOI: 10.1109/TGRS.2025.3562439
Peirong Zhang;Yidan Zhang;Hui Wu;Xiaoxuan Liu;Yingyan Hou;Lei Wang
Citations: 0

Abstract

Language-guided remote sensing image object localization uses intuitive natural language interactions to locate objects of interest within satellite or drone imagery, and has a wide range of practical applications. Early research on this task typically used discriminative models that relied on predefined task heads, which limited them to locating a single object and lacked flexibility. In recent years, generative models based on multimodal large language models (MLLMs) have leveraged their language understanding ability to comprehend more complex language references and provide flexible outputs. However, these models tend to be too large and have limited accuracy in spotting dense, small objects in remote sensing scenarios. The reasons are twofold: 1) generative models treat continuous coordinate prediction as a token classification problem, which fails to reflect the actual localization gap; and 2) current methods typically use CLIP pre-trained encoders that align text-image pairs at a global level, and may not capture the fine-grained semantic object information necessary for accurate localization. To address these shortcomings, we introduce a lightweight generative model named the localization model with refined spotting enhancement (LM-RSE). Refined spotting enhancement includes the refinement of bounding box (BBox) outputs and the integration of fine-grained semantic features. Specifically, we design a BBox refinement (BBR) approach that includes a special token, a BBox decoder, and a custom regression loss function to refine the spotting precision and optimize the training process. Additionally, we propose a fine-grained semantic integration (FGSI) strategy, which integrates a fine-grained image encoder, a vision-language semantic processing (VLSP) layer, and a two-stage, full-parameter training strategy, all working together to effectively enhance the granularity of spotting.
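The abstract's first point, that per-token cross-entropy over discretized coordinates cannot reflect the actual localization gap, can be made concrete with a small, self-contained sketch. This is not the paper's implementation: the bin count, loss weights, and the combined L1 + IoU form of the regression loss are illustrative assumptions, chosen only to show the kind of geometric signal a BBox regression loss recovers that token classification discards.

```python
# Illustrative sketch (not LM-RSE's actual code): token classification
# over coordinate bins penalizes a near miss and a far miss identically,
# while a regression loss on continuous coordinates reflects the gap.
import math

NUM_BINS = 100  # hypothetical coordinate vocabulary size


def token_ce(true_bin: int, pred_dist: dict) -> float:
    """Cross-entropy of a predicted distribution over coordinate tokens."""
    return -math.log(pred_dist.get(true_bin, 1e-9))


# Ground-truth x-coordinate falls in bin 50. Two models each put all
# probability mass on a single wrong bin: one off by 1, one off by 40.
near_miss = {51: 1.0}
far_miss = {90: 1.0}

# Token classification penalizes both mistakes identically...
assert token_ce(50, near_miss) == token_ce(50, far_miss)

# ...while regressing the continuous coordinate distinguishes them.
assert abs(0.50 - 0.51) < abs(0.50 - 0.90)


def iou(a, b) -> float:
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def bbox_regression_loss(gt, pred, w_l1=1.0, w_iou=1.0) -> float:
    """Hedged sketch of a combined L1 + IoU box loss; the custom
    regression loss in the paper's BBR module may differ in form."""
    l1 = sum(abs(g - p) for g, p in zip(gt, pred)) / 4
    return w_l1 * l1 + w_iou * (1.0 - iou(gt, pred))
```

Under such a loss, a box that overlaps the ground truth almost entirely scores far better than one that barely touches it, which is exactly the localization signal the abstract argues token classification fails to provide.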
Building on this, we further refine the language-guided object localization task into two types: expression-guided and class-guided. For the former, we evaluate on the RSVGD dataset and achieve state-of-the-art performance; for the latter, our results surpass those of GeoChat. Our code and checkpoints will be released at https://github.com/Zhang-Peirong/LM-RSE.
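The abstract does not specify the VLSP layer's internals, but the general mechanism it names, fusing text with fine-grained image features, is commonly realized as cross-attention from text tokens to image-patch features. The following is a rough, hypothetical stdlib sketch of that mechanism only; the dimensions and the single-head, projection-free form are simplifying assumptions, not the paper's architecture.

```python
# Hypothetical sketch of a VLSP-style fusion step: each text token
# (query) attends over fine-grained image-patch features (keys = values)
# via scaled dot-product cross-attention. Illustrative only.
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def cross_attention(text_feats, patch_feats):
    """Fuse each text-token vector with image-patch vectors.

    text_feats:  list of d-dim vectors, one per text token
    patch_feats: list of d-dim vectors, one per image patch
    Returns one fused d-dim vector per text token.
    """
    d = len(patch_feats[0])
    fused = []
    for q in text_feats:
        # Similarity of this text token to every image patch.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in patch_feats]
        attn = softmax(scores)  # weights over patches, summing to 1
        # Fused vector = attention-weighted average of patch features.
        fused.append([sum(a * v[j] for a, v in zip(attn, patch_feats))
                      for j in range(d)])
    return fused


# Two text tokens query four image patches (d = 3).
text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
patches = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
out = cross_attention(text, patches)
assert len(out) == 2 and len(out[0]) == 3
```

Because the attention weights sum to one, each fused text token is a convex combination of patch features, pulled toward the patches most similar to it; this is the sense in which such a layer lets language queries pick out localized, fine-grained visual evidence.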
Source journal: IEEE Transactions on Geoscience and Remote Sensing (Engineering & Technology, Geochemistry & Geophysics)
CiteScore: 11.50
Self-citation rate: 28.00%
Articles per year: 1912
Review time: 4.0 months
Journal description: IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.
Latest articles from this journal:
- Mamba-MPSE: Multi-Pattern State Evolution Based on the Mamba Model for Intra-Class Heterogeneous Wetland Classification with UAV Hyperspectral Imagery
- Physics-Constrained Adapter-Tuning of Meteorological Foundation Models for Global SST Forecasting
- ReflectGAN: Modeling Vegetation Effects for Soil Carbon Estimation from Satellite Imagery
- Synergizing Smoke and Hotspot: A Visible-Infrared Co-Learning Framework with Dataset for Large-Scale Wildfire Detection
- Probabilistic Fusion Framework Based on Fully Convolutional Networks and Graphical Models for Burnt Area Detection from Multiresolution Satellite and UAV Imagery