Ship detection in remote sensing images plays an important role in various maritime activities. However, existing deep learning methods face challenges such as changes in ship target size, complex backgrounds, and noise interference in remote sensing images, which can lead to low detection accuracy and incomplete target detection. To address these issues, we propose a synthetic aperture radar (SAR) image target detection framework called SDWPNet, aimed at improving target detection performance in complex scenes. First, we propose SDWavetpool (SDW), which optimizes feature downsampling through multiscale wavelet features, effectively reducing the dimensionality of the feature map while preserving the detailed information of small targets. It can more accurately identify medium and large targets in complex backgrounds while fully utilizing multilevel features. Then, the network structure is optimized with a feature extraction module that incorporates the PPA mechanism, making it focus more on the details of small targets. In addition, we further improve the detection accuracy with an improved loss function (ICMPIoU). Experiments on the SAR ship detection dataset (SSDD) and the high-resolution SAR image dataset (HRSID) show that the framework performs well in both detection accuracy and response speed, achieving 74.5% and 67.6% $\mathrm{mAP}_{.50:.95}$, respectively, with only 2.97 M parameters.
{"title":"SDWPNet: A Downsampling-Driven Network for SAR Ship Detection With Refined Features and Optimized Loss","authors":"Xingyu Hu;Hongyu Chen;Yugang Chang;Xue Yang;Weiming Zeng","doi":"10.1109/LGRS.2025.3629377","DOIUrl":"https://doi.org/10.1109/LGRS.2025.3629377","url":null,"abstract":"Ship detection in remote sensing images plays an important role in various maritime activities. However, the existing deep learning methods face challenges, such as changes in ship target size, complex backgrounds, and noise interference in remote sensing images, which can lead to low detection accuracy and incomplete target detection. To address these issues, we proposed a synthetic aperture radar (SAR) image target detection framework called SDWPNet, aimed at improving target detection performance in complex scenes. First, we proposed SDWavetpool (SDW), which optimizes feature downsampling through multiscale wavelet features, effectively reducing the dimensionality of the feature map while preserving the detailed information of small targets. It can more accurately identify medium and large targets in complex backgrounds, fully utilizing multilevel features. Then, the network structure was optimized using a feature extraction module that combines the PPA mechanism, making it more focused on the details of small targets. In addition, we further improved the detection accuracy by improving the loss function (ICMPIoU). The experiments on the SAR ship detection dataset (SSDD) and high-resolution SAR image dataset (HRSID) show that this framework performs well in both accuracy and response speed of target detection, achieving 74.5% and 67.6% in <inline-formula> <tex-math>$mathbf {mAP_{.50:.95}}$ </tex-math></inline-formula>, using only parameter 2.97 M.","PeriodicalId":91017,"journal":{"name":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","volume":"23 ","pages":"1-5"},"PeriodicalIF":4.4,"publicationDate":"2025-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145612117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-05 | DOI: 10.1109/LGRS.2025.3629303
Elman Ghazaei;Erchan Aptoula
ConvNets and Vision Transformers (ViTs) have been widely used for change detection (CD), though they exhibit limitations: long-range dependencies are not effectively captured by the former, while the latter are associated with high computational demands. Vision Mamba, based on State Space Models, has been proposed as an alternative, yet has been primarily utilized as a feature extraction backbone. In this work, the change state space model (CSSM) is introduced as a task-specific approach for CD, designed to focus exclusively on relevant changes between bitemporal images while filtering out irrelevant information. Through this design, the number of parameters is reduced, computational efficiency is improved, and robustness is enhanced. CSSM is evaluated on three benchmark datasets, where superior performance is achieved compared to ConvNets, ViTs, and Mamba-based models, at a significantly lower computational cost. The code will be made publicly available at https://github.com/Elman295/CSSM upon acceptance
{"title":"Efficient Remote Sensing Change Detection With Change State Space Models","authors":"Elman Ghazaei;Erchan Aptoula","doi":"10.1109/LGRS.2025.3629303","DOIUrl":"https://doi.org/10.1109/LGRS.2025.3629303","url":null,"abstract":"ConvNets and Vision Transformers (ViTs) have been widely used for change detection (CD), though they exhibit limitations: long-range dependencies are not effectively captured by the former, while the latter are associated with high computational demands. Vision Mamba, based on State Space Models, has been proposed as an alternative, yet has been primarily utilized as a feature extraction backbone. In this work, the change state space model (CSSM) is introduced as a task-specific approach for CD, designed to focus exclusively on relevant changes between bitemporal images while filtering out irrelevant information. Through this design, the number of parameters is reduced, computational efficiency is improved, and robustness is enhanced. CSSM is evaluated on three benchmark datasets, where superior performance is achieved compared to ConvNets, ViTs, and Mamba-based models, at a significantly lower computational cost. The code will be made publicly available at <uri>https://github.com/Elman295/CSSM</uri> upon acceptance","PeriodicalId":91017,"journal":{"name":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","volume":"23 ","pages":"1-5"},"PeriodicalIF":4.4,"publicationDate":"2025-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145560636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-30 | DOI: 10.1109/LGRS.2025.3626855
Dat Minh-Tien Nguyen;Thien Huynh-The
Remote sensing object detection faces challenges such as small object sizes, complex backgrounds, and computational constraints. To overcome these challenges, we propose XSNet, an efficient deep learning (DL) model designed to enhance feature representation and multiscale detection. Concretely, XSNet introduces three key innovations: a swin-involution transformer (SIner) to improve local self-attention and spatial adaptability, positional weight bi-level routing attention (PosWeightRA) to refine spatial awareness and preserve positional encoding, and an X-shaped multiscale feature fusion strategy to optimize feature aggregation while reducing computational cost. These components collectively improve detection accuracy, particularly for small and overlapping objects. Through extensive experiments, XSNet achieves mAP0.5 and mAP0.95 scores of 47.1% and 28.2% on VisDrone2019, and 92.9% and 66.0% on RSOD. It outperforms state-of-the-art models while maintaining a compact size of 7.11 million parameters and a fast inference time of 35.5 ms, making it well-suited for real-time remote sensing in resource-constrained environments.
{"title":"XSNet: Lightweight Object Detection Model Using X-Shaped Architecture in Remote Sensing Images","authors":"Dat Minh-Tien Nguyen;Thien Huynh-The","doi":"10.1109/LGRS.2025.3626855","DOIUrl":"https://doi.org/10.1109/LGRS.2025.3626855","url":null,"abstract":"Remote sensing object detection faces challenges such as small object sizes, complex backgrounds, and computational constraints. To overcome these challenges, we propose XSNet, an efficient deep learning (DL) model proficiently designed to enhance feature representation and multiscale detection. Concretely, XSNet introduces three key innovations: swin-involution transformer (SIner) to improve local self-attention and spatial adaptability, positional weight bi-level routing attention (PosWeightRA) to refine spatial awareness and preserve positional encoding, and an X-shaped multiscale feature fusion strategy to optimize feature aggregation while reducing computational cost. These components collectively improve detection accuracy, particularly for small and overlapping objects. Through extensive experiments, XSNet achieves impressive mAP0.5 and mAP0.95 scores of 47.1% and 28.2% on VisDrone2019, and 92.9% and 66.0% on RSOD. It outperforms state-of-the-art models while maintaining a compact size of 7.11 million parameters and fast inference time of 35.5 ms, making it well-suited for real-time remote sensing in resource-constrained environments.","PeriodicalId":91017,"journal":{"name":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","volume":"22 ","pages":"1-5"},"PeriodicalIF":4.4,"publicationDate":"2025-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145510076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-30 | DOI: 10.1109/LGRS.2025.3626786
Binge Cui;Shengyun Liu;Jing Zhang;Yan Lu
Coastline extraction from remote sensing imagery is persistently challenged by intra-class heterogeneity (e.g., diverse coastline types) and boundary ambiguity. Existing methods often exhibit suboptimal performance in complex scenes mixing artificial and natural landforms, as they tend to ignore coastline morphological priors and struggle to recover details in low-contrast regions. To address these issues, this letter introduces TopoSegNet, a novel collaborative framework centered on a dual-decoder architecture. A segmentation decoder utilizes a morphology-aware attention (MAA) module to adaptively decouple and model diverse coastline morphologies and a structure-detail synergistic enhancement (SDSE) module to reconstruct weak boundaries with high fidelity. Meanwhile, a learnable topology decoder frames topology construction as a graph reasoning task, which ensures the geometric and topological integrity of the final vector output. TopoSegNet was evaluated on the public Landsat-8 dataset and a custom Lianyungang Gaofen-1 (GF-1) dataset. The experimental results show that the proposed method reached 98.64% mIoU, 66.80% BIoU, and 0.795 average path length similarity (APLS), verifying its validity and superiority. Compared with state-of-the-art methods, TopoSegNet demonstrates significantly higher accuracy and topological fidelity.
{"title":"TopoSegNet: Enhancing Geometric Fidelity of Coastline Extraction via a Joint Segmentation and Topological Reasoning Framework","authors":"Binge Cui;Shengyun Liu;Jing Zhang;Yan Lu","doi":"10.1109/LGRS.2025.3626786","DOIUrl":"https://doi.org/10.1109/LGRS.2025.3626786","url":null,"abstract":"Coastline extraction from remote sensing imagery is persistently challenged by intra-class heterogeneity (e.g., diverse coastline types) and boundary ambiguity. Existing methods often exhibit suboptimal performance in complex scenes mixing artificial and natural landforms, as they tend to ignore coastline morphological priors and struggle to recover details in low-contrast regions. To address these issues, this letter introduces TopoSegNet, a novel collaborative framework centered on a dual-decoder architecture. A segmentation decoder utilizes a morphology-aware attention (MAA) module to adaptively decouple and model diverse coastline morphologies and a structure-detail synergistic enhancement (SDSE) module to reconstruct weak boundaries with high fidelity. Meanwhile, a learnable topology decoder frames topology construction as a graph reasoning task, which ensures the geometric and topological integrity of the final vector output. TopoSegNet was evaluated on the public Landsat-8 and a custom Lianyungang Gaofen-1 (GF-1) dataset. The experimental results show that the proposed method reached 98.64%, 66.80%, and 0.795% on the mIoU, BIoU, and average path length similarity (APLS) metrics, respectively, verifying its validity and superiority. Compared to the state-of-the-art methods, the TopoSegNet model demonstrates significantly higher accuracy and topological fidelity.","PeriodicalId":91017,"journal":{"name":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","volume":"23 ","pages":"1-5"},"PeriodicalIF":4.4,"publicationDate":"2025-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145612124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep learning approaches that jointly learn feature extraction have achieved remarkable progress in image matching. However, current methods often treat central and neighboring pixels uniformly and use static feature selection strategies that fail to account for environmental variations. This results in limited robustness of descriptors and keypoints, thereby affecting matching accuracy. To address these limitations, we propose a robust joint optimization network for feature detection and description in optical and SAR image matching. A center-weighted module (CWM) is designed to enhance local feature representation by emphasizing the hierarchical relationship between central and surrounding features. Furthermore, a multiscale gated aggregation (MSGA) module is introduced to suppress redundant responses and improve keypoint discriminability through a gating mechanism. To address the inconsistency of score maps across heterogeneous modalities, we design a position-constrained repeatability loss to guide the network in learning stable and consistent keypoint correspondences. Experimental results across various scenarios demonstrate that the proposed method outperforms state-of-the-art techniques in terms of both matching accuracy and the number of correct matches, highlighting its robustness and effectiveness.
{"title":"A Robust Joint Optimization Network for Feature Detection and Description in Optical and SAR Image Matching","authors":"Xinshan Zhang;Zhitao Fu;Menghua Li;Shaochen Zhang;Han Nie;Bo-Hui Tang","doi":"10.1109/LGRS.2025.3626750","DOIUrl":"https://doi.org/10.1109/LGRS.2025.3626750","url":null,"abstract":"Deep learning approaches that jointly learn feature extraction have achieved remarkable progress in image matching. However, current methods often treat central and neighboring pixels uniformly and use static feature selection strategies that fail to account for environmental variations. This results in limited robustness of descriptors and keypoints, thereby affecting matching accuracy. To address these limitations, we propose a robust joint optimization network for feature detection and description in optical and SAR image matching. A center-weighted module (CWM) is designed to enhance local feature representation by emphasizing the hierarchical relationship between central and surrounding features. Furthermore, a multiscale gated aggregation (MSGA) module is introduced to suppress redundant responses and improve keypoint discriminability through a gating mechanism. To address the inconsistency of score maps across heterogeneous modalities, we design a position-constrained repeatability loss to guide the network in learning stable and consistent keypoint correspondences. Experimental results across various scenarios demonstrate that the proposed method outperforms state-of-the-art techniques in terms of both matching accuracy and the number of correct matches, highlighting its robustness and effectiveness.","PeriodicalId":91017,"journal":{"name":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","volume":"23 ","pages":"1-5"},"PeriodicalIF":4.4,"publicationDate":"2025-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145537627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-28 | DOI: 10.1109/LGRS.2025.3626369
Haowen Jin;Yuankang Ye;Chang Liu;Feng Gao
Precipitation nowcasting using radar echo data is critical for issuing timely extreme weather warnings, yet the existing models struggle to balance computational efficiency with prediction accuracy when modeling complex, nonlinear echo sequences. To address these challenges, we propose MambaCast, a novel dual-branch precipitation nowcasting model built upon the Mamba framework. Specifically, MambaCast incorporates three key components: a state-space model (SSM) branch, a convolutional neural network (CNN) branch and a CastFusion module. The SSM branch captures global low-frequency evolution features in the radar echo field through a selective scanning mechanism, while the CNN branch extracts local high-frequency transient features using gated spatiotemporal attention (gSTA). The CastFusion module dynamically integrates features across different frequency scales, enabling adaptive fusion of spatiotemporal distribution. Experiments on two public radar datasets show that MambaCast consistently outperforms baseline models.
{"title":"MambaCast: An Efficient Precipitation Nowcasting Model With Dual-Branch Mamba","authors":"Haowen Jin;Yuankang Ye;Chang Liu;Feng Gao","doi":"10.1109/LGRS.2025.3626369","DOIUrl":"https://doi.org/10.1109/LGRS.2025.3626369","url":null,"abstract":"Precipitation nowcasting using radar echo data is critical for issuing timely extreme weather warnings, yet the existing models struggle to balance computational efficiency with prediction accuracy when modeling complex, nonlinear echo sequences. To address these challenges, we propose MambaCast, a novel dual-branch precipitation nowcasting model built upon the Mamba framework. Specifically, MambaCast incorporates three key components: a state-space model (SSM) branch, a convolutional neural network (CNN) branch and a CastFusion module. The SSM branch captures global low-frequency evolution features in the radar echo field through a selective scanning mechanism, while the CNN branch extracts local high-frequency transient features using gated spatiotemporal attention (gSTA). The CastFusion module dynamically integrates features across different frequency scales, enabling adaptive fusion of spatiotemporal distribution. Experiments on two public radar datasets show that MambaCast consistently outperforms baseline models.","PeriodicalId":91017,"journal":{"name":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","volume":"23 ","pages":"1-5"},"PeriodicalIF":4.4,"publicationDate":"2025-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145537628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-20 | DOI: 10.1109/LGRS.2025.3623244
Aybora Köksal;A. Aydın Alatan
Remote sensing (RS) applications often rely on edge hardware that cannot host today's 7B-parameter vision language models. This letter presents TinyRS, the first 2B-parameter vision language model (VLM) optimized for RS, and TinyRS-R1, its reasoning-augmented variant. Based on Qwen2-VL-2B, TinyRS is trained via a four-stage pipeline: pretraining on million-scale satellite imagery, instruction tuning, fine-tuning with chain-of-thought (CoT) annotations from a new reasoning dataset, and group relative policy optimization (GRPO)-based alignment. TinyRS-R1 matches or surpasses recent 7B RS models in classification, visual question answering (VQA), grounding, and open-ended QA, while using one third of the memory and latency. CoT reasoning improves grounding and scene understanding, while TinyRS excels at concise, low-latency VQA. TinyRS-R1 is the first domain-specialized small VLM with GRPO-aligned CoT reasoning for general-purpose RS. The code, models, and caption datasets are available at https://github.com/aybora/TinyRS
{"title":"TinyRS-R1: Compact Vision Language Model for Remote Sensing","authors":"Aybora Köksal;A. Aydın Alatan","doi":"10.1109/LGRS.2025.3623244","DOIUrl":"https://doi.org/10.1109/LGRS.2025.3623244","url":null,"abstract":"Remote sensing (RS) applications often rely on edge hardware that cannot host the models in the 7B parametric vision language of today. This letter presents TinyRS, the first 2B-parameter vision language models (VLMs) optimized for RS, and TinyRS-R1, its reasoning-augmented variant. Based on Qwen2-VL-2B, TinyRS is trained via a four-stage pipeline: pretraining on million-scale satellite images, instruction tuning, fine-tuning with chain-of-thought (CoT) annotations from a new reasoning dataset, and group relative policy optimization (GRPO)-based alignment. TinyRS-R1 matches or surpasses recent 7B RS models in classification, visual question answering (VQA), grounding, and open-ended QA—while using one third of the memory and latency. CoT reasoning improves grounding and scene understanding, while TinyRS excels at concise, low-latency VQA. TinyRS-R1 is the first domain-specialized small VLM with GRPO-aligned CoT reasoning for general-purpose RS. The code, models, and caption datasets are available at <uri>https://github.com/aybora/TinyRS</uri>","PeriodicalId":91017,"journal":{"name":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","volume":"22 ","pages":"1-5"},"PeriodicalIF":4.4,"publicationDate":"2025-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145405232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-13 | DOI: 10.1109/LGRS.2025.3620872
Jingfan Wang;Wen Lu;Zeming Zhang;Zhaoyang Wang;Zhe Li
Transformer-based methods for remote sensing image super-resolution (SR) face challenges in reconstructing high-frequency textures due to the interference from large flat regions, such as farmlands and water bodies. To address these limitations, we propose a channel-enhanced multiscale window attention mechanism, which is designed to minimize the impact of flat regions on high-frequency area reconstruction while effectively utilizing the intrinsic multiscale features of remote sensing images. To better capture the multiscale features of remote sensing images, we introduce a series of depthwise separable convolution kernels of varying sizes during the shallow feature extraction stage. Experimental results demonstrate that the proposed method achieves superior peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) scores across multiple remote sensing benchmark datasets and scaling factors, validating its effectiveness.
{"title":"Multiscale Window Attention Channel Enhanced for Remote Sensing Image Super-Resolution","authors":"Jingfan Wang;Wen Lu;Zeming Zhang;Zhaoyang Wang;Zhe Li","doi":"10.1109/LGRS.2025.3620872","DOIUrl":"https://doi.org/10.1109/LGRS.2025.3620872","url":null,"abstract":"Transformer-based methods for remote sensing image super-resolution (SR) face challenges in reconstructing high-frequency textures due to the interference from large flat regions, such as farmlands and water bodies. To address these limitations, we propose a channel-enhanced multiscale window attention mechanism, which is designed to minimize the impact of flat regions on high-frequency area reconstruction while effectively utilizing the intrinsic multiscale features of remote sensing images. To better capture the multiscale features of remote sensing images, we introduce a series of depthwise separable convolution kernels of varying sizes during the shallow feature extraction stage. Experimental results demonstrate that the proposed method achieves superior peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) scores across multiple remote sensing benchmark datasets and scaling factors, validating its effectiveness.","PeriodicalId":91017,"journal":{"name":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","volume":"23 ","pages":"1-5"},"PeriodicalIF":4.4,"publicationDate":"2025-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145778330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pretrained vision–language models (VLMs) have demonstrated promising performance in remote sensing (RS) image–text retrieval tasks. However, the scarcity of high-quality image–text datasets remains a challenge in fine-tuning VLMs for RS. The captions in existing datasets tend to be uniform and lack detail. To fully use the rich detailed information in RS images, we propose a method to fine-tune VLMs. We first construct a new visual–language dataset that balances both global and local information for RS (GLRS) image–text retrieval. Specifically, a multimodal large language model (MLLM) is used to generate captions for local patches and global captions for the entire image. To effectively use local information, we propose a global and local image captioning method (GLCap). With a large language model (LLM), we further obtain higher quality captions by merging both global and local captions. Finally, we fine-tune the weights of RS-M-CLIP (contrastive language–image pretraining) with a progressive global–local fine-tuning strategy on GLRS. Experimental results demonstrate that our method outperforms state-of-the-art (SoTA) approaches on two common RS image–text retrieval downstream tasks. Our code and dataset are available at https://github.com/hhu-czy/GLRS
{"title":"Integrating Global and Local Information for Remote Sensing Image–Text Retrieval","authors":"Ziyun Chen;Fan Liu;Zhangqingyun Guan;Qian Zhou;Xiaocong Zhou;Chuanyi Zhang","doi":"10.1109/LGRS.2025.3616154","DOIUrl":"https://doi.org/10.1109/LGRS.2025.3616154","url":null,"abstract":"Pretrained vision–language models (VLMs) have demonstrated promising performance in remote sensing (RS) image–text retrieval tasks. However, the scarcity of high-quality image–text datasets remains a challenge in fine-tuning VLMs for RS. The captions in existing datasets tend to be uniform and lack details. To fully use rich detailed information from RS images, we propose a method to fine-tune VLMs. We first construct a new visual–language dataset that balances both global and local information for RS (GLRS) image–text retrieval. Specifically, a multimodal large language model (MLLM) is used to generate captions for local patches and global captions for the entire image. To effectively use local information, we propose a global and local image captioning method (GLCap). With a large language model (LLM), we further obtain higher quality captions by merging both global and local captions. Finally, we fine-tune the weights of RS-M-contrastive language image pretraining (CLIP) with a progressive global–local fine-tuning strategy on GLRS. Experimental results demonstrate that our method outperforms state-of-the-art (SoTA) approaches on two common RS image–text retrieval downstream tasks. Our code and dataset are available at <uri>https://github.com/hhu-czy/GLRS</uri>","PeriodicalId":91017,"journal":{"name":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","volume":"22 ","pages":"1-5"},"PeriodicalIF":4.4,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-12 | DOI: 10.1109/LGRS.2025.3609444
Li Liu;Yongcheng Zhou;Hang Xu;Jingxia Li;Jianguo Zhang;Lijun Zhou;Bingjie Wang
Automatic underground object classification based on deep learning (DL) has been widely used in ground penetrating radar (GPR) fields. However, its excellent performance heavily depends on sufficient labeled training data. In GPR fields, large amounts of labeled data are difficult to obtain due to time-consuming and experience-dependent manual annotation work. To address the issue of limited labeled data, we propose a novel semi-supervised learning (SSL) method for urban-road underground multiclass object classification. It fully utilizes abundant unlabeled data and limited labeled data to enhance classification performance. We applied a variant of the triple-GAN (TGAN) model and modified it by introducing a similarity constraint, which is associated with GPR data geometric features and can help to produce high-quality generated images. Experimental results of laboratory and field data show that it has higher accuracy than representative baseline methods under limited labeled data.
{"title":"Semi-Supervised Triple-GAN With Similarity Constraint for Automatic Underground Object Classification Using Ground Penetrating Radar Data","authors":"Li Liu;Yongcheng Zhou;Hang Xu;Jingxia Li;Jianguo Zhang;Lijun Zhou;Bingjie Wang","doi":"10.1109/LGRS.2025.3609444","DOIUrl":"https://doi.org/10.1109/LGRS.2025.3609444","url":null,"abstract":"Automatic underground object classification based on deep learning (DL) has been widely used in ground penetrating radar (GPR) fields. However, its excellent performance heavily depends on sufficient labeled training data. In GPR fields, large amounts of labeled data are difficult to obtain due to time-consuming and experience-dependent manual annotation work. To address the issue of limited labeled data, we propose a novel semi-supervised learning (SSL) method for urban-road underground multiclass object classification. It fully utilizes abundant unlabeled data and limited labeled data to enhance classification performance. We applied a variant of the triple-GAN (TGAN) model and modified it by introducing a similarity constraint, which is associated with GPR data geometric features and can help to produce high-quality generated images. Experimental results of laboratory and field data show that it has higher accuracy than representative baseline methods under limited labeled data.","PeriodicalId":91017,"journal":{"name":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","volume":"22 ","pages":"1-5"},"PeriodicalIF":4.4,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145078645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}