{"title":"Toward Efficient and Accurate Remote Sensing Image–Text Retrieval With a Coarse-to-Fine Approach","authors":"Wenqian Zhou;Hanlin Wu;Pei Deng","doi":"10.1109/LGRS.2024.3494543","DOIUrl":null,"url":null,"abstract":"Existing remote sensing (RS) image-text retrieval methods generally fall into two categories: dual-stream approaches and single-stream approaches. Dual-stream models are efficient but often lack sufficient interaction between visual and textual modalities, while single-stream models offer high accuracy but suffer from prolonged inference time. To pursue a tradeoff between efficiency and accuracy, we propose a novel coarse-to-fine image-text retrieval (CFITR) framework that integrates both dual-stream and single-stream architectures into a two-stage retrieval process. Our method begins with a dual-stream hashing module (DSHM) to perform coarse retrieval by leveraging precomputed hash codes for efficiency. In the subsequent fine retrieval stage, a single-stream module (SSM) refines these results using a joint transformer to improve accuracy through enhanced cross-modal interactions. We introduce a local feature enhancement module (LFEM) based on convolutions to capture detailed local features and a postprocessing similarity reranking (PPSR) algorithm that optimizes retrieval results without additional training. Extensive experiments on the RSICD and RSITMD datasets demonstrate that our CFITR framework significantly improves retrieval accuracy and supports real-time performance. Our code is publicly available at \n<uri>https://github.com/ZhWenQian/CFITR</uri>\n.","PeriodicalId":91017,"journal":{"name":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","volume":"22 ","pages":"1-5"},"PeriodicalIF":0.0000,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10747393/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Existing remote sensing (RS) image-text retrieval methods generally fall into two categories: dual-stream approaches and single-stream approaches. Dual-stream models are efficient but often lack sufficient interaction between visual and textual modalities, while single-stream models offer high accuracy but suffer from prolonged inference time. To pursue a tradeoff between efficiency and accuracy, we propose a novel coarse-to-fine image-text retrieval (CFITR) framework that integrates both dual-stream and single-stream architectures into a two-stage retrieval process. Our method begins with a dual-stream hashing module (DSHM) to perform coarse retrieval by leveraging precomputed hash codes for efficiency. In the subsequent fine retrieval stage, a single-stream module (SSM) refines these results using a joint transformer to improve accuracy through enhanced cross-modal interactions. We introduce a local feature enhancement module (LFEM) based on convolutions to capture detailed local features and a postprocessing similarity reranking (PPSR) algorithm that optimizes retrieval results without additional training. Extensive experiments on the RSICD and RSITMD datasets demonstrate that our CFITR framework significantly improves retrieval accuracy and supports real-time performance. Our code is publicly available at https://github.com/ZhWenQian/CFITR.
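
The coarse-to-fine idea described in the abstract can be illustrated with a minimal sketch (not the authors' implementation): precomputed hash codes from a dual-stream encoder give a fast Hamming-distance shortlist, and a more expensive single-stream scorer reranks only that shortlist. The function names, the dot-product stand-in for the joint transformer, and the dimensions below are hypothetical placeholders.

```python
import numpy as np

def to_hash(embeddings: np.ndarray) -> np.ndarray:
    """Binarize real-valued embeddings into {0, 1} hash codes by sign thresholding."""
    return (embeddings > 0).astype(np.uint8)

def hamming_distance(query_code: np.ndarray, gallery_codes: np.ndarray) -> np.ndarray:
    """Hamming distance between one query code and every gallery code."""
    return np.count_nonzero(gallery_codes != query_code, axis=1)

def coarse_to_fine_retrieval(query_emb, gallery_embs, fine_scorer, top_k=32):
    """Two-stage retrieval: cheap Hamming shortlist, then cross-modal rescoring.

    query_emb    : (d,) embedding of the text query from a dual-stream encoder
    gallery_embs : (N, d) precomputed image embeddings
    fine_scorer  : callable(query_emb, candidate_embs) -> (k,) relevance scores,
                   standing in here for the single-stream joint transformer
    """
    # Stage 1: coarse retrieval with precomputed hash codes (fast, approximate).
    q_code = to_hash(query_emb[None, :])[0]
    g_codes = to_hash(gallery_embs)
    candidates = np.argsort(hamming_distance(q_code, g_codes))[:top_k]

    # Stage 2: fine retrieval -- rerank only the shortlist with the costly scorer.
    scores = fine_scorer(query_emb, gallery_embs[candidates])
    return candidates[np.argsort(-scores)]

# Toy usage with a random gallery and a dot-product scorer as a placeholder.
rng = np.random.default_rng(0)
gallery = rng.standard_normal((1000, 256))
query = rng.standard_normal(256)
dummy_scorer = lambda q, cands: cands @ q
print(coarse_to_fine_retrieval(query, gallery, dummy_scorer, top_k=10))
```

The design point this sketch captures is the efficiency/accuracy split: the O(N) stage touches only binary codes, while the expensive cross-modal model is applied to a fixed-size candidate set, which is what allows the paper's two-stage framework to remain close to real time.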