Multimodal learning with only image data: A deep unsupervised model for street view image retrieval by fusing visual and scene text features of images

IF 2.1 | Zone 3 (Earth Sciences) | Q2 GEOGRAPHY | Transactions in GIS | Pub Date: 2024-02-24 | DOI: 10.1111/tgis.13146

Shangyou Wu, Wenhao Yu, Yifan Zhang, Mengqiu Huang



Abstract

Image retrieval, a classic task in information retrieval, identifies images that share similar features with a query image so that users can conveniently find the information they need in a large image collection. Street view image retrieval in particular has many applications, such as improving navigation and mapping services, formulating urban development plans, and analyzing the historical evolution of buildings. However, the intricate foreground and background details of street view images, coupled with a lack of attribute annotations, make it one of the most challenging retrieval problems in practice. Current image retrieval research relies mainly on two kinds of model: visual models, which depend entirely on the visual features of the image, and multimodal learning models, which require additional data sources (e.g., annotated text). Yet creating annotated datasets is expensive, and street view images, although they contain a large amount of scene text themselves, are often unannotated. This paper therefore proposes a deep unsupervised learning algorithm that combines visual and text features extracted from the image data alone to improve the accuracy of street view image retrieval. Specifically, we employ text detection algorithms to locate scene text, use a Pyramidal Histogram of Characters (PHOC) encoding predictor model to extract text information from the images, deploy deep convolutional neural networks for visual feature extraction, and incorporate a contrastive learning module for image retrieval. Tests on three street view image datasets show that our model holds certain advantages over state-of-the-art multimodal models pre-trained on extensive datasets, while having fewer parameters and requiring fewer floating-point operations. Code and data are available at https://github.com/nwuSY/svtRetrieval.
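
To make the text branch of the pipeline more concrete, the following is a minimal sketch of a Pyramidal Histogram of Characters encoding and of retrieval over fused descriptors. It is an illustration under stated assumptions, not the authors' implementation: the abstract describes a predictor model that estimates the encoding from image regions, whereas this sketch computes the encoding directly from a known string. The 36-symbol alphabet, the pyramid levels (2, 3, 4, 5), the unigram-only histogram, the concatenation-based `fuse` function, and all function names are hypothetical choices made for illustration.

```python
import math
import string

ALPHABET = string.ascii_lowercase + string.digits  # assumed 36-symbol alphabet
LEVELS = (2, 3, 4, 5)                               # assumed pyramid levels


def phoc(word, alphabet=ALPHABET, levels=LEVELS):
    """Return a binary PHOC vector for `word` (unigrams only, an assumption)."""
    chars = [c for c in word.lower() if c in alphabet]
    n = len(chars)
    vec = []
    for level in levels:
        for region in range(level):
            hist = [0] * len(alphabet)
            # Work in integer units of 1/(n*level) to avoid float rounding:
            # character i spans [i*level, (i+1)*level],
            # region r spans    [r*n,     (r+1)*n].
            for i, ch in enumerate(chars):
                overlap = (min((i + 1) * level, (region + 1) * n)
                           - max(i * level, region * n))
                # Assign the character if at least half of it lies in the region.
                if 2 * overlap >= level:
                    hist[alphabet.index(ch)] = 1
            vec.extend(hist)
    return vec


def l2_normalize(vec):
    """Scale `vec` to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(visual, text):
    """Hypothetical fusion: concatenate the two L2-normalized descriptors."""
    return l2_normalize(visual) + l2_normalize(text)


def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def rank(query, gallery):
    """Return gallery indices ordered from most to least similar to `query`."""
    scores = [cosine(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Made-up shop-sign strings and 2-D "visual" features, for illustration only.
    signs = ["bakery", "hotel", "cafe"]
    visual = [[0.2, 0.9], [0.8, 0.1], [0.3, 0.7]]
    gallery = [fuse(v, phoc(s)) for v, s in zip(visual, signs)]
    query = fuse([0.25, 0.75], phoc("cafe"))
    print(rank(query, gallery))
```

In a real system the visual vectors would come from a convolutional network and the text vectors from a model that predicts the encoding from detected text regions. The sketch also omits the contrastive learning stage mentioned in the abstract.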


Source journal

Transactions in GIS (GEOGRAPHY)

CiteScore: 4.60
Self-citation rate: 8.30%
Articles published: 116

Journal description: Transactions in GIS is an international journal that provides a forum for high-quality, original research articles, review articles, short notes, and book reviews that focus on:
- practical and theoretical issues influencing the development of GIS
- the collection, analysis, modelling, interpretation and display of spatial data within GIS
- the connections between GIS and related technologies
- new GIS applications which help to solve problems affecting the natural or built environments, or business