Partial Scene Text Retrieval

IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 18.6) | Pub Date: 2024-11-19 | DOI: 10.1109/TPAMI.2024.3496576

Hao Wang;Minghui Liao;Zhouyi Xie;Wenyu Liu;Xiang Bai

Volume 47, Issue 3, pp. 1548-1563 | https://ieeexplore.ieee.org/document/10758313/

Citations: 0

Abstract




Partial scene text retrieval aims to localize and search an image gallery for text instances that are the same as or similar to a given query text. Existing methods can only handle text-line instances; searching for partial patches within those text-line instances remains unsolved because the training data lack patch annotations. To address this issue, we propose a network that simultaneously retrieves both text-line instances and their partial patches. Our method embeds the two types of data (query text and scene text instances) into a shared feature space and measures their cross-modal similarities. To handle partial patches, it adopts Multiple Instance Learning (MIL) to learn their similarities with the query text, without requiring extra annotations. However, constructing bags, a standard step in conventional MIL approaches, can introduce numerous noisy training samples and lower inference speed. We therefore propose Ranking MIL (RankMIL) to adaptively filter those noisy samples. Additionally, we present a Dynamic Partial Match Algorithm (DPMA) that directly searches for the target partial patch within a text-line instance at inference time, without requiring bags. This greatly improves both search efficiency and partial-patch retrieval performance. We evaluate the proposed method on English and Chinese datasets in two tasks: retrieving text-line instances and retrieving partial patches. For English text retrieval, our method outperforms state-of-the-art approaches by 8.04% mAP and 12.71% mAP on average across three datasets for the two tasks, respectively. For Chinese text retrieval, it surpasses state-of-the-art approaches by 24.45% mAP and 38.06% mAP on average across three datasets for the two tasks, respectively.
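To make the terms in the abstract concrete, the following is a minimal sketch of the conventional bag-based MIL baseline that the abstract contrasts against, plus the average-precision metric behind mAP. It is not the paper's RankMIL or DPMA, whose details the abstract does not give. The function names, the use of cosine similarity, the max-pooling bag score, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two embeddings (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def bag_score(query: Vector, patches: Sequence[Vector]) -> float:
    """Conventional MIL max-pooling: a bag (the candidate patches of one
    text line) scores as high as its best-matching patch."""
    return max(cosine(query, p) for p in patches)


def rank_gallery(query: Vector, bags: Sequence[Sequence[Vector]]) -> List[int]:
    """Return bag indices sorted from most to least similar to the query."""
    scores = [bag_score(query, bag) for bag in bags]
    return sorted(range(len(bags)), key=lambda i: scores[i], reverse=True)


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """AP for one query; mAP is the mean of this value over all queries."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Toy data (hypothetical): one query embedding and three text-line bags.
query = [1.0, 0.0]
bags = [
    [[0.0, 1.0], [0.1, 0.9]],  # no patch close to the query
    [[0.9, 0.1], [0.0, 1.0]],  # one patch close to the query
    [[0.5, 0.5]],              # a single, partially similar patch
]
order = rank_gallery(query, bags)
print(order)  # [1, 2, 0]

relevant = {0, 1}  # hypothetical ground truth for this query
print(round(average_precision([i in relevant for i in order]), 4))  # 0.8333
```

Scoring every patch in every bag is the cost the abstract attributes to bag construction, which is the motivation it gives for searching a text-line instance directly at inference time.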
