Instruction-Guided Scene Text Recognition

IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 18.6) Pub Date: 2025-01-03 DOI: 10.1109/TPAMI.2025.3525526

Yongkun Du; Zhineng Chen; Yuchen Su; Caiyan Jia; Yu-Gang Jiang

Vol. 47, No. 4, pp. 2723-2738. Full record: https://ieeexplore.ieee.org/document/10820836/

Citations: 0

Abstract


Instruction-Guided Scene Text Recognition

Multi-modal models have shown appealing performance in visual recognition tasks, as free-form text-guided training evokes the ability to understand fine-grained visual content. However, current models cannot be trivially applied to scene text recognition (STR) due to the compositional difference between natural and text images. We propose a novel instruction-guided scene text recognition (IGTR) paradigm that formulates STR as an instruction learning problem and understands text images by predicting character attributes, e.g., character frequency, position, etc. IGTR first devises $\langle \text{condition}, \text{question}, \text{answer} \rangle$ instruction triplets, providing rich and diverse descriptions of character attributes. To effectively learn these attributes through question-answering, IGTR develops a lightweight instruction encoder, a cross-modal feature fusion module and a multi-task answer head, which guides nuanced text image understanding. Furthermore, IGTR realizes different recognition pipelines simply by using different instructions, enabling a character-understanding-based text reasoning paradigm that differs from current methods considerably. Experiments on English and Chinese benchmarks show that IGTR outperforms existing models by significant margins, while maintaining a small model size and fast inference speed. Moreover, by adjusting the sampling of instructions, IGTR offers an elegant way to tackle the recognition of rarely appearing and morphologically similar characters, which were previous challenges.
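To make the idea of instruction triplets concrete, the following is a minimal sketch of how condition, question, answer triplets about character frequency and position might be generated from a ground-truth label. It is an illustration only and not the authors' implementation: the `Instruction` class, the `build_instructions` function, and the wording of the conditions and questions are all hypothetical, and the abstract does not specify how IGTR actually encodes or samples its instructions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instruction:
    """Hypothetical container for one <condition, question, answer> triplet."""
    condition: str
    question: str
    answer: str


def build_instructions(label: str) -> List[Instruction]:
    """Build toy triplets describing character attributes of a text label.

    Assumption: the phrasing below is invented for illustration; the paper's
    real instruction format is not given in the abstract.
    """
    triplets: List[Instruction] = []

    # Frequency attribute: how often each distinct character occurs.
    for ch, count in sorted(Counter(label).items()):
        triplets.append(Instruction(
            condition="no characters are given",
            question=f"how many times does '{ch}' appear?",
            answer=str(count),
        ))

    # Position attribute: which character sits at each index,
    # conditioned on the characters that precede it.
    for i, ch in enumerate(label):
        triplets.append(Instruction(
            condition=f"the preceding characters are '{label[:i]}'",
            question=f"which character is at position {i}?",
            answer=ch,
        ))

    return triplets


if __name__ == "__main__":
    for triplet in build_instructions("hello"):
        print(triplet)
```

In this toy setup, answering the position questions in index order recovers the label one character at a time, which loosely mirrors the abstract's remark that different recognition pipelines arise simply from using different instructions.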
