CLIP4STR: A Simple Baseline for Scene Text Recognition With Pre-Trained Vision-Language Model

IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society (IF 13.7). Pub Date: 2024-12-25. DOI: 10.1109/TIP.2024.3512354

Shuai Zhao;Ruijie Quan;Linchao Zhu;Yi Yang

Volume 33, pages 6893-6904. Publication type: Journal Article. URL: https://ieeexplore.ieee.org/document/10816351/

Citation count: 0

Abstract

Pre-trained vision-language models (VLMs) are the de facto foundation models for various downstream tasks. However, scene text recognition (STR) methods still prefer backbones pre-trained on a single modality, namely the visual modality, despite the potential of VLMs to serve as powerful scene text readers. For example, CLIP can robustly identify regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in images. Building on these merits, we transform CLIP into a scene text reader and introduce CLIP4STR, a simple yet effective STR method built upon the image and text encoders of CLIP. It has two encoder-decoder branches: a visual branch and a cross-modal branch. The visual branch provides an initial prediction based on the visual feature, and the cross-modal branch refines this prediction by addressing the discrepancy between the visual feature and text semantics. To fully leverage the capabilities of both branches, we design a dual predict-and-refine decoding scheme for inference. We scale CLIP4STR in terms of model size, pre-training data, and training data, achieving state-of-the-art performance on 13 STR benchmarks. Additionally, we provide a comprehensive empirical study to improve the understanding of how CLIP adapts to STR. Our method establishes a simple yet strong baseline for future STR research with VLMs.
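The two-branch data flow described in the abstract (an initial prediction from the visual branch, then a refinement by the cross-modal branch) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the decoding details, so the function names, the dictionary used as an "image", and the lexicon lookup that stands in for the cross-modal branch are all assumptions made purely for illustration. The actual method builds on CLIP's image and text encoders rather than on string matching.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins: a real system would pass image tensors, not dicts.
Image = Dict[str, str]
VisualBranch = Callable[[Image], str]
CrossModalBranch = Callable[[Image, str], str]


def predict_and_refine(
    image: Image,
    visual_branch: VisualBranch,
    cross_modal_branch: CrossModalBranch,
) -> Tuple[str, str]:
    """Return (initial, refined): the visual guess, then its refinement."""
    initial = visual_branch(image)                # step 1: vision-only prediction
    refined = cross_modal_branch(image, initial)  # step 2: refine using the text guess
    return initial, refined


def toy_visual_branch(image: Image) -> str:
    # Assumption: pretend the visual decoder produced this noisy string.
    return image["noisy_reading"]


def make_toy_cross_modal_branch(lexicon: List[str]) -> CrossModalBranch:
    # Assumption: a lexicon lookup replaces the learned cross-modal decoder.
    def refine(image: Image, initial: str) -> str:
        # The toy ignores the image; a real cross-modal branch would use it.
        candidates = [w for w in lexicon if len(w) == len(initial)]
        if not candidates:
            return initial  # nothing comparable: keep the visual prediction
        # Pick the same-length word with the fewest mismatched characters.
        return min(candidates, key=lambda w: sum(a != b for a, b in zip(w, initial)))

    return refine


image = {"noisy_reading": "c0ffee"}
refiner = make_toy_cross_modal_branch(["coffee", "toffee", "office"])
initial, refined = predict_and_refine(image, toy_visual_branch, refiner)
print(initial, "->", refined)  # c0ffee -> coffee
```

The point of the sketch is only the control flow: the second stage receives both the image and the first-stage text, so it can correct characters that are visually ambiguous but semantically implausible.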

