Scene Text Image Superresolution Through Multiscale Interaction of Structural and Semantic Priors

Zhongjie Zhu; Lei Zhang; Yongqiang Bai; Yuer Wang; Pei Li

IEEE Transactions on Artificial Intelligence, published 2024-03-19. DOI: 10.1109/TAI.2024.3375836. Available at https://ieeexplore.ieee.org/document/10473520/. Citations: 0.
Abstract
Scene text image superresolution (STISR) aims to enhance the resolution of images containing text within a scene, making the text more readable and easier to recognize. The technique has broad applications in fields such as autonomous driving, document scanning, and image retrieval. However, most existing STISR methods do not fully exploit the multiscale structural and semantic information within scene text images. As a result, the quality of the restored text images is insufficient, which significantly degrades subsequent tasks such as text detection and recognition. Hence, this article proposes a novel scheme that leverages multiscale structural and semantic priors to efficiently guide text semantic restoration, ultimately yielding high-quality text images. First, a multiscale interaction attention (MSIA) module is designed to capture location-specific details of structural features at various scales and to facilitate the recovery of semantic information. Second, a multiscale prior learning module (MSPLM) is developed, in which skip connections among the encoder-decoder pairs strengthen both structural and semantic prior features, thereby enhancing the upsampling and reconstruction capabilities. Finally, building upon the MSPLM, cascaded encoders are linked through residual connections to further enrich the multiscale features and bolster the representational capacity of the prior. Experiments on the standard TextZoom dataset show average recognition accuracies of 64.4%, 53.5%, and 60.8% for three evaluators, namely the attentional scene text recognizer (ASTER), convolutional recurrent neural network (CRNN), and multi-object rectified attention network (MORAN), respectively, surpassing most existing methods, including the state-of-the-art ones.
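The abstract gives no implementation details of the MSIA module, so the following is only a toy NumPy sketch of the general idea it names: letting features at one scale attend to features at another, then fusing them residually. All function names, shapes, and the nearest-neighbor alignment step are hypothetical illustration, not the authors' architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def upsample_nearest(x, factor):
    # Nearest-neighbor upsampling: (C, H, W) -> (C, H*factor, W*factor).
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def cross_scale_attention(fine, coarse):
    """Toy cross-scale attention fusion (hypothetical stand-in for MSIA).

    fine:   (C, H, W) fine-scale structural features
    coarse: (C, H//2, W//2) coarse-scale features carrying semantic context
    """
    C, H, W = fine.shape
    coarse_up = upsample_nearest(coarse, 2)         # align spatial scales
    q = fine.reshape(C, H * W).T                    # (HW, C) queries from fine scale
    k = coarse_up.reshape(C, H * W).T               # (HW, C) keys from coarse scale
    v = k                                           # values share the coarse features
    attn = softmax(q @ k.T / np.sqrt(C), axis=-1)   # (HW, HW) cross-scale weights
    fused = (attn @ v).T.reshape(C, H, W)           # coarse context gathered per position
    return fine + fused                             # residual fusion, as in the cascaded design

rng = np.random.default_rng(0)
fine = rng.standard_normal((8, 16, 16))
coarse = rng.standard_normal((8, 8, 8))
out = cross_scale_attention(fine, coarse)
print(out.shape)  # (8, 16, 16)
```

The residual return mirrors, in miniature, the abstract's use of residual connections to enrich multiscale features without discarding the original fine-scale structure.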