Scene Text Image Superresolution Through Multiscale Interaction of Structural and Semantic Priors

Zhongjie Zhu; Lei Zhang; Yongqiang Bai; Yuer Wang; Pei Li

IEEE Transactions on Artificial Intelligence, published 2024-03-19. DOI: 10.1109/TAI.2024.3375836. Available at https://ieeexplore.ieee.org/document/10473520/. Citations: 0.
Abstract
Scene text image superresolution (STISR) aims to enhance the resolution of images containing text within a scene, making the text more readable and easier to recognize. The technique has broad applications in fields such as autonomous driving, document scanning, and image retrieval. However, most existing STISR methods do not fully exploit the multiscale structural and semantic information within scene text images. As a result, the quality of the restored text images is insufficient, which significantly degrades subsequent tasks such as text detection and recognition. Hence, this article proposes a novel scheme that leverages multiscale structural and semantic priors to efficiently guide text semantic restoration, ultimately yielding high-quality text images. First, a multiscale interaction attention (MSIA) module is designed to capture location-specific details of structural features at various scales and to facilitate the recovery of semantic information. Second, a multiscale prior learning module (MSPLM) is developed, in which skip connections among the encoder-decoder pairs strengthen both structural and semantic prior features, thereby enhancing the upsampling and reconstruction capabilities. Finally, building upon the MSPLM, cascaded encoders are linked through residual connections to further enrich the multiscale features and bolster the representational capacity of the prior. Experiments on the standard TextZoom dataset show average recognition accuracies of 64.4%, 53.5%, and 60.8% for three evaluators, namely the attentional scene text recognizer (ASTER), convolutional recurrent neural network (CRNN), and multi-object rectified attention network (MORAN), respectively, surpassing most existing methods, including the state-of-the-art ones.
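The abstract gives no implementation details of the MSIA module, so the following is only a toy NumPy sketch of the general idea it names: letting features at one scale attend to features at another, then fusing them residually. All function names, shapes, and the nearest-neighbor alignment step are hypothetical illustration, not the authors' architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def upsample_nearest(x, factor):
    # Nearest-neighbor upsampling: (C, H, W) -> (C, H*factor, W*factor).
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def cross_scale_attention(fine, coarse):
    """Toy cross-scale attention fusion (hypothetical stand-in for MSIA).

    fine:   (C, H, W) fine-scale structural features
    coarse: (C, H//2, W//2) coarse-scale features carrying semantic context
    """
    C, H, W = fine.shape
    coarse_up = upsample_nearest(coarse, 2)         # align spatial scales
    q = fine.reshape(C, H * W).T                    # (HW, C) queries from fine scale
    k = coarse_up.reshape(C, H * W).T               # (HW, C) keys from coarse scale
    v = k                                           # values share the coarse features
    attn = softmax(q @ k.T / np.sqrt(C), axis=-1)   # (HW, HW) cross-scale weights
    fused = (attn @ v).T.reshape(C, H, W)           # coarse context gathered per position
    return fine + fused                             # residual fusion, as in the cascaded design

rng = np.random.default_rng(0)
fine = rng.standard_normal((8, 16, 16))
coarse = rng.standard_normal((8, 8, 8))
out = cross_scale_attention(fine, coarse)
print(out.shape)  # (8, 16, 16)
```

The residual return mirrors, in miniature, the abstract's use of residual connections to enrich multiscale features without discarding the original fine-scale structure.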