Double supervision for scene text detection and recognition based on BMINet

IF 2.7 | Zone 3 (Engineering & Technology) | JCR Q2 (ENGINEERING, ELECTRICAL & ELECTRONIC) | Signal Processing-Image Communication | Pub Date: 2025-01-01 | Epub Date: 2024-10-26 | DOI: 10.1016/j.image.2024.117226

Hanyang Wan, Ruoyun Liu, Li Yu

Signal Processing-Image Communication, vol. 130, Article 117226. Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S0923596524001279

Citations: 0

Abstract

Scene text detection and recognition are prominent research areas in computer vision, with a broad range of potential applications in fields such as intelligent driving and automated production. Existing mainstream methods, however, suffer from notable deficiencies, including incomplete text region detection, excessive background noise, and a failure to consider global information and contextual dependencies simultaneously. In this study, we introduce BMINet, a scene text detection approach based on boundary fitting, paired with a double-supervised scene text recognition method that incorporates text region correction. The BMINet framework is built around two components: a boundary fitting module and a multi-scale fusion module. The boundary fitting module samples a specified number of control points equidistantly along the predicted boundary and adjusts their positions to better align the detection box with the text shape. The multi-scale fusion module integrates information from multi-scale feature maps to expand the network's receptive field. The double-supervised recognition method integrates image processing modules for rotated rectangular boxes and binary image segmentation, and it introduces a correction network to refine text region boundaries. It combines recognition techniques based on CTC loss and attention mechanisms, emphasizing texture details and contextual dependencies in text images so that dual supervision improves network performance. Extensive ablation and comparison experiments confirm that the two-stage model delivers robust detection and recognition, with a recognition accuracy of 80.6% on the Total-Text dataset.
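To make the boundary fitting step more concrete, the following is a minimal sketch of equidistant control point sampling along a closed predicted boundary. It is an illustration under stated assumptions, not the authors' implementation: the function name `sample_control_points`, the polygon input format, and the use of arc-length spacing are all hypothetical choices, and the learned position adjustment that BMINet applies afterwards is not shown.

```python
import math


def sample_control_points(boundary, num_points):
    """Sample num_points points at equal arc-length spacing along a closed polygon.

    boundary: list of (x, y) vertices of a predicted text boundary
              (hypothetical input format, assumed for this sketch).
    num_points: number of control points to place on the boundary.
    """
    if num_points < 1 or len(boundary) < 2:
        raise ValueError("need at least one point and two boundary vertices")

    # Close the polygon by returning to the first vertex.
    closed = list(boundary) + [boundary[0]]
    seg_lengths = [math.dist(p, q) for p, q in zip(closed, closed[1:])]
    perimeter = sum(seg_lengths)
    if perimeter == 0:
        raise ValueError("boundary has zero length")

    step = perimeter / num_points
    points = []
    seg = 0          # index of the segment currently being walked
    seg_start = 0.0  # arc length at the start of that segment
    for k in range(num_points):
        target = k * step
        # Advance to the segment that contains the target arc length.
        while seg < len(seg_lengths) - 1 and target >= seg_start + seg_lengths[seg]:
            seg_start += seg_lengths[seg]
            seg += 1
        (x0, y0), (x1, y1) = closed[seg], closed[seg + 1]
        # Fractional position inside the segment (0.0 for degenerate segments).
        t = (target - seg_start) / seg_lengths[seg] if seg_lengths[seg] > 0 else 0.0
        points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return points


if __name__ == "__main__":
    # A 4 x 2 rectangle stands in for a coarse predicted text boundary.
    rectangle = [(0, 0), (4, 0), (4, 2), (0, 2)]
    for point in sample_control_points(rectangle, 6):
        print(point)
```

In a detector of this kind, the sampled points would then be shifted by predicted offsets so that the polygon follows the text shape more closely. How BMINet predicts those offsets is not described in the abstract.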

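Source journal

Signal Processing-Image Communication (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 8.40
Self-citation rate: 2.90%
Articles published: 138
Review time: 5.2 months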

About the journal: Signal Processing: Image Communication is an international journal for the development of the theory and practice of image communication. Its primary objectives are the following: To present a forum for the advancement of theory and practice of image communication. To stimulate cross-fertilization between areas similar in nature which have traditionally been separated, for example, various aspects of visual communications and information systems. To contribute to a rapid information exchange between the industrial and academic environments. The editorial policy and the technical content of the journal are the responsibility of the Editor-in-Chief, the Area Editors and the Advisory Editors. The Journal is self-supporting from subscription income and contains a minimum amount of advertisements. Advertisements are subject to the prior approval of the Editor-in-Chief. The journal welcomes contributions from every country in the world. Signal Processing: Image Communication publishes articles relating to aspects of the design, implementation and use of image communication systems. The journal features original research work, tutorial and review articles, and accounts of practical developments. Subjects of interest include image/video coding, 3D video representations and compression, 3D graphics and animation compression, HDTV and 3DTV systems, video adaptation, video over IP, peer-to-peer video networking, interactive visual communication, multi-user video conferencing, wireless video broadcasting and communication, visual surveillance, 2D and 3D image/video quality measures, pre/post processing, video restoration and super-resolution, multi-camera video analysis, motion analysis, content-based image/video indexing and retrieval, face and gesture processing, video synthesis, 2D and 3D image/video acquisition and display technologies, architectures for image/video processing and communication.