{"title":"Attention Guidance by Cross-Domain Supervision Signals for Scene Text Recognition","authors":"Fanfu Xue;Jiande Sun;Yaqi Xue;Qiang Wu;Lei Zhu;Xiaojun Chang;Sen-Ching Cheung","doi":"10.1109/TIP.2024.3523799","DOIUrl":null,"url":null,"abstract":"Despite recent advances, scene text recognition remains a challenging problem due to the significant variability, irregularity and distortion in text appearance and localization. Attention-based methods have become the mainstream due to their superior vocabulary learning and observation ability. Nonetheless, they are susceptible to attention drift which can lead to word recognition errors. Most works focus on correcting attention drift in decoding but completely ignore the error accumulated during the encoding process. In this paper, we propose a novel scheme, called the Attention Guidance by Cross-Domain Supervision Signals for Scene Text Recognition (ACDS-STR), which can mitigate the attention drift at the feature encoding stage. At the heart of the proposed scheme is the cross-domain attention guidance and feature encoding fusion module (CAFM) that uses the core areas of characters to recursively guide attention to learn in the encoding process. With precise attention information sourced from CAFM, we propose a non-attention-based adaptive transformation decoder (ATD) to guarantee decoding performance and improve decoding speed. In the training stage, we fuse manual guidance and subjective learning to learn the core areas of characters, which notably augments the recognition performance of the model. Experiments are conducted on public benchmarks and show the state-of-the-art performance. The source will be available at <uri>https://github.com/xuefanfu/ACDS-STR</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"717-728"},"PeriodicalIF":0.0000,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10838318/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Despite recent advances, scene text recognition remains challenging due to the significant variability, irregularity, and distortion in text appearance and localization. Attention-based methods have become the mainstream owing to their superior vocabulary-learning and observation abilities. Nonetheless, they are susceptible to attention drift, which can lead to word recognition errors. Most prior work focuses on correcting attention drift during decoding but ignores the errors accumulated during encoding. In this paper, we propose a novel scheme, Attention Guidance by Cross-Domain Supervision Signals for Scene Text Recognition (ACDS-STR), which mitigates attention drift at the feature-encoding stage. At the heart of the proposed scheme is the cross-domain attention guidance and feature encoding fusion module (CAFM), which uses the core areas of characters to recursively guide attention learning during encoding. With the precise attention information produced by CAFM, we propose a non-attention-based adaptive transformation decoder (ATD) that guarantees decoding performance while improving decoding speed. In the training stage, we fuse manual guidance with the model's own subjective learning to locate the core areas of characters, which notably improves the model's recognition performance. Experiments on public benchmarks demonstrate state-of-the-art performance. The source code will be available at https://github.com/xuefanfu/ACDS-STR.
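To make the encoding-stage guidance concrete, below is a minimal sketch, assuming a PyTorch setting, of one plausible way to supervise encoder attention maps with character core-area masks via an auxiliary loss. The class name, tensor shapes, and choice of KL divergence are illustrative assumptions, not the paper's actual CAFM implementation, whose details are not given in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGuidanceLoss(nn.Module):
    """Auxiliary loss that pulls encoder attention maps toward
    character core-area masks (the cross-domain supervision signal).
    All names and shapes here are hypothetical, for illustration only.
    """

    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.eps = eps

    def forward(self, attn_maps: torch.Tensor, core_masks: torch.Tensor) -> torch.Tensor:
        # attn_maps:  (B, T, H, W) attention over encoder features,
        #             one map per character slot
        # core_masks: (B, T, H, W) binary masks marking each character's core area
        attn = attn_maps.flatten(2)                                   # (B, T, H*W)
        attn = attn / (attn.sum(dim=-1, keepdim=True) + self.eps)     # normalize to a distribution
        target = core_masks.flatten(2)
        target = target / (target.sum(dim=-1, keepdim=True) + self.eps)
        # KL divergence from the core-area distribution to the attention map;
        # kl_div expects the input in log-space and the target as probabilities
        return F.kl_div((attn + self.eps).log(), target, reduction="batchmean")
```

In training, such a term would typically be weighted and added to the recognition loss, e.g. `loss = rec_loss + lam * guide_loss`, so that the encoder's attention is steered toward the labeled core areas while the recognizer is trained end to end.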