{"title":"Optical character recognition with transformers and CTC","authors":"Israel Campiotti, R. Lotufo","doi":"10.1145/3558100.3563845","DOIUrl":null,"url":null,"abstract":"Text recognition tasks are commonly solved by using a deep learning pipeline called CRNN. The classical CRNN is a sequence of a convolutional network, followed by a bidirectional LSTM and a CTC layer. In this paper, we perform an extensive analysis of the components of a CRNN to find what is crucial to the entire pipeline and what characteristics can be exchanged for a more effective choice. Given the results of our experiments, we propose two different architectures for the task of text recognition. The first model, CNN + CTC, is a combination of a convolutional model followed by a CTC layer. The second model, CNN + Tr + CTC, adds an encoder-only Transformers between the convolutional network and the CTC layer. To the best of our knowledge, this is the first time that a Transformers have been successfully trained using just CTC loss. To assess the capabilities of our proposed architectures, we train and evaluate them on the SROIE 2019 data set. Our CNN + CTC achieves an F1 score of 89.66% possessing only 4.7 million parameters. CNN + Tr + CTC attained an F1 score of 93.76% with 11 million parameters, which is almost 97% of the performance achieved by the TrOCR using 334 million parameters and more than 600 million synthetic images for pretraining.","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"89 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 22nd ACM Symposium on Document Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3558100.3563845","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
Text recognition tasks are commonly solved with a deep learning pipeline called CRNN. The classical CRNN is a sequence of a convolutional network, followed by a bidirectional LSTM and a CTC layer. In this paper, we perform an extensive analysis of the components of a CRNN to find what is crucial to the entire pipeline and which components can be exchanged for more effective choices. Given the results of our experiments, we propose two different architectures for the task of text recognition. The first model, CNN + CTC, is a convolutional model followed directly by a CTC layer. The second model, CNN + Tr + CTC, adds an encoder-only Transformer between the convolutional network and the CTC layer. To the best of our knowledge, this is the first time that a Transformer has been successfully trained using just the CTC loss. To assess the capabilities of our proposed architectures, we train and evaluate them on the SROIE 2019 dataset. Our CNN + CTC achieves an F1 score of 89.66% with only 4.7 million parameters. CNN + Tr + CTC attains an F1 score of 93.76% with 11 million parameters, which is almost 97% of the performance achieved by TrOCR using 334 million parameters and more than 600 million synthetic images for pretraining.
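To make the CNN + Tr + CTC idea concrete, below is a minimal PyTorch sketch: a small convolutional backbone that collapses the image height into a width-wise feature sequence, an encoder-only Transformer in place of the CRNN's BiLSTM, and a linear head trained with CTC loss. The layer sizes, backbone design, and vocabulary size are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch of a CNN + Transformer encoder + CTC text recognizer.
# All hyperparameters here are assumptions for demonstration, not the
# configuration reported in the paper.
import torch
import torch.nn as nn

class CNNTrCTC(nn.Module):
    def __init__(self, num_classes: int, d_model: int = 256):
        super().__init__()
        # Convolutional backbone: reduces the image to a single row of
        # features, giving one feature vector per horizontal position.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, d_model, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse height to 1
        )
        # Encoder-only Transformer replacing the CRNN's bidirectional LSTM.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # Per-timestep class logits; num_classes includes the CTC blank.
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.cnn(images)                 # (B, d_model, 1, W')
        seq = feats.squeeze(2).permute(0, 2, 1)  # (B, W', d_model)
        seq = self.encoder(seq)
        return self.head(seq)                    # (B, W', num_classes)

# Training step with CTC loss (blank index 0 assumed):
model = CNNTrCTC(num_classes=80)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
images = torch.randn(2, 1, 32, 128)              # dummy grayscale crops
logits = model(images)                           # (B, T, C)
log_probs = logits.log_softmax(-1).permute(1, 0, 2)  # CTC wants (T, B, C)
targets = torch.randint(1, 80, (2, 10))          # dummy label sequences
input_lengths = torch.full((2,), log_probs.size(0), dtype=torch.long)
target_lengths = torch.full((2,), 10, dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

Dropping the `encoder` (feeding the CNN features straight to the linear head) yields the simpler CNN + CTC variant; in both cases the CTC loss lets the model learn the alignment between image columns and characters without per-character position labels.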