Comparison of OCR Accuracy on Early Printed Books using the Open Source Engines Calamari and OCRopus

C. Wick, Christian Reul, F. Puppe
Journal: J. Lang. Technol. Comput. Linguistics
DOI: 10.21248/jlcl.33.2018.219 (https://doi.org/10.21248/jlcl.33.2018.219)
Published: 2018-07-01 (Journal Article)
Citations: 18

Abstract

This paper proposes a combination of a convolutional and an LSTM network to improve the accuracy of OCR on early printed books. While the default approach of line-based OCR is to use a single LSTM layer, as provided by the well-established OCR software OCRopus (OCRopy), we utilize a CNN and pooling layer combination ahead of an LSTM layer, as implemented by the novel OCR software Calamari. Since historical prints often require book-specific models trained on manually labeled ground truth (GT), the goal is to maximize the recognition accuracy of a trained model while keeping the required manual effort to a minimum. We show that the deep model significantly outperforms the shallow LSTM network when using both many and only a few training examples, although the deep network has a higher number of trainable parameters. The error rate is reduced by up to 55%, yielding character error rates (CER) of 1% and below for 1,000 lines of training data. To further improve the results, we apply a confidence voting mechanism to achieve CERs below 0.5%. A simple data augmentation scheme and the use of pretrained models reduce the CER by up to a further 62% if only little training data is available. Thus, we require only 100 lines of GT to reach an average CER of 1.2%. The runtime of the deep model for training and prediction on a book is very similar to that of a shallow network when trained on a CPU. However, using a GPU, as supported by Calamari, reduces the prediction time by a factor of at least four and the training time by a factor of more than six.
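The CER figures quoted above are conventionally computed as the edit distance between a predicted line and its ground truth, normalized by the ground-truth length. A minimal sketch in plain Python (not the authors' code; function names are illustrative):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(prediction: str, ground_truth: str) -> float:
    """Character error rate: edit distance normalized by ground-truth length."""
    return levenshtein(prediction, ground_truth) / len(ground_truth)

# One wrong character in a five-character line:
print(cer("abXde", "abcde"))  # → 0.2
```

In practice the per-line distances and lengths are summed over a whole test set before dividing, so long lines weigh more than short ones.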
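The confidence voting mentioned above combines the predictions of several independently trained models. One plausible minimal scheme, sketched here under the simplifying assumption that the character sequences are already aligned position by position (Calamari's actual mechanism aligns full output sequences and is more involved), accumulates each model's confidence per candidate character and keeps the best-scoring one:

```python
from collections import defaultdict

def vote(predictions):
    """predictions: one list of (char, confidence) pairs per model,
    all assumed pre-aligned to the same length."""
    result = []
    for position in zip(*predictions):
        scores = defaultdict(float)
        for char, conf in position:
            scores[char] += conf  # accumulate confidence per candidate char
        result.append(max(scores, key=scores.get))
    return "".join(result)

# Three voters disagree on the middle character; two moderately
# confident votes for "a" outweigh one strong vote for "o".
p1 = [("c", 0.9), ("a", 0.6), ("t", 0.9)]
p2 = [("c", 0.8), ("o", 0.7), ("t", 0.8)]
p3 = [("c", 0.9), ("a", 0.5), ("t", 0.9)]
print(vote([p1, p2, p3]))  # → "cat"
```

Weighting votes by model confidence rather than counting them equally is what lets two uncertain but agreeing models overrule one confident outlier.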