Hybrid Training Data for Historical Text OCR

J. Martínek, Ladislav Lenc, P. Král, Anguelos Nicolaou, V. Christlein
Published in: 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019-09-01
DOI: 10.1109/ICDAR.2019.00096 (https://doi.org/10.1109/ICDAR.2019.00096)
Citations: 8

Abstract

Current optical character recognition (OCR) systems commonly use recurrent neural networks (RNNs) that process whole text lines. Such systems avoid the character segmentation step that character-based approaches require. A disadvantage of this approach is the need for a large amount of annotated data. This can be addressed by using generated synthetic data instead of costly manually annotated data. Unfortunately, such data is often unsuitable for historical documents, particularly for quality reasons. This work presents a hybrid approach for generating annotated data for OCR at low cost. We first collect a small dataset of isolated characters from historical document images. Then, we generate historical-looking text lines from these characters. Another contribution lies in the design and implementation of an OCR system based on a convolutional-LSTM network. We first pre-train this system on hybrid data. Afterwards, the network is fine-tuned with real printed text lines. We demonstrate that this training strategy is efficient for obtaining state-of-the-art results. We also show that the score of the proposed system is comparable to, or even better than, those of several state-of-the-art systems.
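The hybrid data idea, composing annotated text lines from a small set of isolated character crops, can be illustrated with a minimal sketch. The abstract does not describe the actual generator, so everything below is an assumption made for illustration: the function name `compose_text_line`, the `glyph_bank` layout (a character mapped to a list of grayscale crops), the fixed inter-character gap, and the simple bottom alignment that ignores descenders.

```python
import numpy as np


def compose_text_line(text, glyph_bank, rng, height=32, gap=2,
                      space_width=8, background=255):
    """Build one synthetic line image by concatenating character crops.

    Hypothetical helper, not the authors' generator.
    glyph_bank: dict mapping a character to a list of 2-D uint8 arrays
                (grayscale crops), each no taller than `height`.
    Returns (image, text): the line image and its transcription, which
    serves as the annotation for training.
    """
    pieces = []
    for ch in text:
        if ch == " ":
            # A blank block stands in for a word space.
            pieces.append(np.full((height, space_width), background, np.uint8))
            continue
        samples = glyph_bank[ch]
        # Pick one of the available crops at random so that repeated
        # characters do not look identical.
        glyph = samples[rng.integers(len(samples))]
        h, w = glyph.shape
        canvas = np.full((height, w), background, np.uint8)
        canvas[height - h:, :] = glyph  # bottom-align (assumption)
        pieces.append(canvas)
        pieces.append(np.full((height, gap), background, np.uint8))
    if not pieces:
        return np.full((height, 0), background, np.uint8), text
    return np.concatenate(pieces, axis=1), text


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for real character crops: dark random patches.
    bank = {
        "a": [rng.integers(0, 128, size=(12, 8), dtype=np.uint8) for _ in range(3)],
        "b": [rng.integers(0, 128, size=(20, 7), dtype=np.uint8) for _ in range(3)],
    }
    image, label = compose_text_line("ab ba", bank, rng)
    print(image.shape, label)
```

In a pipeline like the one the abstract outlines, many such (image, transcription) pairs would be used for pre-training before fine-tuning on real annotated lines. A realistic generator would also need baseline handling, variable spacing, and background texture, none of which are modeled here.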