Hybrid Training Data for Historical Text OCR

2019 International Conference on Document Analysis and Recognition (ICDAR). Pub Date: 2019-09-01. DOI: 10.1109/ICDAR.2019.00096

J. Martínek, Ladislav Lenc, P. Král, Anguelos Nicolaou, V. Christlein


Citations: 8

Abstract




Current optical character recognition (OCR) systems commonly use recurrent neural networks (RNNs) that process whole text lines. Such systems avoid the character segmentation step required by character-based approaches. A disadvantage of this approach is the need for a large amount of annotated data. This can be addressed by using generated synthetic data instead of costly manually annotated data. Unfortunately, such data is often unsuitable for historical documents, particularly for quality reasons. This work presents a hybrid approach for generating annotated data for OCR at low cost. We first collect a small dataset of isolated characters from historical document images. Then, we generate historical-looking text lines from these characters. Another contribution lies in the design and implementation of an OCR system based on a convolutional-LSTM network. We first pre-train this system on hybrid data. Afterwards, the network is fine-tuned with real printed text lines. We demonstrate that this training strategy is efficient for obtaining state-of-the-art results. We also show that the score of the proposed system is comparable to, or even better than, that of several state-of-the-art systems.
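The hybrid data idea in the abstract (isolated character crops assembled into annotated text lines) can be illustrated with a minimal sketch. The abstract does not describe the actual generation procedure, so everything below is an assumption made for illustration: the glyph store layout, the function names `pad_to_height` and `compose_line`, the fixed gap and space widths, and the simple baseline alignment are all hypothetical and not the authors' method.

```python
import random

# Hypothetical glyph store: each character maps to a list of crops.
# A crop is a grayscale image stored as a list of rows (0 = ink, 255 = paper).
demo_glyphs = {
    "a": [[[0, 255], [0, 0]]],   # one 2x2 variant of "a"
    "b": [[[0], [0], [0]]],      # one 3x1 variant of "b"
}


def pad_to_height(glyph, height, background=255):
    """Return a copy of glyph padded with background rows on top,
    so that all glyphs share a common bottom edge (assumed baseline)."""
    width = len(glyph[0])
    missing = height - len(glyph)
    padding = [[background] * width for _ in range(missing)]
    return padding + [row[:] for row in glyph]


def compose_line(text, glyphs, rng, gap=1, space_width=3, background=255):
    """Concatenate randomly chosen character crops into one line image.

    Characters without a crop are skipped, so the returned label always
    matches what was actually drawn. Returns (image, label).
    """
    picked = []
    for ch in text:
        if ch == " ":
            picked.append((ch, None))
        elif ch in glyphs:
            picked.append((ch, rng.choice(glyphs[ch])))

    heights = [len(g) for _, g in picked if g is not None]
    if not heights:
        raise ValueError("text contains no character with a known glyph")
    height = max(heights)

    line = [[] for _ in range(height)]
    for ch, glyph in picked:
        if glyph is None:
            block = [[background] * space_width for _ in range(height)]
        else:
            block = pad_to_height(glyph, height, background)
        # Append the block plus a small background gap to every row.
        for row, block_row in zip(line, block):
            row.extend(block_row + [background] * gap)

    label = "".join(ch for ch, _ in picked)
    return line, label


image, label = compose_line("ab a", demo_glyphs, random.Random(0))
```

Each call yields an image together with its transcription, which is the form of annotated pair a line-level recognizer would be pre-trained on before fine-tuning with real text lines. A realistic generator would also need several crops per character and some variation in spacing and degradation, which this sketch leaves out.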

