A Synthetic Recipe for OCR

2019 International Conference on Document Analysis and Recognition (ICDAR); Pub Date: 2019-09-01; DOI: 10.1109/ICDAR.2019.00143

David Etter, Stephen Rawls, Cameron Carpenter, Gregory Sell

Citations: 16

Abstract

Synthetic data generation for optical character recognition (OCR) promises unlimited training data at zero annotation cost. With enough fonts and seed text, we should be able to generate data to train a model whose performance approaches or exceeds that of a model trained on real annotated data. Unfortunately, this is not always the reality. Unconstrained image settings, such as internet memes, scanned web pages, or newspapers, present diverse scripts, fonts, layouts, and complex backgrounds, which cause models trained on synthetic data to break down. In this work, we investigate the synthetic image generation problem on a large multilingual set of unconstrained document images. We present a comprehensive evaluation of how synthetic data attributes affect model performance. The results provide a recipe for synthetic data generation that will help guide future research.
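To make the idea of generating data from fonts and seed text concrete, the following is a minimal sketch of the sampling step only, assuming a generator that draws one specification per text line and hands it to a separate renderer. The font names, background labels, size range, and the `LineSpec` and `sample_line_specs` names are hypothetical placeholders chosen for illustration; they are not the attributes or values evaluated in the paper.

```python
import random
from dataclasses import dataclass

# Hypothetical attribute pools; a real pipeline would load these from disk.
FONTS = ["NotoSans-Regular", "NotoSerif-Regular", "NotoNaskhArabic-Regular"]
BACKGROUNDS = ["plain_white", "newsprint_texture", "photo_crop"]
SEED_TEXT = [
    "Synthetic data generation for OCR",
    "unconstrained document images",
    "diverse scripts, fonts, and layouts",
]


@dataclass(frozen=True)
class LineSpec:
    """Describes one synthetic text-line image before it is rendered."""
    text: str
    font: str
    font_size: int
    background: str


def sample_line_specs(n, seed=0):
    """Draw n line specifications by sampling each attribute independently."""
    rng = random.Random(seed)  # seeded so a dataset can be regenerated exactly
    return [
        LineSpec(
            text=rng.choice(SEED_TEXT),
            font=rng.choice(FONTS),
            font_size=rng.randint(12, 48),  # assumed size range in points
            background=rng.choice(BACKGROUNDS),
        )
        for _ in range(n)
    ]


if __name__ == "__main__":
    for spec in sample_line_specs(3):
        print(spec)
```

Keeping the specification separate from rendering means each attribute can be varied on its own, which is the kind of controlled comparison an attribute-by-attribute evaluation needs. The text of each `LineSpec` also serves as the ground-truth label, which is why no manual annotation is required.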


