Glyph miner: A system for efficiently extracting glyphs from early prints in the context of OCR

2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL) Pub Date : 2016-06-19 DOI:10.1145/2910896.2910915

B. Budig, Thomas C. van Dijk, F. Kirchner

Citations: 6

Abstract

While off-the-shelf OCR systems work well on many modern documents, the heterogeneity of early prints provides a significant challenge. To achieve good recognition quality, existing software must be “trained” specifically to each particular corpus. This is a tedious process that involves significant user effort. In this paper we demonstrate a system that generically replaces a common part of the training pipeline with a more efficient workflow: Given a set of scanned pages of a historical document, our system uses an efficient user interaction to semi-automatically extract large numbers of occurrences of glyphs indicated by the user. In a preliminary case study, we evaluate the effectiveness of our approach by embedding our system into the workflow at the University Library Würzburg.
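To make the idea of semi-automatic glyph extraction concrete, here is a minimal sketch. It assumes the user marks one example glyph as a template, the system proposes similar regions on the page, and the user accepts or rejects each proposal. The abstract does not say how candidates are found, so the pixel-agreement template matching used here is a swapped-in illustration and not the authors' method. All names (`match_score`, `find_candidates`, `confirm`, `PAGE`, `TEMPLATE`) are hypothetical.

```python
from typing import Callable, List, Tuple

# A binarized image: each row is a string of '#' (ink) and '.' (background).
Image = List[str]
Hit = Tuple[int, int, float]  # (top row, left column, score)

# Toy "scanned page" containing the same glyph twice (assumed data).
PAGE: Image = [
    "..........",
    ".##...##..",
    ".#.#..#.#.",
    ".##...##..",
    "..........",
]

# The glyph the user indicated, cut out of the page (assumed data).
TEMPLATE: Image = [
    "##.",
    "#.#",
    "##.",
]


def match_score(page: Image, template: Image, top: int, left: int) -> float:
    """Fraction of template pixels that agree with the page at this offset."""
    height, width = len(template), len(template[0])
    agree = sum(
        1
        for r in range(height)
        for c in range(width)
        if page[top + r][left + c] == template[r][c]
    )
    return agree / (height * width)


def find_candidates(page: Image, template: Image, threshold: float = 0.9) -> List[Hit]:
    """Slide the template over the page and keep offsets scoring >= threshold."""
    height, width = len(template), len(template[0])
    hits: List[Hit] = []
    for top in range(len(page) - height + 1):
        for left in range(len(page[0]) - width + 1):
            score = match_score(page, template, top, left)
            if score >= threshold:
                hits.append((top, left, score))
    return hits


def confirm(hits: List[Hit], accept: Callable[[Hit], bool]) -> List[Hit]:
    """Stand-in for the user review step: keep only the hits the user accepts."""
    return [hit for hit in hits if accept(hit)]


if __name__ == "__main__":
    candidates = find_candidates(PAGE, TEMPLATE, threshold=1.0)
    print(candidates)
    # Simulated user decision: accept only glyphs in the left half of the page.
    print(confirm(candidates, lambda hit: hit[1] < 5))
```

Lowering `threshold` trades precision for recall: more candidates are proposed and the review step has more to reject. In a real system the `accept` callback would be replaced by an interactive prompt, and exact pixel agreement by a measure that tolerates the ink and paper variation of early prints.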


