Training Full-Page Handwritten Text Recognition Models without Annotated Line Breaks

2019 International Conference on Document Analysis and Recognition (ICDAR). Pub Date: 2019-09-01. DOI: 10.1109/ICDAR.2019.00011

Chris Tensmeyer, Curtis Wigington


Citations: 13

Abstract

Training Handwritten Text Recognition (HTR) models typically requires large amounts of labeled data, often line or page images with corresponding line-level ground truth (GT) transcriptions. Many digital collections have page-level transcriptions for each image, but the transcription is unformatted, i.e., line breaks are not annotated. Can we train line-based HTR models using such data? In this work, we present a novel alignment technique for segmenting page-level GT text into text lines during HTR model training. This text segmentation problem is formulated as an optimization problem to minimize the cost of aligning predicted lines with the GT text. Using both simulated and HTR model predictions, we show that the alignment method identifies line breaks accurately, even when the predicted lines have high character error rates (CER). We removed the GT line breaks from the ICDAR-2017 READ dataset and trained an HTR model using the proposed alignment method to predict line breaks on-the-fly. This model achieves a CER comparable to that of the same model trained with the GT line breaks. Additionally, we downloaded an online digital collection of 50K English journal pages (not curated for HTR research) whose transcriptions do not contain line breaks, and achieved 11% CER.
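To make the idea of "minimizing the cost of aligning predicted lines with the GT text" concrete, the following is a minimal sketch and not the authors' formulation. It assumes that line breaks fall only at whitespace, that each predicted line is matched to a contiguous (possibly empty) run of GT words in reading order, and that the alignment cost is the unit-cost Levenshtein distance. The function names `edit_distance` and `segment_page_text` are hypothetical and chosen only for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def segment_page_text(predicted_lines, page_text):
    """Split page_text into len(predicted_lines) lines at word boundaries.

    Returns (lines, total_cost), where total_cost is the summed edit
    distance between each predicted line and its assigned GT segment.
    Assumption (not from the paper): breaks occur only between words.
    """
    words = page_text.split()
    n_words, n_lines = len(words), len(predicted_lines)
    inf = float("inf")
    # best[k][j]: minimum cost of assigning the first j words to the first k lines
    best = [[inf] * (n_words + 1) for _ in range(n_lines + 1)]
    # back[k][j]: index of the first word of line k in the optimal solution
    back = [[0] * (n_words + 1) for _ in range(n_lines + 1)]
    best[0][0] = 0
    for k in range(1, n_lines + 1):
        for j in range(n_words + 1):
            for i in range(j + 1):
                if best[k - 1][i] == inf:
                    continue
                segment = " ".join(words[i:j])
                cost = best[k - 1][i] + edit_distance(predicted_lines[k - 1], segment)
                if cost < best[k][j]:
                    best[k][j] = cost
                    back[k][j] = i
    # Walk the back-pointers from the last line to recover the segmentation.
    lines, j = [], n_words
    for k in range(n_lines, 0, -1):
        i = back[k][j]
        lines.append(" ".join(words[i:j]))
        j = i
    lines.reverse()
    return lines, best[n_lines][n_words]
```

For example, with the noisy predictions `["tbe quick brwn", "fox jumps ovr", "the lazy dog"]` and the unformatted page text `"the quick brown fox jumps over the lazy dog"`, this sketch recovers the three lines "the quick brown", "fox jumps over", and "the lazy dog" at a total cost of 3 character edits. The exhaustive search over word boundaries is quadratic in the number of words per line, so it is suitable only as an illustration on short pages.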

