Automatic Semantic Text Tagging on Historical Lexica by Combining OCR and Typography Classification: A Case Study on Daniel Sander's Wörterbuch der Deutschen Sprache
Christian Reul, S. Göttel, U. Springmann, C. Wick, Kay-Michael Würzner, F. Puppe
Published in: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, 2019-05-08. DOI: https://doi.org/10.1145/3322905.3322910
Citation count: 3
Abstract
When converting historical lexica into electronic form, the goal is not only to obtain a high-quality OCR result for the text but also to recognize typographical attributes precisely and automatically in order to capture the logical structure. For that purpose, we present a method that enables fine-grained typography classification by training an open-source OCR engine on both traditional OCR and typography recognition, and we show how to map the obtained typography information to the OCR-recognized text output. As a test case, we used a German dictionary from the 19th century (Sander's Wörterbuch der Deutschen Sprache), in which typography serves a particularly complex semantic function. Despite the very challenging material, we achieved a character error rate below 0.4% and a typography recognition that assigns the correct label to close to 99% of the words. In contrast to many existing methods, our novel approach works with real historical data and can deal with frequent typography changes even within lines.
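The two figures reported above (character error rate and per-word label accuracy) can be illustrated with a minimal Python sketch. It assumes the common definition of character error rate as the Levenshtein edit distance divided by the reference length, and it assumes the typography labels are already aligned one-to-one with the words. The abstract does not spell out the exact evaluation protocol, so the function names and the label values (`bold`, `italic`, `regular`) are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Assumed definition: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def label_accuracy(reference: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of words whose predicted typography label matches the reference.

    Assumes both sequences are already aligned word by word.
    """
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")
    if not reference:
        raise ValueError("label sequences must not be empty")
    correct = sum(r == p for r, p in zip(reference, predicted))
    return correct / len(reference)


if __name__ == "__main__":
    # Hypothetical ground truth and OCR output for one line.
    print(character_error_rate("Wörterbuch", "Wörterbueh"))
    # Hypothetical per-word typography labels.
    print(label_accuracy(["bold", "italic", "regular", "regular"],
                         ["bold", "regular", "regular", "regular"]))
```

Under these assumptions, a character error rate below 0.4% corresponds to fewer than 4 edit operations per 1,000 reference characters.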