QATIP -- An Optical Character Recognition System for Arabic Heritage Collections in Libraries

Felix Stahlberg, S. Vogel
{"title":"QATIP -- An Optical Character Recognition System for Arabic Heritage Collections in Libraries","authors":"Felix Stahlberg, S. Vogel","doi":"10.1109/DAS.2016.81","DOIUrl":null,"url":null,"abstract":"Nowadays, commercial optical character recognition (OCR) software achieves very high accuracy on high-quality scans of modern Arabic documents. However, a large fraction of Arabic heritage collections in libraries is usually more challenging - e.g. consisting of typewritten documents, early prints, and historical manuscripts. In this paper, we present our end-user oriented QATIP system for OCR in such documents. The recognition is based on the Kaldi toolkit and sophisticated text image normalization. This paper contains two main contributions: First, we describe the QATIP interface for libraries which consists of both a graphical user interface for adding and monitoring jobs and a web API for automated access. Second, we suggest novel approaches for language modelling and ligature modelling for continuous Arabic OCR. We test our QATIP system on an early print and a historical manuscript and report substantial improvements - e.g. 12.6% character error rate with QATIP compared to 51.8% with the best OCR product in our experimental setup (Tesseract).","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DAS.2016.81","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10

Abstract

Nowadays, commercial optical character recognition (OCR) software achieves very high accuracy on high-quality scans of modern Arabic documents. However, a large fraction of Arabic heritage collections in libraries is usually more challenging - e.g. consisting of typewritten documents, early prints, and historical manuscripts. In this paper, we present our end-user oriented QATIP system for OCR in such documents. The recognition is based on the Kaldi toolkit and sophisticated text image normalization. This paper contains two main contributions: First, we describe the QATIP interface for libraries which consists of both a graphical user interface for adding and monitoring jobs and a web API for automated access. Second, we suggest novel approaches for language modelling and ligature modelling for continuous Arabic OCR. We test our QATIP system on an early print and a historical manuscript and report substantial improvements - e.g. 12.6% character error rate with QATIP compared to 51.8% with the best OCR product in our experimental setup (Tesseract).
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
图书馆阿拉伯文化遗产光学字符识别系统QATIP
如今,商用光学字符识别(OCR)软件在现代阿拉伯语文档的高质量扫描上实现了非常高的准确性。然而,图书馆中的大部分阿拉伯文化遗产收藏通常更具挑战性-例如,由打字文件,早期印刷品和历史手稿组成。在本文中,我们提出了一个面向终端用户的QATIP系统,用于这些文档的OCR。该识别基于Kaldi工具包和复杂的文本图像规范化。本文包含两个主要贡献:首先,我们描述了库的QATIP接口,该接口由用于添加和监视作业的图形用户界面和用于自动访问的web API组成。其次,我们提出了针对连续阿拉伯语OCR的语言建模和结扎建模的新方法。我们在早期印刷品和历史手稿上测试了我们的QATIP系统,并报告了实质性的改进-例如,QATIP的字符错误率为12.6%,而我们实验设置(Tesseract)中最好的OCR产品的错误率为51.8%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Handwritten and Machine-Printed Text Discrimination Using a Template Matching Approach General Pattern Run-Length Transform for Writer Identification Automatic Selection of Parameters for Document Image Enhancement Using Image Quality Assessment Large Scale Continuous Dating of Medieval Scribes Using a Combined Image and Language Model Performance of an Off-Line Signature Verification Method Based on Texture Features on a Large Indic-Script Signature Dataset
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1