HMM-based script identification for OCR

MOCR '13. Pub Date: 2013-08-24. DOI: 10.1145/2505377.2505382
Dmitriy Genzel, Ashok Popat, R. Teunen, Yasuhisa Fujii
Citations: 8

Abstract

While current OCR systems can recognize text in a growing number of scripts and languages, they typically still need to be told in advance which scripts and languages to expect. We propose an approach that repurposes the same HMM-based system used for OCR for the task of script/language ID by replacing character labels with script class labels. We apply it in a multi-pass OCR process that achieves "universal" OCR over 54 tested languages in 18 distinct scripts, across a wide variety of typefaces in each. For comparison, we also consider a brute-force approach in which a single HMM-based OCR system is trained to recognize all considered scripts. Results are presented on a large and diverse evaluation set extracted from book images, both for script identification accuracy and for overall OCR accuracy. On this evaluation data, the script ID system achieved a script ID error rate of 1.73% across the 18 distinct scripts. The end-to-end OCR system with the script ID system achieved a character error rate of 4.05%, an increase of 0.77% over the case where the languages are known a priori.
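The label-replacement idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper's system predicts script labels from image features with an HMM, whereas this example only shows how a character-level training transcription might be rewritten as a sequence of script class labels. The helper names (`script_of`, `to_script_labels`, `majority_script`), the use of the first word of the Unicode character name as a coarse script label, and the majority vote over a line are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter


def script_of(ch: str) -> str:
    """Return a coarse script label for one character.

    Assumption: the first word of the Unicode character name
    (e.g. "LATIN", "CYRILLIC") is a good enough stand-in for the script.
    Non-letters (digits, spaces, punctuation) are labelled "COMMON".
    """
    if not ch.isalpha():
        return "COMMON"
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def to_script_labels(transcription: str) -> list[str]:
    """Replace every character label with its script class label."""
    return [script_of(ch) for ch in transcription]


def majority_script(labels: list[str]) -> str:
    """Pick the most frequent script in a line, ignoring COMMON/UNKNOWN.

    Assumption: one script per line is decided by simple majority vote.
    """
    counts = Counter(l for l in labels if l not in ("COMMON", "UNKNOWN"))
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


if __name__ == "__main__":
    line = "hello мир"
    labels = to_script_labels(line)
    print(labels)
    print(majority_script(labels))
```

In a multi-pass setup like the one the abstract describes, a line-level decision of this kind could be used to select a script-specific recognizer for a later pass; how the actual system combines per-frame HMM outputs is not stated in the abstract.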