Choice of recognizable units for URDU OCR

DAR '12 Pub Date : 2012-12-16 DOI:10.1145/2432553.2432569
Gurpreet Singh Lehal
Citations: 30

Abstract

There has been considerable work on Arabic OCR. However, all of that work is based on the Naskh style. Urdu script is based on the Arabic alphabet but uses the Nastalique style. The Nastalique style makes OCR in general, and character segmentation in particular, a highly challenging task, so most researchers avoid the character segmentation phase and opt for a higher unit of recognition. For Urdu, the next higher recognition unit considered by researchers is the ligature, which lies between the character and the word. A ligature is a connected component of one or more characters, and an Urdu word is usually composed of 1 to 8 ligatures. A related issue is the identification of all possible ligatures for recognition purposes. To this end, we have performed a statistical analysis of an Urdu corpus to collect and organise the Urdu ligatures. The number of unique ligatures comes to more than 26,000, and recognizing such a large number of classes is again a Herculean task. It therefore becomes necessary to reduce the class count and look for an alternative recognition unit. From an OCR point of view, a ligature can be further segmented into one primary connected component and zero or more secondary connected components. The primary component represents the basic shape of the ligature, while the secondary connected components correspond to the dots, diacritical marks, and special symbols associated with the ligature. To reduce the class count, ligatures with similar primary components are grouped together. Further statistical analysis is performed to count the primary components and arrange them in descending order of frequency, yielding a manageable set of around 2300 recognition units that covers 99% of the Urdu corpus.
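The class-reduction step described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the function names, the toy ligature encoding (a Latin letter for the primary shape, trailing `.` characters standing in for dots and diacritics), and the counts are all hypothetical, chosen only to show how grouping ligatures by primary component and ranking the groups by frequency yields a small set of classes that reaches a target corpus coverage.

```python
from collections import Counter


def group_by_primary(ligature_counts, primary_of):
    """Sum ligature frequencies per primary-component class."""
    class_counts = Counter()
    for ligature, count in ligature_counts.items():
        class_counts[primary_of(ligature)] += count
    return class_counts


def classes_for_coverage(class_counts, target):
    """Pick the most frequent classes until their share reaches `target`."""
    total = sum(class_counts.values())
    chosen, covered = [], 0
    # most_common() yields classes in descending order of frequency.
    for cls, count in class_counts.most_common():
        if covered >= target * total:
            break
        chosen.append(cls)
        covered += count
    coverage = covered / total if total else 0.0
    return chosen, coverage


# Hypothetical toy corpus: "b.." means primary shape "b" plus two dots.
toy_counts = {"b": 50, "b.": 30, "b..": 10, "r": 6, "r.": 3, "k": 1}


def strip_marks(ligature):
    """Assumed stand-in for removing the secondary components."""
    return ligature.rstrip(".")


toy_classes = group_by_primary(toy_counts, strip_marks)
chosen, coverage = classes_for_coverage(toy_classes, 0.99)
```

In this toy corpus, six distinct ligatures collapse into three primary-component classes, and the two most frequent classes already reach the 99% target. In a real system the `primary_of` step would come from connected-component analysis of ligature images rather than from string manipulation.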