Choice of recognizable units for URDU OCR

DAR '12, Pub Date: 2012-12-16, DOI: 10.1145/2432553.2432569

Gurpreet Singh Lehal


Citations: 30

Abstract

There has been considerable work on Arabic OCR, but all of it is based on the Naskh style. Urdu script is based on the Arabic alphabet but uses the Nastalique style. Nastalique makes OCR in general, and character segmentation in particular, a highly challenging task, so most researchers avoid the character segmentation phase and opt for a higher unit of recognition. For Urdu, the next higher recognition unit considered by researchers is the ligature, which lies between the character and the word. A ligature is a connected component of one or more characters, and an Urdu word is usually composed of 1 to 8 ligatures. A related issue is identifying all possible ligatures for recognition. To this end, we performed a statistical analysis of an Urdu corpus to collect and organise the Urdu ligatures. The number of unique ligatures comes to more than 26,000, and recognizing so many classes is again a Herculean task. It therefore becomes necessary to reduce the class count and look for an alternative recognition unit. From an OCR point of view, a ligature can be further segmented into one primary connected component and zero or more secondary connected components. The primary component represents the basic shape of the ligature, while the secondary connected components correspond to the dots, diacritic marks, and special symbols associated with the ligature. To reduce the class count, ligatures with similar primary components are grouped together. Further statistical analysis counts the primary components and arranges them in descending order of frequency, yielding a manageable set of around 2,300 recognition units that covers 99% of the Urdu corpus.
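The class-reduction step described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the function name `classes_for_coverage`, the `primary_of` mapping, and the toy counts are all hypothetical, and Latin placeholder strings stand in for Urdu ligatures. The sketch merges ligature frequencies by primary component, sorts the merged classes in descending order of frequency, and keeps the most frequent ones until a target share of the corpus is covered.

```python
from collections import Counter


def classes_for_coverage(ligature_counts, primary_of, target=0.99):
    """Pick the most frequent primary-component classes that together
    reach `target` coverage of the corpus.

    ligature_counts: dict mapping ligature -> corpus frequency (toy data here).
    primary_of: hypothetical function mapping a ligature to the label of its
                primary component (its shape with dots and diacritics removed).
    """
    # Merge ligatures that share the same primary component.
    class_counts = Counter()
    for ligature, count in ligature_counts.items():
        class_counts[primary_of(ligature)] += count

    total = sum(class_counts.values())
    selected, covered = [], 0
    # most_common() yields the classes in descending order of frequency.
    for cls, count in class_counts.most_common():
        if covered >= target * total:
            break
        selected.append(cls)
        covered += count
    return selected, covered / total


# Toy example: three ligatures differ only in their dots, so they share
# one primary component (labelled "shape_1"); the labels are made up.
toy_counts = {"lig_a": 50, "lig_b": 30, "lig_c": 15, "lig_d": 4, "lig_e": 1}
toy_primary = {
    "lig_a": "shape_1",
    "lig_b": "shape_1",
    "lig_c": "shape_1",
    "lig_d": "shape_2",
    "lig_e": "shape_3",
}

selected, coverage = classes_for_coverage(toy_counts, toy_primary.get, target=0.97)
print(selected, coverage)
```

In this toy corpus, five ligatures collapse into three primary-component classes, and two of those classes are enough to pass the 97% target. The abstract reports the same kind of reduction at scale: from more than 26,000 ligatures to around 2,300 recognition units covering 99% of the corpus.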


