Script recognition is a challenging problem for document recognition in a multilingual country like India, where many different scripts are in use. For optical character recognition (OCR) of such multilingual documents, it is necessary to separate blocks, lines, words and characters of different scripts before feeding them to the OCR engines of the individual scripts. Many approaches to script recognition at different levels (block, line, word and character) have been proposed. Indian documents written in a state language typically contain English words mixed with words in that language. In this paper, we extract three different types of features from isolated English and Gurmukhi words, namely Structural, Gabor and Discrete Cosine Transform (DCT) features, and compare their script recognition performance using three different classifiers: Support Vector Machine (SVM), k-Nearest Neighbor (k-NN) and Parzen Probabilistic Neural Network (PNN).
Rajneesh Rani, R. Dhir, Gurpreet Singh Lehal. "Performance analysis of feature extractors and classifiers for script recognition of English and Gurmukhi words." DAR '12, 2012. doi:10.1145/2432553.2432559
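Of the three feature types compared, the DCT features lend themselves to a compact illustration. The sketch below is a minimal NumPy version of DCT feature extraction from a word image; the normalization size (32x32) and the number of retained low-frequency coefficients (an 8x8 block) are assumptions made for illustration, not the paper's actual parameters.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0] *= 1 / np.sqrt(2)
    return m * np.sqrt(2 / n)

def dct_features(word_img, size=32, keep=8):
    """Place (crop/pad) a word image on a size x size canvas, apply a
    separable 2-D DCT, and keep the top-left keep x keep low-frequency
    coefficients as the feature vector."""
    img = np.zeros((size, size))
    h, w = word_img.shape
    img[:min(h, size), :min(w, size)] = word_img[:size, :size]
    d = dct_matrix(size)
    coeffs = d @ img @ d.T  # 2-D DCT-II via two 1-D transforms
    return coeffs[:keep, :keep].ravel()
```

Keeping only the low-frequency block discards fine stroke detail while preserving the coarse energy distribution, which is the kind of global cue that tends to differ between scripts.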
Rim Walha, Fadoua Drira, Frank Lebourgeois, A. Alimi
This paper addresses the problem of generating a super-resolved text image from a single low-resolution image. The proposed Super-Resolution (SR) method is based on sparse coding, which suggests that image patches can be well represented as a sparse linear combination of elements from a suitably chosen learned dictionary. To this end, a High-Resolution/Low-Resolution (HR/LR) patch-pair database is collected from high-quality character images. To our knowledge, it is the first generic database allowing SR of text images that may be contained in documents, signs, labels, bills, etc. This database is used to jointly train two dictionaries, so that the sparse representation of an LR image patch over the first dictionary can be applied to generate an HR image patch from the second dictionary. The performance of this approach is evaluated and compared, visually and quantitatively, to other existing SR methods applied to text images. In addition, we examine the influence of text image resolution on automatic recognition performance and further justify the effectiveness of the proposed SR method compared to the others.
"Super-resolution of single text image by sparse representation." DAR '12, 2012. doi:10.1145/2432553.2432558
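The reconstruction step, coding an LR patch over one dictionary and synthesizing the HR patch from the other using the same coefficients, can be sketched as below. The tiny Orthogonal Matching Pursuit coder, the dictionary shapes and the sparsity level are illustrative assumptions; the paper's dictionary training procedure is not reproduced here.

```python
import numpy as np

def omp(D, y, k):
    """Orthogonal Matching Pursuit: greedily select k atoms (columns) of D
    and least-squares fit them to approximate y."""
    residual, idx = y.copy(), []
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))  # most correlated atom
        idx.append(j)
        coef, *_ = np.linalg.lstsq(D[:, idx], y, rcond=None)
        residual = y - D[:, idx] @ coef
    x = np.zeros(D.shape[1])
    x[idx] = coef
    return x

def super_resolve_patch(lr_patch, D_lr, D_hr, sparsity=3):
    """Code the LR patch over the LR dictionary, then synthesize the HR
    patch from the HR dictionary with the same sparse coefficients."""
    alpha = omp(D_lr, lr_patch.ravel(), sparsity)
    return D_hr @ alpha
```

The coupling assumption is that corresponding HR/LR patches share the same sparse code, so a code computed from the observable LR patch transfers directly to the HR dictionary.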
Sanjoy Pratihar, Partha Bhowmick, S. Sural, J. Mukhopadhyay
A novel algorithm for the detection and removal of underlines present in a scanned document page is proposed. The underlines treated here are hand-drawn and of various patterns. An important property of these underlines is that they are drawn by hand in an almost horizontal fashion. To locate them, we detect the edges of their covers as a sequence of approximately straight segments, which are grown horizontally. The novelty of the algorithm lies in the detection of these almost-straight segments from the boundary edge map of the underline parts. After obtaining the exact cover of the underlines, an effective strategy is applied for underline removal. Experimental results are given to show the efficiency and robustness of the method.
"Detection and removal of hand-drawn underlines in a document image using approximate digital straightness." DAR '12, 2012. doi:10.1145/2432553.2432576
Soumyadeep Dey, J. Mukhopadhyay, S. Sural, Partha Bhowmick
In this paper, we propose a technique for removing margin noise (both textual and non-textual) from scanned document images. We perform layout analysis to detect words, lines, and paragraphs in the document image. These detected elements are classified into text and non-text components on the basis of their characteristics (size, position, etc.). The geometric properties of the text blocks are then used to detect and remove the margin noise. We evaluate our algorithm on several scanned pages of Bengali literature books.
"Margin noise removal from printed document images." DAR '12, 2012. doi:10.1145/2432553.2432570
Performance evaluation of end-to-end OCR systems for Indic scripts requires matching the Unicode sequences of the OCR output and the ground truth. In the literature, Levenshtein edit distance has been used to compute error rates of OCR systems, but accuracies are not explicitly reported. In the present work, we propose an accuracy measure based on edit distance and use it in conjunction with the error rate to report the performance of an OCR system. We analyze the relationship between accuracy and error rate in a quantitative manner. Our analysis shows that accuracy and error rate are independent of each other, so both are needed to report the complete performance of an OCR system. The proposed approach is applicable to all Indic scripts, and experimental results on several scripts, including Devanagari, Telugu and Kannada, are presented.
P. P. Kumar, C. Bhagvati, A. Agarwal. "On performance analysis of end-to-end OCR systems of Indic scripts." DAR '12, 2012. doi:10.1145/2432553.2432577
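One plausible formulation of the two measures (the paper's exact definitions may differ) is to decompose the Levenshtein alignment into substitutions, insertions and deletions, then report the error rate as total edits over ground-truth length and the accuracy as correctly matched symbols over ground-truth length. Insertions raise the error rate without lowering this accuracy, which illustrates how the two measures can vary independently.

```python
import numpy as np

def edit_counts(gt, out):
    """Edit distance between ground truth and OCR output, decomposed into
    (substitutions, insertions, deletions) by DP backtracking."""
    m, n = len(gt), len(out)
    d = np.zeros((m + 1, n + 1), dtype=int)
    d[:, 0] = np.arange(m + 1)
    d[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i, j] = min(d[i-1, j] + 1,    # deletion from ground truth
                          d[i, j-1] + 1,    # insertion in OCR output
                          d[i-1, j-1] + (gt[i-1] != out[j-1]))
    subs = ins = dels = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i, j] == d[i-1, j-1] + (gt[i-1] != out[j-1]):
            subs += gt[i-1] != out[j-1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i, j] == d[i-1, j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, ins, dels

def ocr_scores(gt, out):
    """Error rate = total edits / |gt|; accuracy = matched symbols / |gt|."""
    s, i, d = edit_counts(gt, out)
    return (s + i + d) / len(gt), (len(gt) - s - d) / len(gt)
```

For example, an output with one spurious inserted symbol has a nonzero error rate but perfect accuracy, whereas a substitution lowers both.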
S. Belhe, Chetan Paulzagade, Akash Deshmukh, Saumya Jetley, Kapil Mehrotra
The proposed approach performs recognition of online handwritten isolated Hindi words using a combination of HMMs trained on Devanagari symbols and a tree formed by the multiple possible sequences of recognized symbols. In general, words in Indic languages are composed of a number of aksharas, or syllables, which in turn are formed by groups of consonants and vowel modifiers. Segmentation of aksharas is critical to accurate recognition of both the recognition primitives and the complete word, and recognition is itself an intricate task. Our work targets this holistic task of akshara segmentation, symbol identification and subsequent word recognition, handling it in an integrated segmentation-recognition framework. By using online stroke information to postulate symbol candidates and deriving an HOG feature set from their image counterparts, the recognition becomes independent of stroke-order and stroke-shape variations; the system is thus well suited to unconstrained handwriting. Data for this work was collected from different parts of India where the Hindi language is predominantly in use. Symbols extracted from 60,000 words are used to train and test 140 symbol-HMM models. The system is designed to output one or more candidate words to the user by tracing multiple tree paths (up to leaf nodes) under the condition that the symbol likelihood (confidence score) at every node is above a threshold. Tests performed on 10,000 words yield an accuracy of 89%.
"Hindi handwritten word recognition using HMM and symbol tree." DAR '12, 2012. doi:10.1145/2432553.2432556
Anand Mishra, Naveen Sankaran, Viresh Ranjan, C. V. Jawahar
Text line segmentation is a basic step in any OCR system, and its failure deteriorates the performance of OCR engines. This is especially true for the Indian languages, due to the nature of their scripts. Many segmentation algorithms have been proposed in the literature. Often these algorithms fail to adapt dynamically to a given page and thus tend to yield poor segmentation for some specific regions or pages. In this work we design a text line segmentation post-processor which automatically localizes and corrects segmentation errors. The proposed post-processor, which works in a "learning by examples" framework, is not only independent of the segmentation algorithm but also robust to the diversity of scanned pages. We show over 5% improvement in text line segmentation on a large dataset of scanned pages for multiple Indian languages.
"Automatic localization and correction of line segmentation errors." DAR '12, 2012. doi:10.1145/2432553.2432555
D. Dutta, Aruni Roy Chowdhury, U. Bhattacharya, S. K. Parui
Here, we present our recent attempt to develop a lightweight handwriting recognizer suitable for resource-constrained handheld devices. Such an application requires real-time recognition of handwritten characters produced on their touchscreens. The proposed approach is well suited to achieving minimal user lag on devices having only limited computing power, in sharp contrast to standard laptops or desktop computers. Moreover, the approach is user-adaptive in the sense that it can adapt through user corrections of wrong predictions; with an increasing number of interactive corrections by the user, the recognition accuracy improves significantly. An input stroke is first re-sampled to generate a fixed, small number of sample points such that at most two critical points (points of high curvature) are preserved. We use their x- and y-coordinates as the feature vector and do not compute any other high-level features. The squared Mahalanobis distance is used to identify each stroke of the input sample as one of several stroke categories pre-determined from a large pool of training samples. The inverted covariance matrix and mean vector for each stroke class, required for computing the Mahalanobis distance, are pre-calculated and stored as Serialized Objects on the SD card of the device. A Look-Up Table (LUT) with stroke combinations as keys and corresponding character classes as values is used for the final Unicode character output. In the case of an incorrect character output, user corrections are used to automatically update the LUT, adapting it to the user's particular handwriting style.
"Lightweight user-adaptive handwriting recognizer for resource constrained handheld devices." DAR '12, 2012. doi:10.1145/2432553.2432574
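The classification core of the recognizer, squared Mahalanobis distance against per-class precomputed means and inverted covariance matrices, can be sketched as follows. The feature dimensionality, class labels and the small ridge term are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

class StrokeClassifier:
    """Nearest-class classifier by squared Mahalanobis distance. The mean
    vector and inverted covariance matrix of each stroke class are computed
    once up front, mirroring the paper's precomputed, stored statistics."""

    def __init__(self, class_samples):
        self.means, self.inv_covs = {}, {}
        for label, X in class_samples.items():
            X = np.asarray(X, dtype=float)
            self.means[label] = X.mean(axis=0)
            # a small ridge keeps the covariance invertible for few samples
            cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
            self.inv_covs[label] = np.linalg.inv(cov)

    def classify(self, x):
        x = np.asarray(x, dtype=float)

        def sq_mahalanobis(label):
            d = x - self.means[label]
            return d @ self.inv_covs[label] @ d

        return min(self.means, key=sq_mahalanobis)
```

Because only a matrix-vector product and a dot product are needed per class at runtime, the cost per stroke stays low enough for interactive use on limited hardware.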
This paper describes a novel method for the detection of tables and extraction of table cell contents from handwritten document images. Given a model of the table and a document image containing a table, the hand-drawn or pre-printed table is detected and the contents of the table cells are extracted automatically. The algorithms described are designed to handle degraded binary document images. The target images may include a wide variety of noise, ranging from clutter and salt-and-pepper noise to non-text objects such as graphics and logos. The presented algorithm effectively eliminates extraneous noise and identifies potentially matching table layout candidates by detecting horizontal and vertical table line candidates. A table is represented as a matrix based on the locations of intersections of horizontal and vertical table lines, and a matching algorithm searches for the table structure that best matches the given layout model, using the matching score to eliminate spurious table line candidates. The optimally matched table candidate is then used for cell content extraction. The method was tested on a set of document page images containing tables from the challenge set of the DARPA MADCAT Arabic handwritten document image data. Preliminary results indicate that the method is effective and capable of reliably extracting text from table cells.
Zhixin Shi, S. Setlur, V. Govindaraju. "Model based table cell detection and content extraction from degraded document images." DAR '12, 2012. doi:10.1145/2432553.2432565
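The matrix representation of a table and the model-matching score can be sketched as below. The ink-presence predicate and the equality-based score are deliberate simplifications of the paper's matching algorithm, used here only to make the intersection-matrix idea concrete.

```python
import numpy as np

def intersection_matrix(h_lines, v_lines, line_present):
    """Represent a table as a 0/1 matrix over the crossings of detected
    horizontal and vertical line candidates; line_present(y, x) reports
    whether the image actually has ink at that crossing."""
    return np.array([[1 if line_present(y, x) else 0 for x in v_lines]
                     for y in h_lines], dtype=int)

def match_score(candidate, model):
    """Fraction of intersections on which the candidate agrees with the
    layout model; higher scores rank candidates, and low scores flag
    spurious line candidates for elimination."""
    if candidate.shape != model.shape:
        return 0.0
    return float((candidate == model).mean())
```

A candidate built from spurious line detections disagrees with the model at many crossings, so its score drops and it is pruned before cell content extraction.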
Offline handwritten character recognition (OHCR) is the process of converting handwritten text into a machine-processable form. Since the late sixties, efforts have been made toward offline handwritten character recognition throughout the world. Principal Component Analysis (PCA) has also been used to extract representative features for character recognition. In order to assess the prominence of features in offline handwritten Gurmukhi character recognition, we recognize offline handwritten Gurmukhi characters with different combinations of features and classifiers. The recognition system first computes a skeleton of the character so that significant feature information can be extracted. For classification, we use k-NN, linear-SVM, polynomial-SVM and RBF-SVM based approaches. In the present work, we collected 7,000 samples of isolated offline handwritten Gurmukhi characters from 200 different writers, covering the set of 35 basic akhars of Gurmukhi. A partitioning policy for selecting the training and testing patterns has also been investigated. To build the feature set for a given character, we use zoning features; diagonal features; directional features; intersection and open end point features; transition features; and parabola and power curve fitting based features. The proposed system achieves a recognition accuracy of 94.8% without PCA and 97.7% with PCA.
Munish Kumar, R. Sharma, M. Jindal. "Offline handwritten Gurmukhi character recognition: study of different feature-classifier combinations." DAR '12, 2012. doi:10.1145/2432553.2432571
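The PCA-plus-k-NN portion of such a pipeline can be sketched with NumPy alone, as below. The feature dimensionality, number of components and the value of k are illustrative assumptions; the SVM variants and the specific Gurmukhi feature extractors are omitted.

```python
import numpy as np

def pca_fit(X, n_components):
    """Fit PCA on training features: center the data, then keep the top
    eigenvectors of the covariance matrix as the projection basis."""
    mean = X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(X - mean, rowvar=False))
    order = np.argsort(vals)[::-1][:n_components]  # descending eigenvalues
    return mean, vecs[:, order]

def pca_transform(X, mean, components):
    """Project (already fitted) features onto the principal components."""
    return (X - mean) @ components

def knn_predict(train_X, train_y, x, k=3):
    """Plain k-NN majority vote in the (possibly PCA-reduced) space."""
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(np.asarray(train_y)[nearest],
                               return_counts=True)
    return labels[np.argmax(counts)]
```

Projecting onto a handful of principal components before classification is one common way a PCA step can lift k-NN accuracy: it discards low-variance directions that mostly carry writer-specific noise.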