
MOCR '13: Latest Publications

Recognition of offline handwritten numerals using an ensemble of MLPs combined by Adaboost
Pub Date : 2013-08-24 DOI: 10.1145/2505377.2505380
Tarun Jindal, U. Bhattacharya
In this article, we present our recent study of offline recognition of handwritten numerals in three Indian scripts: Devanagari, Bangla and Oriya. We propose a novel approach to combining multiple MLP classifiers with varying numbers of hidden nodes, based on the AdaBoost technique. In this recognition study, we used Zernike moment features of different orders. We obtained classification results for a number of orders of this moment function, and the best classification result for each script was obtained when the feature vector consists of moment values up to order 8. It is well known that the classification performance of an MLP depends largely on the choice of the number of hidden nodes. In the present work, we studied the use of boosting as a solution to this problem of using an MLP as a classifier in real-life applications. We use an ensemble of MLP classifiers having different hidden layer sizes, and their classification results are combined based on the AdaBoost technique. Classification results are reported on publicly available databases [1] of offline handwritten numeral images of the three Indian scripts.
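The boosting step described above can be sketched with a minimal SAMME-style weight update (an illustrative Python sketch, not the authors' implementation; the function name and data layout are assumptions):

```python
import math

def adaboost_round(sample_weights, correct, n_classes):
    """One SAMME boosting round: compute the classifier's vote weight
    (alpha) from its weighted error and re-weight the samples.
    `correct[i]` is True if the current MLP classified sample i correctly."""
    total = sum(sample_weights)
    err = sum(w for w, c in zip(sample_weights, correct) if not c) / total
    err = min(max(err, 1e-12), 1 - 1e-12)  # guard against degenerate errors
    # Vote weight: better-than-chance classifiers get a positive alpha.
    alpha = math.log((1.0 - err) / err) + math.log(n_classes - 1)
    # Increase the weight of misclassified samples for the next MLP.
    new_weights = [w * math.exp(alpha) if not c else w
                   for w, c in zip(sample_weights, correct)]
    norm = sum(new_weights)
    return alpha, [w / norm for w in new_weights]
```

Each MLP in the ensemble (one per hidden-layer size) would be trained in turn on the re-weighted samples, and the ensemble prediction is the class with the largest sum of alpha-weighted votes.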
Citations: 16
Can we build language-independent OCR using LSTM networks?
Pub Date : 2013-08-24 DOI: 10.1145/2505377.2505394
A. Ul-Hasan, T. Breuel
Language models or recognition dictionaries are usually considered an essential step in OCR. However, using a language model complicates the training of OCR systems, and it also narrows the range of texts an OCR system can be used with. Recent results have shown that Long Short-Term Memory (LSTM) based OCR yields low error rates even without language modeling. In this paper, we explore to what extent LSTM models can be used for multilingual OCR without the use of language models. To do this, we measure the cross-language performance of LSTM models trained on different languages. LSTM models show good promise for language-independent OCR: the recognition errors are very low (around 1%) without using any language model or dictionary correction.
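The roughly 1% recognition error cited above is a character error rate; a minimal sketch of how such a figure is computed (standard edit-distance CER, not tied to the authors' evaluation scripts):

```python
def char_error_rate(reference, hypothesis):
    """Character error rate: edit distance between the OCR output and
    the ground truth, normalised by the reference length."""
    m, n = len(reference), len(hypothesis)
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n] / m if m else 0.0
```

A CER of about 0.01 means roughly one substitution, insertion, or deletion per hundred ground-truth characters.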
Citations: 39
Recognition of Nastalique Urdu ligatures
Pub Date : 2013-08-24 DOI: 10.1145/2505377.2505379
Gurpreet Singh Lehal, Ankur Rana
There has been considerable work on Arabic OCR. However, all of that work is based on the Naskh style. Urdu script is based on the Arabic alphabet but uses the Nastalique style. The Nastalique style makes OCR in general, and character segmentation in particular, a highly challenging task, so most researchers avoid the character segmentation phase and opt for a higher unit of recognition. For Urdu, the next higher recognition unit considered by researchers is the ligature, which lies between the character and the word. A ligature is a connected component of one or more characters, and an Urdu word is usually composed of 1 to 8 ligatures. There are more than 25,000 Urdu ligatures, of which the top 4567 account for 99% coverage. From an OCR point of view, a ligature can be further segmented into one primary connected component and zero or more secondary connected components. The primary component represents the basic shape of the ligature, while the secondary connected components correspond to the dots, diacritic marks and special symbols associated with the ligature. To reduce the class count, ligatures with similar primary components are clubbed together. In this paper, we present a system to recognize 9262 ligatures formed from 2190 primary and 17 secondary components. Various combinations of DCT, Gabor filter and zoning-based features along with kNN, HMM and SVM classifiers have been tried, and a recognition accuracy of 98% is reported on pre-segmented ligatures.
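The clubbing of ligatures by primary component can be illustrated with a small sketch (representing each ligature as a `(primary, secondaries)` pair is a hypothetical data layout, not the paper's):

```python
from collections import defaultdict

def group_by_primary(ligatures):
    """Club ligatures that share the same primary connected component
    into one recognition class; the secondary components (dots,
    diacritics) then disambiguate within a group.
    `ligatures` maps a ligature id to (primary_id, secondary_ids)."""
    groups = defaultdict(list)
    for lig, (primary, _secondaries) in ligatures.items():
        groups[primary].append(lig)
    return dict(groups)
```

This is how 9262 ligature classes can collapse onto only 2190 primary-component classes for the main shape classifier.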
Citations: 28
Low resolution Arabic recognition with multidimensional recurrent neural networks
Pub Date : 2013-08-24 DOI: 10.1145/2505377.2505385
Sheikh Faisal Rashid, M. Schambach, J. Rottland, Stephan von der Nüll
OCR of multi-font Arabic text is difficult due to large variations in character shapes from one font to another. It becomes even more challenging if the text is rendered at very low resolution. This paper describes a multi-font, low-resolution, open-vocabulary OCR system based on a multidimensional recurrent neural network architecture. For this work, we developed systems trained on single-font/single-size, single-font/multi-size, and multi-font/multi-size data from the well-known Arabic printed text image database (APTI). The evaluation tasks from the second Arabic text recognition competition, organized in conjunction with ICDAR 2013, have been adopted. Ten Arabic fonts in six font-size categories are used for evaluation. Results show that the proposed method performs very well on printed Arabic text recognition, even for very low-resolution and small-font-size images. Overall, the system yields above 99% recognition accuracy at the character and word level for most of the printed Arabic fonts.
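As a rough illustration of feeding 2D glyph images to a sequence recogniser, the simplest reduction treats each pixel column as one timestep (a genuine MDLSTM instead scans the image along multiple directions; this sketch only shows the 2D-to-sequence idea, and the column ordering would be reversed for right-to-left Arabic):

```python
def columns_as_sequence(image):
    """Treat a 2D glyph image (list of rows) as a left-to-right sequence
    of column vectors, the input form a 1D sequence recogniser expects."""
    height = len(image)
    width = len(image[0])
    return [[image[r][c] for r in range(height)] for c in range(width)]
```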
Citations: 35
A robust table registration method for batch table OCR processing
Pub Date : 2013-08-24 DOI: 10.1145/2505377.2505383
Jinyu Zuo, Esin Darici
A robust table registration method is proposed in this paper for a better understanding of structured information in scanned table images. Scanned images can be heavily degraded by scanning effects, binarization, or the quality of the document itself. For batch processing of images sharing the same table structure, the table model is normally provided and can be used to overcome the most challenging quality factors. The given table model is used as the ground truth in this paper. However, only rough precision is needed on table cell dimensions, which makes providing the table model an easier task. The method was tested on Multilingual Automatic Document Classification Analysis and Translation (MADCAT) images and achieves promising performance.
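The tolerance-based matching implied by "only rough precision is needed" can be sketched as a greedy nearest-ruling assignment (illustrative only; the paper's registration method is not specified at this level of detail):

```python
def register_rulings(model_positions, detected_positions, tolerance):
    """Greedily match each table-model ruling (e.g. a horizontal line's
    y-coordinate) to the nearest detected ruling within `tolerance`
    pixels. Returns {model_position: detected_position or None}."""
    remaining = sorted(detected_positions)
    matches = {}
    for pos in sorted(model_positions):
        best = min(remaining, key=lambda d: abs(d - pos), default=None)
        if best is not None and abs(best - pos) <= tolerance:
            matches[pos] = best
            remaining.remove(best)  # each detected ruling matched once
        else:
            matches[pos] = None     # ruling lost to degradation
    return matches
```

Unmatched model rulings can then be interpolated from their matched neighbours, which is what makes the registration robust to missing or broken lines.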
Citations: 2
Word level script recognition for Uighur document mixed with English script
Pub Date : 2013-08-24 DOI: 10.1145/2505377.2505387
H. Ye, Liangrui Peng
Script recognition is one of the key technologies in Uighur OCR research, as it is common to find English words or sentences in Uighur documents, especially scientific ones. A word-level script recognition method is presented in this paper. The original Uighur text images are segmented into text lines, and the text line images are then segmented into word-level images. Features are extracted from sub-blocks of the word-level images. Two features, the edge-hinge feature and the Gabor feature, are introduced and compared. An SVM is adopted as the classifier and trained on labeled segmented word images. The final script recognition result is obtained by fusing the results of the sub-blocks of a segmented word image. Experiments on segmented word images and text line images demonstrate the effectiveness of the proposed method.
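One simple way to fuse the per-sub-block decisions is a majority vote, sketched below (the paper does not state its exact fusion rule, so voting is an assumption):

```python
from collections import Counter

def fuse_subblock_labels(subblock_labels):
    """Fuse per-sub-block script decisions (e.g. from an SVM run on each
    sub-block of a word image) by majority vote. Ties fall back to the
    label seen first, per Counter's insertion-order tie-breaking."""
    counts = Counter(subblock_labels)
    return counts.most_common(1)[0][0]
```

A refinement would weight each sub-block's vote by the SVM's decision margin rather than counting them equally.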
Citations: 1
Levenshtein distance metric based holistic handwritten word recognition
Pub Date : 2013-08-24 DOI: 10.1145/2505377.2505378
S. D. Chowdhury, U. Bhattacharya, S. K. Parui
The rapid spread of pen-based digital devices and touch-screen devices, coupled with their affordability and their capacity to take technology and the digitization of data to the grassroots, has made online handwriting recognition an active field of research. Research on online handwriting recognition for Indian scripts is particularly relevant because the challenges posed by Indian scripts differ from those of other scripts: not only is the alphabet extremely large, but the inter-class variability among several classes is very small. In this article, we introduce a limited-vocabulary online unconstrained handwritten word recognizer for Bangla (a major Indian script) based on a novel word-level feature representation. We extract three different features from a word sample and generate three event strings corresponding to these features. A distance function is formulated that uses the Levenshtein distance metric to compute the distance between two triplets of event strings representing two word samples. The nearest-neighbour scheme is used to classify the input sample. We have simulated the proposed approach on vocabularies of varying sizes, and the recognition performance is encouraging.
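The triplet distance and nearest-neighbour step can be sketched as follows (summing the three per-feature edit distances is an assumption; the abstract only states that the distance is built from the Levenshtein metric):

```python
def levenshtein(a, b):
    """Standard edit distance between two event strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def triplet_distance(t1, t2):
    """Distance between two word samples, each represented as a triplet
    of event strings: here, the sum of the per-feature edit distances."""
    return sum(levenshtein(a, b) for a, b in zip(t1, t2))

def nearest_neighbour(sample, vocabulary):
    """Classify a word sample as the nearest vocabulary entry.
    `vocabulary` maps word labels to reference triplets."""
    return min(vocabulary, key=lambda w: triplet_distance(sample, vocabulary[w]))
```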
Citations: 5
Multilingual OCR research and applications: an overview
Pub Date : 2013-08-24 DOI: 10.1145/2505377.2509977
Xujun Peng, Huaigu Cao, S. Setlur, V. Govindaraju, P. Natarajan
This paper offers an overview of current approaches to research in off-line multilingual OCR. Typically, off-line OCR systems are designed for a particular script or language. The ideal approach to multilingual OCR, however, would likely be a system that, given language-specific training data, can be re-targeted to process different languages with minimal modification. This is still an open area of research with plenty of challenges, particularly for multilingual handwriting recognition, given the added complexity of variations in writing style even within the same script. The paper presents the challenges for multilingual OCR in preprocessing, feature extraction, script identification and recognition modeling, along with a brief survey of research in these areas, and outlines ideas for future research in multilingual OCR.
Citations: 25
Text graphic separation in Indian newspapers
Pub Date : 2013-08-24 DOI: 10.1145/2505377.2505393
Ritu Garg, Anukriti Bansal, S. Chaudhury, Sumantra Dutta Roy
Digitization of newspaper articles is important for registering historical events. Layout analysis of Indian newspapers is a challenging task due to the variety of font sizes and styles and the seemingly random placement of text and non-text regions. In this paper we propose a novel framework for learning optimal parameters for text-graphic separation in the presence of complex layouts. The learning problem is formulated as an optimization problem using the EM algorithm, learning optimal parameters that depend on the nature of the document content.
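As a toy illustration of EM-based parameter learning on a single 1D feature (e.g. a stroke-density score per region; the paper's actual model and parameters are richer and not specified in the abstract):

```python
import math

def em_two_means(values, iters=50):
    """Two-component EM on a 1D feature: alternately soft-assign each
    value to two unit-variance Gaussian components and re-estimate the
    component means. The learned means could separate text-like from
    graphic-like regions; this only illustrates the EM idea."""
    mu = [min(values), max(values)]          # crude initialisation
    for _ in range(iters):
        resp = []                            # responsibility of component 1
        for v in values:
            p0 = math.exp(-0.5 * (v - mu[0]) ** 2)
            p1 = math.exp(-0.5 * (v - mu[1]) ** 2)
            resp.append(p1 / (p0 + p1))
        for k in (0, 1):
            weights = [(r if k else 1 - r) for r in resp]
            mu[k] = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return mu
```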
Citations: 3
A bilingual Gurmukhi-English OCR based on multiple script identifiers and language models
Pub Date : 2013-08-24 DOI: 10.1145/2505377.2505381
Gurpreet Singh Lehal
English words are frequently encountered in Gurmukhi texts. A monolingual Gurmukhi OCR will recognize such words as garbage, so it becomes necessary to add bilingual capability to the Gurmukhi OCR to recognize English text as well. But adding bilingual capability reduces recognition accuracy on monolingual texts because of errors in script identification: even a system with 99% script identification accuracy loses 1% recognition accuracy on monolingual text. In this paper, we present a bilingual OCR that recognizes both English and Gurmukhi scripts without any significant reduction in recognition accuracy on monolingual Gurmukhi text, as compared to the monolingual Gurmukhi OCR. This is achieved by using multiple script identification engines and language models for both the English and Gurmukhi scripts. For the first time, such a system has been developed that recognizes, with high accuracy, document images containing mixed Gurmukhi and English text or only Gurmukhi/English text.
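The 1% figure follows from simple expected-accuracy arithmetic, sketched here under the assumptions that script identification and recognition errors are independent and that every misrouted word is lost as garbage:

```python
def monolingual_accuracy(script_id_acc, ocr_acc):
    """Expected accuracy on monolingual text when a script identifier
    gates two monolingual recognisers: a word must both be routed to
    the correct script AND be recognised correctly."""
    return script_id_acc * ocr_acc
```

So even with a perfect recogniser, a 99%-accurate script identifier caps the pipeline at 99%, which is the loss the multiple-identifier design above is meant to avoid.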
Citations: 3