Multi-Oriented English Text Line Extraction Using Background and Foreground Information

2008 The Eighth IAPR International Workshop on Document Analysis Systems | Pub Date: 2008-09-16 | DOI: 10.1109/DAS.2008.83

P. Roy, U. Pal, J. Lladós, F. Kimura

Citations: 19

Abstract

In graphical documents (maps, engineering drawings), artistic documents, etc., there exist many printed materials where text lines are not parallel to each other but are multi-oriented and curved in nature. For the OCR of such documents, individual text lines must first be extracted. Extracting individual text lines from multi-oriented and/or curved text documents is a difficult problem. In this paper, we propose a novel method to extract individual text lines from such document pages, based on the foreground and background information of the characters of the text. To capture background information, the water reservoir concept is used. In the proposed scheme, individual components are first detected and grouped into 3-character clusters using their inter-component distance, size, and positional information. Applying graph concepts, the initial 3-character clusters are merged into larger cluster groups. Using inter-character background information, the orientations of the extreme characters of a larger cluster are determined and, based on these orientations, two candidate regions are formed from the cluster. Finally, with the help of these candidate regions, individual lines are extracted. The experiments yielded encouraging results.
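The clustering and merging steps described in the abstract can be illustrated with a rough sketch. The abstract gives no thresholds or exact criteria, so everything below is an assumption made for illustration: the `Component` fields, the helper names, the distance and size factors, the collinearity test used as the "positional" check, and the rule that two clusters are linked when they share at least two components. It is not the authors' implementation, and it omits the water reservoir and candidate-region stages entirely.

```python
from dataclasses import dataclass
from itertools import combinations
from math import hypot


@dataclass(frozen=True)
class Component:
    """A connected component, reduced to centroid and height (hypothetical)."""
    cx: float
    cy: float
    height: float


def are_neighbours(a, b, dist_factor=1.5, size_ratio=2.0):
    # Assumed rule: similar size and centroids close relative to the size.
    big, small = max(a.height, b.height), min(a.height, b.height)
    if big > size_ratio * small:
        return False
    return hypot(a.cx - b.cx, a.cy - b.cy) <= dist_factor * big


def is_roughly_collinear(a, b, c, tol=0.25):
    # Assumed positional check: b lies close to the line through a and c.
    span = hypot(c.cx - a.cx, c.cy - a.cy)
    if span == 0:
        return False
    cross = abs((c.cx - a.cx) * (b.cy - a.cy) - (c.cy - a.cy) * (b.cx - a.cx))
    return cross / span <= tol * b.height


def three_char_clusters(comps):
    """Return index triples (left, middle, right) that pass the assumed tests."""
    clusters = []
    for i, j, k in combinations(range(len(comps)), 3):
        # Try each of the three components as the middle one.
        for left, mid, right in ((i, j, k), (j, i, k), (i, k, j)):
            a, b, c = comps[left], comps[mid], comps[right]
            if (are_neighbours(a, b) and are_neighbours(b, c)
                    and is_roughly_collinear(a, b, c)):
                clusters.append((left, mid, right))
                break
    return clusters


def merge_clusters(clusters, min_shared=2):
    """Merge clusters via connected components of a cluster graph.

    Assumption: two clusters are linked when they share at least
    `min_shared` components.
    """
    n = len(clusters)
    adj = {i: [] for i in range(n)}
    for i, j in combinations(range(n), 2):
        if len(set(clusters[i]) & set(clusters[j])) >= min_shared:
            adj[i].append(j)
            adj[j].append(i)

    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], set()
        while stack:  # depth-first traversal of one connected component
            node = stack.pop()
            members.update(clusters[node])
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(members))
    return groups
```

Calling `merge_clusters(three_char_clusters(comps))` on a list of `Component` objects returns lists of component indices, one list per merged group. Linking clusters only when they share two components is a deliberately conservative choice in this sketch, so that two text lines crossing at a single character are less likely to be merged.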
