A fast and efficient method for document segmentation for OCR
B. Kruatrachue, P. Suthaphan
Proceedings of IEEE Region 10 International Conference on Electrical and Electronic Technology. TENCON 2001 (Cat. No.01CH37239), 2001-08-19
DOI: 10.1109/TENCON.2001.949618 (https://doi.org/10.1109/TENCON.2001.949618)
Citations: 12
Abstract
This paper describes a fast and efficient method for page segmentation of documents that contain nonrectangular blocks. The method is based on a mixed top-down and bottom-up approach to document analysis. Segmentation starts from column blocks (paragraphs) extracted by a modified edge-following algorithm: instead of a single pixel, the algorithm operates on a window of 32 by 32 pixels, so that it extracts a whole paragraph rather than a single character. The document is scanned at 300 dpi, and a block extracted this way may contain more than one column. The characters in each block are then extracted with the edge-following algorithm, and their boundaries are used to detect multicolumn cases (bottom-up). Because block extraction scans only the border pixels of paragraphs, and because characters have to be extracted for the OCR process anyway, the algorithm is faster and incurs less overhead than algorithms that must access every pixel of a document.
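
The window-based block extraction can be illustrated with a minimal sketch. The abstract does not specify how the edge-following algorithm is modified, so the code below substitutes plain Moore-neighbour boundary tracing on a grid of windows. It is an assumption-laden illustration, not the authors' implementation. The image is assumed to be a binary list of lists (1 = ink), and all function names (`make_window_test`, `find_start`, `trace_boundary`, `extract_block`) are hypothetical. Windows are tested for ink lazily, only when the tracer visits them.

```python
from typing import Callable, List, Optional, Tuple

Cell = Tuple[int, int]  # (window_row, window_col)

# Moore neighbourhood in clockwise order (rows grow downward):
# W, NW, N, NE, E, SE, S, SW.
OFFSETS = [(0, -1), (-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1)]


def make_window_test(image: List[List[int]], win: int = 32):
    """Return (occupied, rows, cols); occupied(cell) is a cached ink test."""
    h, w = len(image), len(image[0])
    rows, cols = -(-h // win), -(-w // win)  # ceiling division
    cache = {}

    def occupied(cell: Cell) -> bool:
        r, c = cell
        if not (0 <= r < rows and 0 <= c < cols):
            return False  # outside the page counts as background
        if cell not in cache:
            cache[cell] = any(
                any(row[c * win:(c + 1) * win])
                for row in image[r * win:(r + 1) * win]
            )
        return cache[cell]

    return occupied, rows, cols


def find_start(occupied: Callable[[Cell], bool], rows: int, cols: int) -> Optional[Cell]:
    """Raster-scan the windows and return the first one containing ink."""
    for r in range(rows):
        for c in range(cols):
            if occupied((r, c)):
                return (r, c)
    return None


def trace_boundary(occupied: Callable[[Cell], bool], start: Cell) -> List[Cell]:
    """Follow the outer boundary of the block containing `start`, clockwise."""
    boundary = [start]
    cur, back = start, 0  # back: direction from cur to a known-empty neighbour (W)
    seen = {(cur, back)}
    while True:
        for i in range(1, 8):  # sweep clockwise, starting just after the backtrack
            d = (back + i) % 8
            nxt = (cur[0] + OFFSETS[d][0], cur[1] + OFFSETS[d][1])
            if occupied(nxt):
                break
        else:
            return boundary  # isolated window: no occupied neighbour
        # The neighbour examined just before nxt is empty: new backtrack cell.
        p = OFFSETS[(d - 1) % 8]
        prev = (cur[0] + p[0], cur[1] + p[1])
        back = OFFSETS.index((prev[0] - nxt[0], prev[1] - nxt[1]))
        cur = nxt
        if (cur, back) in seen:  # same cell entered the same way: contour closed
            return boundary
        seen.add((cur, back))
        boundary.append(cur)


def extract_block(image: List[List[int]], win: int = 32):
    """Return (boundary_cells, (x0, y0, x1, y1)) for the first block, or None."""
    occupied, rows, cols = make_window_test(image, win)
    start = find_start(occupied, rows, cols)
    if start is None:
        return None
    boundary = trace_boundary(occupied, start)
    rs = [r for r, _ in boundary]
    cs = [c for _, c in boundary]
    h, w = len(image), len(image[0])
    bbox = (min(cs) * win, min(rs) * win,
            min(w, (max(cs) + 1) * win), min(h, (max(rs) + 1) * win))
    return boundary, bbox
```

Because the boundary is traced on windows rather than pixels, gaps between characters and lines smaller than one window are bridged, which is why the traced region corresponds to a paragraph-sized block and may be nonrectangular. The window size is left as a parameter; the abstract states 32 by 32 pixels for 300 dpi scans.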