Extraction of line-word-character segments directly from run-length compressed printed text-documents

M. Javed, P. Nagabhushan, B. B. Chaudhuri
Published in: 2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG)
Publication date: 2013-12-01
DOI: 10.1109/NCVPRIPG.2013.6776195 (https://doi.org/10.1109/NCVPRIPG.2013.6776195)
Citations: 30

Abstract

Segmentation of a text document into lines, words and characters, a crucial preprocessing stage in Optical Character Recognition (OCR), is traditionally carried out on uncompressed documents, although most real-world documents are stored in compressed form for transmission and storage efficiency. Working on them therefore requires decompressing the image first, which demands additional computing resources. This limitation has motivated us to take up research in document image analysis directly on compressed documents. In this paper, we propose a new approach to line, word and character segmentation in run-length compressed printed text documents. We extract the horizontal projection profile curve from the compressed file and perform line segmentation using its local minima. Tracing vertical information, which is needed to track words and characters, is not straightforward in a run-length compressed file. We therefore propose a novel technique for simultaneous word and character segmentation that pops out column runs from each row in an intelligent sequence. The proposed algorithms have been validated with 1101 text lines, 1409 words and 7582 characters from a dataset of 35 noise-free and skew-free compressed documents in Bengali, Kannada and English scripts.
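The idea of working on runs instead of pixels can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions not stated in the abstract: each image row is stored as a list of alternating run lengths that starts with a white (background) run; text lines are separated by rows with no black pixels, which is the simplest case of a local minimum in a noise-free and skew-free page; and words are separated from characters by a hypothetical `word_gap` threshold. The column step merges black intervals across rows, which is a simpler stand-in for the run-popping sequence mentioned in the abstract. All function names are illustrative.

```python
def horizontal_profile(rows):
    """Black-pixel count per row, read directly from the runs.

    Assumed format: [white, black, white, black, ...] run lengths per row.
    """
    return [sum(row[1::2]) for row in rows]


def segment_lines(profile):
    """Return (first_row, last_row) pairs for maximal bands of non-empty rows.

    Simplification: only zero-valued valleys are treated as line boundaries.
    """
    lines = []
    start = None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y
        elif count == 0 and start is not None:
            lines.append((start, y - 1))
            start = None
    if start is not None:
        lines.append((start, len(profile) - 1))
    return lines


def black_intervals(row):
    """Convert one row of runs into inclusive (first_col, last_col) black spans."""
    spans = []
    x = 0
    for i, run in enumerate(row):
        if i % 2 == 1 and run > 0:
            spans.append((x, x + run - 1))
        x += run
    return spans


def column_segments(rows):
    """Merge black spans of all rows in a text line into column segments.

    Columns not covered by any segment are white in every row of the line.
    """
    spans = sorted(span for row in rows for span in black_intervals(row))
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def group_words(segments, word_gap):
    """Group column segments into words; gaps of word_gap columns or more split words."""
    words = []
    for seg in segments:
        if words and seg[0] - words[-1][-1][1] - 1 < word_gap:
            words[-1].append(seg)
        else:
            words.append([seg])
    return words


# Toy 5 x 10 image in the assumed run-length format.
rows = [
    [10],
    [1, 2, 1, 2, 4],
    [1, 2, 1, 1, 3, 2],
    [10],
    [0, 3, 7],
]
profile = horizontal_profile(rows)
lines = segment_lines(profile)
first_line = rows[lines[0][0]:lines[0][1] + 1]
words = group_words(column_segments(first_line), word_gap=2)
```

In this sketch every step reads run lengths only, so the image is never expanded to pixels. On real pages the line step would need a proper local-minimum search on the profile rather than a test for zero, and `word_gap` would have to be estimated from the gap distribution of each line.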