Automatic classification of defect page content in scanned document collections

R. Huber-Mörk, Alexander Schindler
Published in: 2013 8th International Symposium on Image and Signal Processing and Analysis (ISPA), 2013-09-01
DOI: 10.1109/ISPA.2013.6703735 (https://doi.org/10.1109/ISPA.2013.6703735)
Citations: 3

Abstract

We describe a method for detecting and classifying defects in collections of digital images of historical books. Undistorted text images, drawn from various books with strong variation in language, font, and layout, are discriminated from typical digitization errors such as occlusion by an operator's hand, a visible book edge, or image warping artifacts. A bag-of-local-features approach is compared with a global characterization of the location, size, and orientation of detected keypoints. Machine learning is used to discriminate between these classes. Results for the different features are compared on the task of discriminating undistorted text from the major distortion class, the presence of the operator's hand; here, features based on histograms derived from the bag of local features achieved a cross-validation accuracy better than 99 percent on a representative data set. Taking up to three classes of distortion into account still yielded cross-validation accuracies above 90 percent when visual histograms derived from the bag of local features were used as classifier input.
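
As an illustration of how a bag-of-local-features histogram can be computed, the following is a minimal sketch and not the authors' implementation. The abstract does not name the keypoint detector, descriptor, codebook size, or classifier, so everything here is an assumption: local descriptors are simulated with random vectors, the codebook is built with plain k-means, and the names `build_codebook` and `bof_histogram` are hypothetical.

```python
import numpy as np


def build_codebook(descriptors, k, n_iter=20, seed=0):
    """Cluster local descriptors with plain k-means (Lloyd's algorithm).

    Returns a (k, d) array of centroids, i.e. the "visual words".
    """
    rng = np.random.default_rng(seed)
    start = rng.choice(len(descriptors), size=k, replace=False)
    centroids = descriptors[start].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every descriptor to every centroid.
        d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):  # leave a centroid unchanged if its cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids


def bof_histogram(descriptors, centroids):
    """Assign each descriptor to its nearest visual word and count occurrences.

    Returns an L1-normalized histogram of length k, usable as classifier input.
    """
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centroids)).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist


# Synthetic stand-ins for descriptors extracted from two page images
# (assumption: real descriptors would come from a keypoint detector).
rng = np.random.default_rng(0)
text_page = rng.normal(0.0, 1.0, size=(200, 16))
hand_page = rng.normal(3.0, 1.0, size=(200, 16))

codebook = build_codebook(np.vstack([text_page, hand_page]), k=8)
h_text = bof_histogram(text_page, codebook)
h_hand = bof_histogram(hand_page, codebook)
```

Each page is reduced to one fixed-length vector regardless of how many keypoints it contains, which is what lets a standard classifier be trained and cross-validated on pages with very different layouts.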