A Data Mining Approach to Reading Order Detection

Michelangelo Ceci, Margherita Berardi, G. Porcelli, D. Malerba
Published in: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), 2007-09-23
DOI: 10.1109/ICDAR.2007.4377050 (https://doi.org/10.1109/ICDAR.2007.4377050)
Citations: 12

Abstract

Determining the reading order of layout components extracted from a document image can be a crucial problem for several applications. It enables the reconstruction of a single textual element from the texts associated with multiple layout components, and it makes both information extraction and content-based retrieval of documents more effective. A common trait of the methods reported in the literature is that they depend strongly on the specific domain and are scarcely reusable when the classes of documents or the task at hand change. In this paper, we investigate the problem of detecting the reading order of layout components by resorting to a data mining approach that acquires the domain-specific knowledge from a set of training examples. The input to the learning method is the description of the "chains" of layout components defined by the user. Only spatial information is exploited to describe a chain, which makes the proposed approach applicable also to cases in which no text can be associated with a layout component. The method induces a probabilistic classifier, based on the Bayesian framework, which is used to reconstruct either single or multiple chains of layout components. It has been evaluated on a set of document images.
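
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a naive Bayes classifier over a few hand-picked boolean spatial relations between pairs of blocks, trained on one user-defined chain, followed by a greedy reconstruction of a single chain. The names `Block`, `pair_features`, `PairNaiveBayes`, and `reconstruct_chain`, the feature set, and the greedy search are all assumptions made for illustration; reconstructing multiple chains is not covered.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Block:
    """A layout component described only by its bounding box (hypothetical type)."""
    name: str
    x: float  # left edge
    y: float  # top edge (y grows downward)
    w: float
    h: float


def pair_features(a, b):
    """Boolean spatial relations of b relative to a (an illustrative choice)."""
    below = b.y >= a.y + a.h
    above = b.y + b.h <= a.y
    right = b.x >= a.x + a.w
    overlap_x = min(a.x + a.w, b.x + b.w) > max(a.x, b.x)
    return (below, above, right, overlap_x)


class PairNaiveBayes:
    """Naive Bayes over ordered block pairs: does b directly follow a?"""

    def __init__(self):
        self.class_counts = {True: 0, False: 0}
        self.feat_counts = defaultdict(int)  # (label, feature index, value) -> count

    def fit(self, pages):
        # pages: list of (blocks, chain); chain is the user-defined reading order.
        for blocks, chain in pages:
            successors = set(zip(chain, chain[1:]))
            for a, b in permutations(blocks, 2):
                label = (a, b) in successors
                self.class_counts[label] += 1
                for i, v in enumerate(pair_features(a, b)):
                    self.feat_counts[(label, i, v)] += 1
        return self

    def log_odds(self, a, b):
        # log P(follows | features) - log P(not follows | features),
        # with Laplace smoothing on the prior and on each boolean feature.
        total = sum(self.class_counts.values())
        score = {}
        for label in (True, False):
            n = self.class_counts[label]
            s = math.log((n + 1) / (total + 2))
            for i, v in enumerate(pair_features(a, b)):
                s += math.log((self.feat_counts[(label, i, v)] + 1) / (n + 2))
            score[label] = s
        return score[True] - score[False]


def reconstruct_chain(model, blocks):
    """Greedy single-chain reconstruction, starting from the top-left block."""
    remaining = list(blocks)
    current = min(remaining, key=lambda blk: (blk.y, blk.x))
    chain = [current]
    remaining.remove(current)
    while remaining:
        current = max(remaining, key=lambda blk: model.log_odds(chain[-1], blk))
        chain.append(current)
        remaining.remove(current)
    return chain


# Training page: two columns, read top to bottom, left column first.
t = [Block("A", 0, 0, 10, 10), Block("B", 0, 20, 10, 10),
     Block("C", 20, 0, 10, 10), Block("D", 20, 20, 10, 10)]
model = PairNaiveBayes().fit([(t, t)])

# Unseen page with the same kind of layout, blocks given in arbitrary order.
page = [Block("D", 60, 120, 40, 100), Block("C", 60, 5, 40, 100),
        Block("B", 5, 120, 40, 100), Block("A", 5, 5, 40, 100)]
order = [blk.name for blk in reconstruct_chain(model, page)]
print(order)
```

Because the features are purely spatial, the sketch never looks at text content, which mirrors the abstract's point that the approach remains applicable to components with no associated text. A greedy search is the simplest way to turn pairwise scores into an order; it is used here for brevity and may not match the reconstruction strategy in the paper.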