A Data Mining Approach to Reading Order Detection

Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)

Publication date: 2007-09-23

DOI: 10.1109/ICDAR.2007.4377050

Michelangelo Ceci, Margherita Berardi, G. Porcelli, D. Malerba

Citations: 12

Abstract

Determining the reading order of layout components extracted from a document image can be crucial for several applications. It enables the reconstruction of a single textual element from the texts associated with multiple layout components and makes both information extraction and content-based document retrieval more effective. The methods reported in the literature share a common limitation: they depend strongly on the specific domain and are scarcely reusable when the classes of documents or the task at hand change. In this paper, we investigate the problem of detecting the reading order of layout components with a data mining approach that acquires the domain-specific knowledge from a set of training examples. The input to the learning method is a description of the "chains" of layout components defined by the user. Only spatial information is used to describe a chain, so the proposed approach is also applicable when no text can be associated with a layout component. The method induces a probabilistic classifier based on the Bayesian framework, which is used to reconstruct either single or multiple chains of layout components. It has been evaluated on a set of document images.
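The abstract does not specify the classifier's features or the reconstruction procedure, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a naive Bayes model with Laplace smoothing over three invented spatial features of a block pair, trained on user-defined chains, and a greedy procedure that rebuilds a single chain from a given start block. All names (`spatial_features`, `PairwiseNaiveBayes`, `reconstruct_chain`), the bounding-box representation, and the "near" threshold are hypothetical, and multiple chains are not handled.

```python
from collections import Counter, defaultdict
from math import hypot, log

# Hypothetical representation: a layout component is a bounding box
# (x0, y0, x1, y1) with y growing downward. No text content is used.
FEATURE_ARITY = (3, 3, 2)  # number of possible values of each feature below


def centre(box):
    return (box[0] + box[2]) / 2, (box[1] + box[3]) / 2


def spatial_features(a, b):
    """Discretised position of block b relative to block a (illustrative only)."""
    (ax, ay), (bx, by) = centre(a), centre(b)
    horiz = "right" if bx > ax else "left" if bx < ax else "same"
    vert = "below" if by > ay else "above" if by < ay else "same"
    # "near" means within three block heights; the factor 3 is an arbitrary choice.
    near = "near" if hypot(bx - ax, by - ay) <= 3 * (a[3] - a[1]) else "far"
    return horiz, vert, near


class PairwiseNaiveBayes:
    """Naive Bayes estimate of whether block b directly follows block a."""

    def __init__(self):
        self.class_counts = Counter()
        self.feat_counts = defaultdict(Counter)  # (label, feature index) -> value counts

    def fit(self, pages):
        # pages: list of (blocks, chain), where chain lists block indices in reading order.
        for blocks, chain in pages:
            successors = set(zip(chain, chain[1:]))
            for i, a in enumerate(blocks):
                for j, b in enumerate(blocks):
                    if i == j:
                        continue
                    label = (i, j) in successors
                    self.class_counts[label] += 1
                    for k, value in enumerate(spatial_features(a, b)):
                        self.feat_counts[(label, k)][value] += 1
        return self

    def log_odds(self, a, b):
        """log P(follows | features) minus log P(does not follow | features)."""
        total = sum(self.class_counts.values())
        score = {}
        for label in (True, False):
            s = log((self.class_counts[label] + 1) / (total + 2))
            for k, value in enumerate(spatial_features(a, b)):
                counts = self.feat_counts[(label, k)]
                s += log((counts[value] + 1) / (sum(counts.values()) + FEATURE_ARITY[k]))
            score[label] = s
        return score[True] - score[False]


def reconstruct_chain(model, blocks, start):
    """Greedily append the unvisited block most likely to follow the last one."""
    chain, remaining = [start], set(range(len(blocks))) - {start}
    while remaining:
        last = blocks[chain[-1]]
        nxt = max(sorted(remaining), key=lambda j: model.log_odds(last, blocks[j]))
        chain.append(nxt)
        remaining.remove(nxt)
    return chain


# Toy data: one training page with three stacked blocks read top to bottom,
# and one test page containing the same layout listed in a different order.
train_page = ([(0, 0, 10, 10), (0, 20, 10, 30), (0, 40, 10, 50)], [0, 1, 2])
test_blocks = [(0, 40, 10, 50), (0, 0, 10, 10), (0, 20, 10, 30)]

model = PairwiseNaiveBayes().fit([train_page])
order = reconstruct_chain(model, test_blocks, start=1)
print(order)
```

Scoring ordered pairs rather than whole permutations keeps training simple, since every user-defined chain yields positive examples (consecutive blocks) and negative examples (all other ordered pairs). The greedy step is one possible way to turn pairwise scores into a chain; it can make locally good but globally poor choices on complex layouts.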
