{"title":"Simultaneous Layout Style and Logical Entity Recognition in a Heterogeneous Collection of Documents","authors":"Siyuan Chen, Song Mao, G. Thoma","doi":"10.1109/ICDAR.2007.231","DOIUrl":null,"url":null,"abstract":"Logical entity recognition in heterogeneous collections of document page images remains a challenging problem since the performance of traditional supervised methods degrades dramatically in case of many distinct layout styles. In this paper we present an unsupervised method where layout style information is explicitly used in both training and recognition phases. We represent the layout style, local features, and logical labels of physical regions of a document compactly by an ordered labeled X-Y tree. Style dissimilarity of two document pages is represented by the distance between their respective trees. During the training phase, document pages with true logical labels in training set are classified into distinct layout styles by unsupervised clustering. During the recognition phase, the layout style and logical entities of an input document are recognized simultaneously by matching the input tree to the trees in closest- matched layout style cluster of training set. Experimental results show that our algorithm is robust with both balanced and unbalanced style cluster sizes, zone over-segmentation, zone length variation, and variation in tree representations of the same layout style.","PeriodicalId":279268,"journal":{"name":"Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)","volume":"130 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2007-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDAR.2007.231","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10
Abstract
Logical entity recognition in heterogeneous collections of document page images remains a challenging problem since the performance of traditional supervised methods degrades dramatically in case of many distinct layout styles. In this paper we present an unsupervised method where layout style information is explicitly used in both training and recognition phases. We represent the layout style, local features, and logical labels of physical regions of a document compactly by an ordered labeled X-Y tree. Style dissimilarity of two document pages is represented by the distance between their respective trees. During the training phase, document pages with true logical labels in training set are classified into distinct layout styles by unsupervised clustering. During the recognition phase, the layout style and logical entities of an input document are recognized simultaneously by matching the input tree to the trees in closest- matched layout style cluster of training set. Experimental results show that our algorithm is robust with both balanced and unbalanced style cluster sizes, zone over-segmentation, zone length variation, and variation in tree representations of the same layout style.