Effectiveness of Visual Features on Diverse Reading Orders for Information Extraction
S. Bhat, D. Adiga, M. Shah, Viveka Vyeth
2019 IEEE 5th International Conference on Computer and Communications (ICCC), pp. 1759-1764, published 2019-12-01
DOI: 10.1109/ICCC47050.2019.9064294 (https://doi.org/10.1109/ICCC47050.2019.9064294)
Citations: 0
Abstract
Information extraction from unstructured documents, which are meant only for human readers, must be handled differently from extraction from structured documents. Unstructured documents include visual cues that draw human attention and convey most of the information to readers. Several recent advances in information extraction from such documents rely on conventional natural language processing methods. However, little to no work has used the non-sequential relationships found only in unstructured documents for information extraction. In this study, we propose novel methods to capture these non-sequential relationships for Named Entity Recognition (NER) using a Conditional Random Field (CRF). We experiment with two datasets that have different types of logical reading order, and we compare three sets of features. The NER model that uses the proposed features achieves mean F1 scores of 68.15% on Retail Receipt documents and 85.54% on Air Ticket documents.
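The abstract does not spell out the proposed features, so the following is only an illustrative sketch of what a non-sequential, layout-based feature for a CRF tagger could look like. It assumes each token comes with an OCR bounding box; the `Token` class, the `nearest_above` helper, and the feature names are hypothetical and are not taken from the paper. Alongside the usual sequential feature (the previous token in reading order), each token also receives the text of the token directly above it on the page, which is one way a column header can inform a value that is far away in the token sequence.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:
    """Hypothetical OCR token: text plus its bounding box on the page."""
    text: str
    x: float  # left edge
    y: float  # top edge (y grows downward)
    w: float  # width
    h: float  # height


def nearest_above(tokens: List[Token], i: int) -> Optional[Token]:
    """Closest token that lies above token i and overlaps it horizontally."""
    cur = tokens[i]
    best, best_gap = None, float("inf")
    for j, other in enumerate(tokens):
        if j == i:
            continue
        overlaps = other.x < cur.x + cur.w and cur.x < other.x + other.w
        gap = cur.y - (other.y + other.h)  # vertical distance to the token above
        if overlaps and 0 <= gap < best_gap:
            best, best_gap = other, gap
    return best


def token_features(tokens: List[Token], i: int) -> Dict[str, object]:
    """Feature dict for token i: lexical, sequential, and one visual feature."""
    cur = tokens[i]
    feats: Dict[str, object] = {
        "word.lower": cur.text.lower(),
        "word.isnumber": cur.text.replace(".", "", 1).isdigit(),
    }
    if i > 0:
        # Sequential neighbour in reading order.
        feats["prev.lower"] = tokens[i - 1].text.lower()
    above = nearest_above(tokens, i)
    if above is not None:
        # Non-sequential neighbour: the token directly above on the page.
        feats["above.lower"] = above.text.lower()
    return feats


# Toy two-column receipt, listed in left-to-right, top-to-bottom order.
receipt = [
    Token("Item", 0, 0, 40, 10),
    Token("Price", 100, 0, 40, 10),
    Token("Coffee", 0, 20, 50, 10),
    Token("3.50", 100, 20, 40, 10),
]
features = [token_features(receipt, i) for i in range(len(receipt))]
```

In this toy layout the sequential neighbour of `3.50` is `Coffee`, while the visual neighbour is the column header `Price`. Per-token dictionaries of this shape are a common input format for CRF toolkits, so a list like `features` could be passed to a CRF trainer together with the entity labels.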