A case restoration approach to named entity tagging in degraded documents
R. Srihari, Cheng Niu, W. Li, Jihong Ding
Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings. Published 2003-08-03.
DOI: 10.1109/ICDAR.2003.1227756 (https://doi.org/10.1109/ICDAR.2003.1227756)
Citations: 4
Abstract
This paper describes a novel approach to named entity (NE) tagging on degraded documents. NE tagging is the process of identifying salient text strings in unstructured text, corresponding to names of people, places, organizations, times/dates, etc. Although NE tagging is typically part of a larger information extraction process, it has other applications, such as improving search in an information retrieval system, and post-processing the results of an OCR system. We focus on degraded documents, i.e., case-insensitive documents that lack orthographic information. Examples include output of speech recognition systems, as well as e-mail. The traditional approach involves retraining an NE tagger on degraded text, a cumbersome operation. This paper describes an approach whereby text is first "restored" to its implicit case-sensitive form, and subsequently processed by the original NE tagger. Results show that this new approach leads to far less precision loss in NE tagging of degraded documents.
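The restore-then-tag pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not say how case is restored, so the most-frequent-form lookup below is an assumption chosen for brevity, not the authors' method. The function names (`train_case_model`, `restore_case`, `toy_ne_tagger`) and the tiny corpus are hypothetical, and the "tagger" is only a stand-in that relies on capitalization the way a case-sensitive NE tagger might.

```python
from collections import Counter, defaultdict


def train_case_model(cased_sentences):
    """Count the cased surface forms observed for each lowercased word."""
    forms = defaultdict(Counter)
    for sentence in cased_sentences:
        tokens = sentence.split()
        # Sentence-initial tokens are capitalized whatever the word is,
        # so they carry no reliable case evidence and are skipped.
        for token in tokens[1:]:
            forms[token.lower()][token] += 1
    return forms


def restore_case(text, forms):
    """Map each token to its most frequent cased form (assumed heuristic)."""
    restored = []
    for token in text.split():
        key = token.lower()
        if key in forms:
            restored.append(forms[key].most_common(1)[0][0])
        else:
            # Unseen words fall back to lowercase.
            restored.append(key)
    if restored:
        # Capitalize the first token of the sentence.
        restored[0] = restored[0][:1].upper() + restored[0][1:]
    return " ".join(restored)


def toy_ne_tagger(text):
    """Hypothetical case-sensitive tagger: flags capitalized non-initial tokens."""
    tokens = text.split()
    return [t for t in tokens[1:] if t[:1].isupper()]


# Hypothetical case-sensitive training text.
corpus = [
    "Yesterday John Smith visited Buffalo",
    "The company hired John Smith in Buffalo",
    "We visited the company yesterday",
]
forms = train_case_model(corpus)

# Degraded (case-insensitive) input, e.g. all-caps text.
degraded = "JOHN SMITH VISITED THE COMPANY IN BUFFALO"
restored = restore_case(degraded, forms)

candidates_degraded = toy_ne_tagger(degraded)   # every non-initial token is flagged
candidates_restored = toy_ne_tagger(restored)   # only tokens restored to capitalized form
print(restored)
print(candidates_restored)
```

In this toy setting the unchanged tagger over-generates on the degraded input because every token looks capitalized, while on the restored text it flags only the tokens whose restored form is capitalized. The original tagger is reused as-is, which is the point of the approach: no retraining on degraded text is needed.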