Evaluating OCR and non-OCR text representations for learning document classifiers
Markus Junker, R. Hoch
Proceedings of the Fourth International Conference on Document Analysis and Recognition, 1997-08-18
DOI: 10.1109/ICDAR.1997.620671 (https://doi.org/10.1109/ICDAR.1997.620671)
Citations: 18
Abstract
In the literature, many feature types and learning algorithms have been proposed for document classification. However, an extensive and systematic evaluation of the various approaches has not yet been carried out. To investigate different text representations for document classification, we have developed a tool that transforms documents into feature-value representations suitable for standard learning algorithms. In this paper, we investigate seven document representations for German texts based on n-grams and single words. We compare their effectiveness in classifying OCR texts and the corresponding correct ASCII texts in two domains: business letters and abstracts of technical reports. Our results indicate that n-grams are an attractive technique that can even compete with techniques relying on morphological analysis. This holds for OCR texts as well as for correct ASCII texts.
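To make the idea of a feature-value representation concrete, the following is a minimal sketch of how a text might be turned into single-word counts and character n-gram counts. It is not the authors' tool: the function names (`char_ngrams`, `word_features`, `to_feature_vector`), the `_` word-boundary padding, the lowercasing, and the simulated OCR error "Rechnvng" are all assumptions made for illustration.

```python
from collections import Counter


def word_features(text: str) -> Counter:
    """Count whitespace-delimited, lowercased words (hypothetical helper)."""
    return Counter(text.lower().split())


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams inside each word.

    Each word is padded with '_' on both sides; this boundary marker is an
    assumption of the sketch, not something stated in the abstract.
    """
    counts: Counter = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


def to_feature_vector(counts: Counter, vocabulary: list[str]) -> list[int]:
    """Map feature counts onto a fixed vocabulary order (missing features get 0)."""
    return [counts.get(feature, 0) for feature in vocabulary]


if __name__ == "__main__":
    correct = "Rechnung"   # German for "invoice"
    ocr = "Rechnvng"       # made-up OCR confusion of 'u' with 'v'

    # Whole-word features: the misrecognized word shares nothing with the original.
    shared_words = set(word_features(correct)) & set(word_features(ocr))
    print("shared words:", sorted(shared_words))

    # Trigram features: most trigrams survive the single-character error.
    shared_trigrams = set(char_ngrams(correct, 3)) & set(char_ngrams(ocr, 3))
    print("shared trigrams:", sorted(shared_trigrams))
```

In this toy case the single misrecognized character removes the only word feature, while five of the eight trigrams are still shared. That is one intuition for why n-gram representations may hold up on OCR text; the abstract itself reports only the empirical comparison, not this explanation.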