Key-Value Pair Searhing System via Tesseract OCR and Post Processing
Áron Zoltán Kaló, M. Sipos
2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI), published 2021-01-21
DOI: 10.1109/SAMI50585.2021.9378680
Citation count: 3
Abstract
Optical character recognition (OCR) systems extract text from images. Plain text is often sufficient, but some applications require key-value pairs. In this paper, we investigate using the open-source Tesseract OCR system to extract text from images and then search that text for key-value pairs. Image noise must be minimized with image processing algorithms before recognition. The output of Tesseract must then go through so-called post-processing procedures, which transform the recognition result and can improve the accuracy of the extracted information, for example with the help of regular expressions. The key-value pair search is performed after these procedures.
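The abstract does not give the concrete post-processing rules or the search procedure, so the following is only a minimal sketch of the general idea: normalize the OCR text with regular expressions, then look up known keys. The function names `post_process` and `find_pairs`, the normalization rules, and the sample keys are all hypothetical and not taken from the paper. The OCR step itself (for example, through a Tesseract wrapper) is left out, and the input is assumed to be the recognized text as a plain string.

```python
import re


def post_process(raw: str) -> str:
    """Normalize raw OCR text line by line (illustrative rules only)."""
    lines = []
    for line in raw.splitlines():
        # Collapse runs of spaces and tabs that OCR often introduces.
        line = re.sub(r"[ \t]+", " ", line).strip()
        # Assumption: the first ':' or ';' on a line separates key and value
        # (';' is treated as a misread ':'). Rewrite it as ': '.
        line = re.sub(r" ?[;:] ?", ": ", line, count=1).rstrip()
        if line:  # drop empty lines
            lines.append(line)
    return "\n".join(lines)


def find_pairs(text: str, keys: list[str]) -> dict[str, str]:
    """Return the value found for each known key in post-processed text."""
    pairs = {}
    for key in keys:
        # Match 'key: value' at the start of a line, ignoring case.
        pattern = re.compile(
            rf"^{re.escape(key)}: (.+)$", re.IGNORECASE | re.MULTILINE
        )
        match = pattern.search(text)
        if match:
            pairs[key] = match.group(1).strip()
    return pairs


# Hypothetical OCR output with typical spacing and separator errors.
raw_text = "Invoice  No ; 12345\n\nDate:2021-01-21\nTotal :  99.50 EUR\n"
cleaned = post_process(raw_text)
result = find_pairs(cleaned, ["Invoice No", "Date", "Total", "Customer"])
print(result)
```

Keys that do not occur in the text (here `Customer`) are simply absent from the result. A real system would likely need key-specific rules, such as correcting letters misread as digits in numeric fields.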