Automatic Document Classification using Deep Feature Selection and Knowledge Transfer
Authors: Aissam Jadli, M. Hain
Published in: 2020 1st International Conference on Innovative Research in Applied Science, Engineering and Technology (IRASET), 2020-04-01
DOI: https://doi.org/10.1109/IRASET48871.2020.9092256
Citations: 3
Abstract
Documents in an ERP system arrive from different sources (customers, suppliers, etc.) and vary in layout, size and subject (invoices, delivery forms, checks, etc.). These documents are usually classified manually before being saved in the ERP system or processed by an Optical Character Recognition (OCR) engine. In this paper, we investigate using different deep convolutional neural networks (CNNs) to extract deep features from images of scanned documents. The extracted features are then passed to various machine learning classifiers such as Logistic Regression (LR), K-Nearest Neighbors (KNN), Support Vector Machine (SVM) and Gaussian Naive Bayes (GNB). Several metrics (accuracy, precision, etc.) were used to compare the models' performance, while cross-validation with different numbers of folds (4, 6, 8 and 10) was used to assess their generalization ability. The effect of dimensionality reduction techniques on overall performance was also explored. The best classification rate, 96.1%, was achieved by combining LR with the VGG19 model. This strong performance, obtained despite the small dataset (200 images), suggests that the approach could be used in an ERP system as a preprocessing step in document handling for ERP users.
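The evaluation protocol described in the abstract (a classifier trained on precomputed feature vectors and scored by cross-validation at 4, 6, 8 and 10 folds) can be illustrated with a minimal sketch. This is not the authors' implementation. It uses only the Python standard library and a simple KNN classifier, and the feature vectors are synthetic Gaussian clusters standing in for CNN features. The number of classes (4), the feature dimension (16), the value of k (3) and all function names are assumptions made for illustration. Only the total of 200 samples and the fold counts come from the abstract.

```python
import math
import random
from collections import Counter


def make_synthetic_features(n_per_class=50, n_classes=4, dim=16, seed=0):
    """Stand-in for CNN features (assumption): one Gaussian cluster per class."""
    rng = random.Random(seed)
    features, labels = [], []
    for label in range(n_classes):
        center = [rng.uniform(-5.0, 5.0) for _ in range(dim)]
        for _ in range(n_per_class):
            features.append([c + rng.gauss(0.0, 1.0) for c in center])
            labels.append(label)
    return features, labels


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training vectors closest to the query."""
    nearest = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_val_accuracy(features, labels, n_folds, k=3, seed=0):
    """Mean held-out accuracy over n_folds folds of a shuffled split."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_folds-th shuffled index, starting at position i.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_x = [features[i] for i in indices if i not in held]
        train_y = [labels[i] for i in indices if i not in held]
        correct = sum(
            knn_predict(train_x, train_y, features[i], k) == labels[i]
            for i in held_out
        )
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


# 4 classes x 50 samples = 200 vectors, matching the dataset size in the abstract.
features, labels = make_synthetic_features()
results = {n: cross_val_accuracy(features, labels, n) for n in (4, 6, 8, 10)}
for n_folds, accuracy in results.items():
    print(f"{n_folds}-fold mean accuracy: {accuracy:.3f}")
```

In a real pipeline the synthetic vectors would be replaced by features extracted from scanned document images with a pretrained CNN, and the classifier could be swapped for LR, SVM or GNB. Comparing the mean accuracy across fold counts gives a rough check on how stable the estimate is for a dataset this small.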