Automatic Document Classification using Deep Feature Selection and Knowledge Transfer

Aissam Jadli, M. Hain
Published in: 2020 1st International Conference on Innovative Research in Applied Science, Engineering and Technology (IRASET)
Publication date: 2020-04-01
DOI: 10.1109/IRASET48871.2020.9092256 (https://doi.org/10.1109/IRASET48871.2020.9092256)
Citations: 3

Abstract

Documents in an ERP system arrive from different sources (customers, suppliers, etc.) and can have different layouts, sizes and subjects (invoices, delivery forms, checks, etc.). These documents are usually classified manually before being saved in the ERP system or processed by an Optical Character Recognition (OCR) engine. In this paper, we investigate using different deep convolutional neural networks (CNNs) to extract deep features from images of scanned documents. The extracted features are then processed by various machine learning classifiers such as Logistic Regression (LR), K-Nearest Neighbors (KNN), Support Vector Machine (SVM) and Gaussian Naive Bayes (GNB). Several metrics (accuracy, precision, etc.) were used to compare the models' performance, while cross-validation with different numbers of folds (4, 6, 8 and 10) was used to assess their generalization ability. The effect of dimensionality reduction techniques on overall performance was also explored. The best classification rate, 96.1%, was achieved by combining LR with the VGG19 model. This strong performance, obtained despite the small dataset (200 images), suggests that the approach could be used in an ERP system as a preprocessing step in document handling for ERP users.
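The evaluation protocol described in the abstract (a classifier trained on pre-extracted feature vectors, scored by cross-validation with 4, 6, 8 and 10 folds) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature vectors are synthetic Gaussian clusters standing in for CNN deep features, the number of classes (4), the feature dimension (16) and k = 3 are arbitrary choices, and the helper names `make_synthetic_features`, `knn_predict` and `cross_val_accuracy` are hypothetical. Only the dataset size (200 samples) and the fold counts come from the abstract. KNN is used because it is one of the classifiers the abstract names and is short to write with the standard library alone.

```python
import math
import random


def make_synthetic_features(n_per_class=50, n_classes=4, dim=16, seed=0):
    """Stand-in for CNN deep features: one Gaussian cluster per document class.

    Assumption: 4 classes x 50 samples = 200 vectors, matching only the
    dataset size reported in the abstract, not its real content.
    """
    rng = random.Random(seed)
    X, y = [], []
    for label in range(n_classes):
        center = [rng.uniform(-5, 5) for _ in range(dim)]
        for _ in range(n_per_class):
            X.append([c + rng.gauss(0, 1) for c in center])
            y.append(label)
    return X, y


def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training vectors."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], x))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def cross_val_accuracy(X, y, n_folds, k=3, seed=0):
    """Mean accuracy over n_folds disjoint folds of a shuffled index list."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_X = [X[i] for i in idx if i not in held]
        train_y = [y[i] for i in idx if i not in held]
        correct = sum(
            knn_predict(train_X, train_y, X[i], k) == y[i] for i in held_out
        )
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


X, y = make_synthetic_features()
# Fold counts taken from the abstract: 4, 6, 8 and 10.
results = {n: cross_val_accuracy(X, y, n) for n in (4, 6, 8, 10)}
for n, acc in results.items():
    print(f"{n}-fold mean accuracy: {acc:.3f}")
```

In a real pipeline the synthetic vectors would be replaced by features extracted from the scanned document images with a pretrained CNN, and the accuracies printed here say nothing about the 96.1% figure reported for LR with VGG19.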