A Brief Survey of Text Document Classification Algorithms and Processes

Journal: International Journal of Data Mining Modelling and Management
Impact factor: 0.4 (JCR Q4, Computer Science, Artificial Intelligence)
Publication date: 2023-01-24
DOI: 10.46610/jodmm.2023.v08i01.002
N. Ranjan, R. Prasad
Citations: 0

Abstract

The exponential growth of unstructured data is one of the most critical challenges in data mining, text analytics, or data analytics. Around 80% of the world's data is available in unstructured format, and most of it is left unattended because of the complexity of analyzing it. The large number of terms and data patterns makes it very challenging to guarantee the quality of a text document classifier that classifies documents based on user preferences. The World Wide Web is growing rapidly, and the availability of electronic documents is increasing with it. Automatic categorization of documents is therefore a key factor in the systematic organization of information and in knowledge discovery. Most widely used text mining and classification strategies adopt term-based approaches, in which the problems of polysemy and synonymy are of great concern. Classifying documents according to their context requires a context-based approach. Semantic analysis of the text overcomes the limitations of the term-based approach and also enhances the accuracy of the classifiers. This paper highlights the important algorithms, techniques, and methodologies that can be used for text document classification, and it reviews the different stages of text document classification.
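The polysemy and synonymy problems of term-based approaches can be illustrated with a minimal sketch. The example sentences and the helper names `term_vector` and `cosine_similarity` are illustrative assumptions and are not taken from the paper; the sketch uses a plain bag-of-words representation, which is only one of many possible term-based models.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts: lowercase the text and split on whitespace."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Synonymy: similar meaning, but the documents share no surface terms.
doc_a = term_vector("physician prescribes medicine")
doc_b = term_vector("doctor recommends drugs")

# Polysemy: the shared term "bank" is used in two unrelated senses.
doc_c = term_vector("river bank erosion")
doc_d = term_vector("bank loan interest")

synonymy_score = cosine_similarity(doc_a, doc_b)
polysemy_score = cosine_similarity(doc_c, doc_d)

print("synonymy pair:", synonymy_score)
print("polysemy pair:", polysemy_score)
```

In this sketch the synonymous pair scores zero because no terms overlap, while the unrelated pair receives a positive score purely from the shared word "bank". A context-based or semantic representation would be expected to reverse that ordering.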
Source journal
International Journal of Data Mining Modelling and Management (Computer Science, Artificial Intelligence)
CiteScore: 1.10
Self-citation rate: 0.00%
Articles published: 22
Journal description: Facilitating transformation from data to information to knowledge is paramount for organisations. Companies are flooded with data and conflicting information, but with limited real usable knowledge. However, rarely should a process be looked at from limited angles or in parts. Isolated islands of data mining, modelling and management (DMMM) should be connected. IJDMMM highlights integration of DMMM, statistics/machine learning/databases, each element of data chain management, types of information, algorithms in software; from data pre-processing to post-processing; between theory and applications. Topics covered include:
- Artificial intelligence
- Biomedical science
- Business analytics/intelligence, process modelling
- Computer science, database management systems
- Data management, mining, modelling, warehousing
- Engineering
- Environmental science, environment (ecoinformatics)
- Information systems/technology, telecommunications/networking
- Management science, operations research, mathematics/statistics
- Social sciences
- Business/economics, (computational) finance
- Healthcare, medicine, pharmaceuticals
- (Computational) chemistry, biology (bioinformatics)
- Sustainable mobility systems, intelligent transportation systems
- National security