A Comparative Analysis of Logistic Regression, Random Forest and KNN Models for the Text Classification

Kanish Shah, Henil Patel, Devanshi Sanghvi, Manan Shah
Augmented Human Research, 5(1). Journal Article. Published 2020-03-05. DOI: 10.1007/s41133-020-00032-0. https://link.springer.com/article/10.1007/s41133-020-00032-0
Citations: 36

Abstract

A huge volume of textual documents is now being generated, and there is an urgent need to organize them in a proper structure so that they can be classified and their categories properly defined. Text classification is the key technology for gaining insight into textual information and organizing it: each document is assigned to a class by determining the text type of its content. For the machine learning algorithms used in this paper, the text classification system is divided into four stages, namely text pre-treatment, text representation, implementation of the classifier, and classification. In this paper, a BBC news text classification system is designed. In the classifier implementation stage, the authors separately chose and compared logistic regression, random forest and K-nearest neighbour as the classification algorithms. These classifiers were then tested, analysed and compared with each other to reach a conclusion. The experiments show that the BBC news text classification model achieves satisfying results with the algorithms tested on the data set. The comparison is based on five parameters: precision, accuracy, F1-score, support and confusion matrix. The classifier that scores highest across these parameters is termed the best machine learning algorithm for the BBC news data set.
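
The four stages and the evaluation parameters named in the abstract can be illustrated with a minimal standard-library sketch. It is not the authors' implementation. TF-IDF is assumed as the text representation, since the abstract does not say which one was used, and only K-nearest neighbour is shown out of the three classifiers. The toy sentences, the `sport` and `tech` labels, and all function names are hypothetical and are not taken from the BBC data set. Recall appears only as an intermediate value for the F1-score.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (an assumption); a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}

def preprocess(text):
    """Stage 1 (pre-treatment): lowercase, tokenise, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def fit_idf(token_docs):
    """Stage 2a: learn smoothed inverse document frequencies from training docs."""
    n = len(token_docs)
    doc_freq = Counter(term for doc in token_docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf(tokens, idf):
    """Stage 2b (representation): unit-length sparse TF-IDF vector."""
    counts = Counter(t for t in tokens if t in idf)  # unseen terms are ignored
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def knn_predict(query, train_vecs, train_labels, k=3):
    """Stages 3 and 4: majority vote among the k most similar training documents."""
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def evaluate(y_true, y_pred):
    """Return accuracy, per-class precision/F1/support, and the confusion matrix."""
    labels = sorted(set(y_true) | set(y_pred))
    # confusion[true_label][predicted_label] = number of documents
    confusion = {t: {p: 0 for p in labels} for t in labels}
    for t, p in zip(y_true, y_pred):
        confusion[t][p] += 1
    report = {}
    for c in labels:
        tp = confusion[c][c]
        predicted = sum(confusion[t][c] for t in labels)  # column total
        support = sum(confusion[c].values())              # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[c] = {"precision": precision, "f1": f1, "support": support}
    accuracy = sum(confusion[c][c] for c in labels) / len(y_true)
    return accuracy, report, confusion

# Made-up toy corpus (NOT the BBC data set) with two hypothetical categories.
train = [
    ("The striker scored a late goal in the cup final", "sport"),
    ("The team won the match after extra time", "sport"),
    ("The coach praised the players after the match", "sport"),
    ("The company released a new smartphone with a faster chip", "tech"),
    ("Software update fixes a security bug in the browser", "tech"),
    ("Engineers built a faster chip for the new laptop", "tech"),
]
test = [
    ("Players celebrated the goal in the final match", "sport"),
    ("New browser software runs on the laptop chip", "tech"),
]

train_tokens = [preprocess(text) for text, _ in train]
idf = fit_idf(train_tokens)
train_vecs = [tfidf(tokens, idf) for tokens in train_tokens]
train_labels = [label for _, label in train]

predictions = [knn_predict(tfidf(preprocess(text), idf), train_vecs, train_labels)
               for text, _ in test]
accuracy, report, confusion = evaluate([label for _, label in test], predictions)
print(predictions, accuracy)
print(report)
print(confusion)
```

In this sketch only `knn_predict` is specific to K-nearest neighbour. Comparing it against logistic regression or random forest would mean replacing that one function while keeping the same pre-treatment, representation and `evaluate` steps, so that all classifiers are scored on identical inputs.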
