A Comparative Analysis of Logistic Regression, Random Forest and KNN Models for the Text Classification

Augmented Human Research. Pub Date: 2020-03-05. DOI: 10.1007/s41133-020-00032-0

Kanish Shah, Henil Patel, Devanshi Sanghvi, Manan Shah

Citations: 36

Abstract

Large volumes of text documents are now generated, and there is an urgent need to organize them into a proper structure so that they can be classified and their categories properly defined. Text classification is the key technology for gaining insight into textual information and organizing it: documents are assigned to classes by determining the type of their content. Based on the machine learning algorithms used in this paper, the text classification system is divided into four stages: text pre-processing, text representation, classifier implementation and classification. A BBC news text classification system is designed. In the classifier implementation stage, the authors chose logistic regression, random forest and K-nearest neighbour as the classification algorithms and compared them. The classifiers were then tested, analysed and compared with one another to reach a conclusion. The experiments show that the BBC news text classification model achieves satisfactory results with the algorithms tested on the data set. The comparison is based on five parameters: precision, accuracy, F1-score, support and confusion matrix. The classifier that scores highest across these parameters is deemed the best machine learning algorithm for the BBC news data set.
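The evaluation criteria named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the three category labels and the toy predictions are hypothetical and chosen only for illustration. It shows how a confusion matrix is built and how precision, F1-score, support (the number of true instances of a class) and accuracy follow from it. Recall is computed as well because F1 is defined from precision and recall.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Build a matrix whose rows are true classes and columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_report(y_true, y_pred, labels):
    """Return precision, recall, F1 and support for each class."""
    matrix = confusion_matrix(y_true, y_pred, labels)
    report = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum: times the class was predicted
        support = sum(matrix[i])                   # row sum: true instances of the class
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support,
        }
    return report


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted class matches the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical toy data: three news categories and six documents (assumed, not from the paper).
labels = ["business", "sport", "tech"]
y_true = ["business", "business", "sport", "sport", "sport", "tech"]
y_pred = ["business", "sport", "sport", "sport", "tech", "tech"]

if __name__ == "__main__":
    for row in confusion_matrix(y_true, y_pred, labels):
        print(row)
    for label, scores in per_class_report(y_true, y_pred, labels).items():
        print(label, scores)
    print("accuracy:", accuracy(y_true, y_pred))
```

Running the same report for each classifier's predictions on a shared test set would give the kind of side-by-side comparison the abstract describes.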
