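Analysis of the Impact of Vectorization Methods on Machine Learning-Based Sentiment Analysis of Tweets Regarding Readiness for Offline Learning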

JUITA : Jurnal Informatika | Pub Date: 2023-11-17 | DOI: 10.30595/juita.v11i2.17568

Yesi Novaria Kunang, Widya Putri Mentari


Citations: 0

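Abstract

Twitter users express emotions on social media, whether criticism or praise. Analyzing the opinions or sentiments in their tweets can identify how they feel about a particular topic. This study aims to determine the impact of vectorization methods on the analysis of public sentiment regarding readiness for offline learning in Indonesia during the Covid-19 pandemic. The authors labeled sentiment using two approaches: manually, and automatically with the TextBlob NLP library. Three vectorization methods were compared: count vectorization, TF-IDF, and a combination of both. The resulting feature vectors were then classified with three methods (naïve Bayes, logistic regression, and k-nearest neighbor) for both the manually and the automatically labeled data. Model performance was assessed with accuracy, precision, recall, and F1-score. The logistic regression classifier with the feature extraction technique that combines count vectorization and TF-IDF gave the best performance on both the manually and the automatically labeled data.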
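The three feature representations named in the abstract can be illustrated with a minimal, dependency-free sketch. Several things here are assumptions rather than details from the paper: the abstract does not say how count vectorization and TF-IDF are combined, so simple concatenation is assumed; the smoothed IDF formula and L2 normalisation follow the defaults of scikit-learn's `TfidfVectorizer`, which may differ from what the authors used; and the helper names and toy tweets are hypothetical.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct lowercase token to a column index (sorted order)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def count_vector(doc, vocab):
    """Count vectorization: raw term counts for one document."""
    counts = Counter(doc.lower().split())
    return [float(counts.get(tok, 0)) for tok in vocab]


def idf_weights(docs, vocab):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc.lower().split()))
    return [math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in vocab]


def tfidf_vector(doc, vocab, idf):
    """TF-IDF: term counts scaled by IDF, then L2-normalised."""
    weighted = [c * w for c, w in zip(count_vector(doc, vocab), idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm for x in weighted] if norm else weighted


def combined_vector(doc, vocab, idf):
    """ASSUMPTION: the 'combination' is a concatenation of both vectors."""
    return count_vector(doc, vocab) + tfidf_vector(doc, vocab, idf)


# Hypothetical toy tweets, not data from the study.
tweets = [
    "ready for offline learning",
    "not ready for offline school",
    "offline learning is great",
]
vocab = build_vocabulary(tweets)
idf = idf_weights(tweets, vocab)
features = [combined_vector(t, vocab, idf) for t in tweets]
```

Each row of `features` has twice as many columns as the vocabulary: the first half holds raw counts and the second half holds normalised TF-IDF weights. Rows like these could then be passed to any of the three classifiers the abstract lists.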
