Sentiment Analysis On YouTube Comments Using Word2Vec and Random Forest

Telematika. Pub Date: 2021-03-16. DOI: 10.31315/TELEMATIKA.V18I1.4493

S. Khomsah


Citations: 8

Abstract

Purpose: This study aims to determine the accuracy of sentiment classification using Random Forest, with Word2Vec Skip-gram used for feature extraction. Word2Vec is an effective method for representing aspects of word meaning, and it helps to improve sentiment classification accuracy.

Methodology: The research data consists of 31947 comments downloaded from the YouTube channel for the 2019 presidential election debate. The dataset contains 23612 positive comments and 8335 negative comments. To avoid bias, we balance the amount of positive and negative data using oversampling. We use Skip-gram to extract word features. Skip-gram produces several features from the context around each input word, and each of these features carries a weight. The feature weight of each comment is calculated with an average-based approach. Random Forest is used to build the sentiment classification model. Experiments were carried out several times with different epoch and window parameters. The performance of each experimental model was measured by cross-validation.

Result: Experiments with 1, 5, and 20 epochs and window sizes of 3, 5, and 10 yielded an average model accuracy of 90.1% to 91%. Testing, however, reached an accuracy between 88.77% and 89.05%. Testing accuracy was thus slightly lower than model accuracy, but the difference was not significant. For future experiments, we recommend using more than 20 epochs and a window size greater than 10, so that accuracy may increase significantly.

Value: The number of epochs and the window size used in Skip-gram affect accuracy. Larger values of both tend to increase accuracy.
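The average-based comment features and the class balancing described in the methodology can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the helper functions `average_vector` and `oversample`, the three-dimensional toy vectors (which stand in for real Skip-gram output), and the choice of random duplication up to the majority-class count. The abstract does not say which oversampling variant or target size the author used.

```python
import random


def average_vector(tokens, word_vectors, dim):
    """Average the vectors of the tokens found in the vocabulary."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def oversample(features, labels, seed=0):
    """Randomly duplicate rows of smaller classes until all classes match
    the largest one (assumed variant; the abstract does not specify it)."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in zip(features, labels):
        by_label.setdefault(y, []).append(x)
    target = max(len(rows) for rows in by_label.values())
    out_x, out_y = [], []
    for y, rows in by_label.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for x in rows + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Hypothetical 3-dimensional vectors standing in for Skip-gram output.
toy_vectors = {
    "good": [1.0, 0.0, 2.0],
    "debate": [0.0, 1.0, 0.0],
    "bad": [-1.0, 0.0, -2.0],
}

# Tokenized toy comments; "unknownword" is outside the vocabulary and is skipped.
comments = [["good", "debate"], ["good"], ["bad", "debate", "unknownword"]]
labels = ["positive", "positive", "negative"]

features = [average_vector(c, toy_vectors, dim=3) for c in comments]
balanced_x, balanced_y = oversample(features, labels)

# Class counts reported in the abstract, and the total if the minority
# class were raised to the majority count (an assumption).
n_pos, n_neg = 23612, 8335
total_before = n_pos + n_neg
total_after = 2 * max(n_pos, n_neg)
```

In a full pipeline the vectors would come from a trained Skip-gram model, and the balanced feature rows would be passed to a Random Forest classifier evaluated with cross-validation; neither step is shown here.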


Source journal: Telematika
Self-citation rate: 0.00%
Review time: 24 weeks