Dinar Ajeng Kristiyanti, Samuel Ady Sanjaya, Vinsencius Christio Tjokro, Jason Suhali
Dealing imbalance dataset problem in sentiment analysis of recession in Indonesia
IAES International Journal of Artificial Intelligence (IJ-AI), published 2024-06-01
DOI: 10.11591/ijai.v13.i2.pp2060-2072 (https://doi.org/10.11591/ijai.v13.i2.pp2060-2072)
Citations: 0
Abstract
News of a global recession dominates social media, particularly in Indonesia, where social news platforms on Twitter generate public responses and retweets on the issue. Mining these opinions from Twitter with a sentiment analysis approach yields valuable insights. The research stages were data collection, pre-processing, data labeling with lexicon-based methods such as the valence aware dictionary and sentiment reasoner (VADER) and TextBlob, sampling with the synthetic minority oversampling technique (SMOTE) and random oversampling (ROS) before and after splitting the data, modeling with machine learning algorithms such as support vector machines (SVM), k-nearest neighbour (KNN), and naive Bayes, and model evaluation. The problem is that the nearly 300,000 records collected with NodeXL are imbalanced. The findings show that models trained on balanced datasets achieve better evaluation results. With VADER labeling and sampling applied after splitting the data, the Bernoulli naive Bayes algorithm obtains its best accuracy of 84% with SMOTE and 81% with ROS. By contrast, with SMOTE and ROS applied before splitting the data, the SVM algorithm reaches its best accuracy of 93%, compared with 84% when SVM is used without sampling.
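
The abstract distinguishes between sampling before and after splitting the data. The sketch below illustrates that distinction for random oversampling only, using plain Python and a toy dataset. It is not the authors' implementation: the function names, the 90/10 class ratio, the 20% test fraction, and the seed are all assumptions made for illustration, and SMOTE (which synthesizes new points rather than duplicating existing ones) is not shown.

```python
import random
from collections import Counter


def random_oversample(rows, rng):
    """Duplicate randomly chosen rows of smaller classes until all classes
    match the size of the largest class. Each row is (record_id, label)."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[1], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


def split(rows, test_fraction, rng):
    """Shuffle a copy of rows and cut it into (train, test)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy data: 90 negative and 10 positive records.
rows = [(i, "negative") for i in range(90)]
rows += [(i, "positive") for i in range(90, 100)]
rng = random.Random(0)  # fixed seed, chosen arbitrarily

# Sampling AFTER splitting: only the training partition is oversampled.
train, test = split(rows, 0.2, rng)
train_balanced = random_oversample(train, rng)

# Sampling BEFORE splitting: the whole dataset is oversampled, then split.
rows_balanced = random_oversample(rows, rng)
train_before, test_before = split(rows_balanced, 0.2, rng)

# Record ids that appear on both sides of each split.
shared_after = {r[0] for r in train_balanced} & {r[0] for r in test}
shared_before = {r[0] for r in train_before} & {r[0] for r in test_before}

print("after split, train labels:", Counter(r[1] for r in train_balanced))
print("after split, ids shared by train and test:", len(shared_after))
print("before split, ids shared by train and test:", len(shared_before))
```

When sampling happens after the split, the test partition keeps its original class distribution and shares no records with the training partition. When sampling happens before the split, copies of the same minority record can land in both partitions, which is worth keeping in mind when comparing accuracy figures obtained under the two orders.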