Effect of Synthetic Minority Oversampling Technique (SMOTE), Feature Representation, and Classification Algorithm on Imbalanced Sentiment Analysis
Authors: W. Satriaji, R. Kusumaningrum
Published in: 2018 2nd International Conference on Informatics and Computational Sciences (ICICoS)
Publication date: 2018-10-01
DOI: https://doi.org/10.1109/ICICOS.2018.8621648
Citations: 14
Abstract
Comments received on Internet-based hotel reservation services are an important resource that hotel service providers, including hotel managers and hoteliers, can use to exercise quality control over their hotel reservation service. This in turn contributes to increased customer satisfaction and hotel revisits. In this study, Sentiment Analysis (SA) is used to analyse the comments received from customers. However, SA faces several problems, such as the unequal number of samples in each class (imbalanced datasets), the choice of classification algorithm, and the choice of feature representation. This research investigates how SMOTE (Synthetic Minority Oversampling Technique), which balances the amount of data in each class, affects the performance of sentiment analysis when combined with the Naïve Bayes (NB), Logistic Regression (LR), and Support Vector Machine (SVM) classification algorithms and with the term presence (TP), term occurrence (TO), and Term Frequency-Inverse Document Frequency (TF-IDF) feature representations. The study found that SMOTE was effective in improving classification performance on imbalanced data, as evidenced by an average improvement in model performance of approximately 12%. Among the feature representations, TO achieved an average G-mean score of 81.68%, followed by TP with 79.89% and TF-IDF with 79.31%. Among the classification algorithms, LR achieved an average G-mean score of 81.65%, followed by SVM with 81.55% and NB with 77.68%.
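The abstract names three feature representations, SMOTE oversampling, and the G-mean score without defining them. The following is a minimal standard-library Python sketch of those ideas, not the authors' implementation. Several choices here are assumptions: the toy documents, the plain tf * log(N / df) weighting (TF-IDF has several variants), the simplified single-sample SMOTE step, and the definition of G-mean as the geometric mean of per-class recall. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def term_presence(tokens, vocab):
    """TP: 1 if the term appears in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]


def term_occurrence(tokens, vocab):
    """TO: raw count of each term in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def tf_idf(docs, vocab):
    """TF-IDF using tf * log(N / df); an assumed variant, not the paper's."""
    n_docs = len(docs)
    doc_freq = {term: sum(1 for d in docs if term in d) for term in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[t] * math.log(n_docs / doc_freq[t]) for t in vocab])
    return rows


def smote_sample(minority, k=2, rng=random):
    """One synthetic point on the segment between a minority sample
    and one of its k nearest minority-class neighbours."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [b + gap * (q - b) for b, q in zip(base, neighbour)]


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall (assumed definition)."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == label]
        hits = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(hits / len(idx))
    return math.prod(recalls) ** (1 / len(recalls))


# Toy tokenised reviews, invented for illustration only.
docs = [["good", "good", "room"], ["bad", "room"], ["good", "staff"]]
vocab = sorted({term for doc in docs for term in doc})

print(term_presence(docs[0], vocab))    # [0, 1, 1, 0]
print(term_occurrence(docs[0], vocab))  # [0, 2, 1, 0]
print(round(g_mean([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1]), 3))  # 0.612
```

In the last call, accuracy is 4/6 but the recalls are 0.75 and 0.5, so the G-mean is sqrt(0.375). Because it multiplies per-class recalls, the score drops sharply when the minority class is poorly recognised, which is why it suits imbalanced data better than plain accuracy.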