A Comparison of Supervised Text Classification and Resampling Techniques for User Feedback in Bahasa Indonesia
Authors: Dhammajoti, J. Young, A. Rusli
Published in: 2020 Fifth International Conference on Informatics and Computing (ICIC), 2020-11-03
DOI: 10.1109/ICIC50835.2020.9288588 (https://doi.org/10.1109/ICIC50835.2020.9288588)
Citations: 4
Abstract
User feedback is one of the most important sources of information for improving the quality of software products. Our current research focuses on a software product widely used in universities: the e-learning system. To reduce the effort of manually reading all submitted user feedback, building an automatic text classifier using various machine learning approaches is a popular solution. However, imbalanced data is a common challenge that can jeopardize a model's ability to find patterns and classify feedback correctly. Several techniques, ranging from random resampling of the data to artificially creating more data (e.g. SMOTE), have been proposed for handling imbalanced data and show promising results. This paper implements several numerical representations and resampling techniques (to handle imbalanced data), and then evaluates several popular supervised machine learning classification algorithms: Logistic Regression, Random Forest, Support Vector Machine, Naive Bayes, and Decision Tree. Finally, performance with and without resampling is compared using macro-average F1 scores. The results show that oversampling generally leads to better performance, except in a few cases where under-sampling performs better.
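To make the resampling and evaluation steps named in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It shows random oversampling, random under-sampling, and a macro-average F1 computation. The function names, the toy feedback strings, and the class labels "bug" and "request" are all hypothetical; the abstract does not state the paper's label set, data, or tooling. SMOTE is not shown, since it operates on numerical feature vectors rather than raw text.

```python
import random
from collections import Counter, defaultdict


def _group_by_class(texts, labels):
    """Group items by their class label, preserving input order."""
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)
    return by_class


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen items until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(texts, labels)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_texts.extend(items + extra)
        out_labels.extend([label] * target)
    return out_texts, out_labels


def random_undersample(texts, labels, seed=0):
    """Randomly drop items until every class matches the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(texts, labels)
    target = min(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        out_texts.extend(rng.sample(items, target))
        out_labels.extend([label] * target)
    return out_texts, out_labels


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so each class counts equally."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced toy data: 6 "bug" items versus 2 "request" items.
texts = [f"feedback {i}" for i in range(8)]
labels = ["bug"] * 6 + ["request"] * 2

over_texts, over_labels = random_oversample(texts, labels)
under_texts, under_labels = random_undersample(texts, labels)
print(Counter(over_labels))   # both classes now have 6 items
print(Counter(under_labels))  # both classes now have 2 items

# A classifier that always predicts the majority class is 75% accurate here,
# but its macro-average F1 is much lower because the minority class scores 0.
y_true = ["bug", "bug", "bug", "request"]
y_pred = ["bug", "bug", "bug", "bug"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.4286
```

In a comparison like the one described, resampling would normally be applied to the training split only, so that the macro-average F1 is measured on data with the original class distribution.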