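Analisis Komparatif Algoritme Machine Learning dan Penanganan Imbalanced Data pada Klasifikasi Kualitas Air Layak Minum (Comparative Analysis of Machine Learning Algorithms and Imbalanced Data Handling in Classifying Potable Water Quality)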

KONSTELASI: Konvergensi Teknologi dan Sistem Informasi. Pub Date: 2022-04-21. DOI: 10.24002/konstelasi.v2i1.5630

Generosa Lukhayu Pritalia


Citations: 1

Abstract

Water is essential for survival. Water quality currently needs to be monitored, assessed, and classified to understand the impact of industrialization. Water quality classification has been carried out using traditional methods such as WQI and Storet, as well as machine learning methods. Imbalanced data can bias a machine learning model toward predicting the majority class. In addition, using all features in the classification process can degrade classification performance and lead to high computation time. To overcome these problems, this study proposes several approaches: resampling the data to balance the classes, determining the most suitable and contributing features, and comparing the performance of machine learning algorithms in classifying potable water. Handling the imbalanced data and applying feature selection improved the performance of the algorithms; in particular, the improvement in accuracy reached 24.8% compared with a previous study. The best overall performance was obtained with Random Forest using the seven best features: 87% precision, 84% recall, 16% miss rate, 85% F-measure, and 85% test accuracy. However, another important aspect is the miss rate, and the smallest miss rate, 15%, was obtained with the Decision Tree algorithm.
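
The resampling step and the reported metrics can be illustrated with a minimal sketch. The abstract does not say which resampling technique was used, so random oversampling of the minority class is an assumption here, and the function names `random_oversample` and `binary_metrics` are hypothetical. The metric definitions are the standard ones: the miss rate is the false negative rate, that is, 1 minus recall, which matches the reported pair of 84% recall and 16% miss rate.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every class
    has as many rows as the largest class (assumed resampling method)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, miss rate, F-measure, and accuracy
    from true and predicted labels of a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    miss_rate = fn / (tp + fn) if tp + fn else 0.0  # equals 1 - recall
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "miss_rate": miss_rate,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Toy data: three "not potable" rows (0) and one "potable" row (1).
    rows, labels = random_oversample([[7.1], [6.8], [8.2], [7.0]], [0, 0, 0, 1])
    print(Counter(labels))
    print(binary_metrics([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0]))
```

In practice, resampling would be applied to the training split only, so that duplicated rows do not leak into the test set used to compute these metrics.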


