Debashis Chatterjee, Prithwish Ghosh, Amlan Banerjee, S. S. Das
{"title":"Optimizing machine learning for water safety: A comparative analysis with dimensionality reduction and classifier performance in potability prediction","authors":"Debashis Chatterjee, Prithwish Ghosh, Amlan Banerjee, S. S. Das","doi":"10.1371/journal.pwat.0000259","DOIUrl":null,"url":null,"abstract":"In this study, we investigated the effectiveness of machine learning techniques in predicting water potability based on water quality attributes. Initially, we applied seven classification-based methods directly to the original dataset, yielding varying accuracy scores. Notably, the Support Vector Machine (SVM) achieved the highest accuracy of 69%, while other methods such as XGBoost, k-Nearest Neighbors, Gaussian Naive Bayes, and Random Forest demonstrated competitive performance with scores ranging from 62% to 68%. Subsequently, we employed Principal Component Analysis (PCA) to reduce the dataset’s dimensionality to six principal components, followed by reapplication of the machine learning techniques. The results showed an increase in accuracy across all classifiers, increasing to nearly 100%. This study provides insights into the impact of dimensionality reduction on predictive accuracy and underscores the importance of selecting appropriate techniques for water potability prediction.","PeriodicalId":93672,"journal":{"name":"PLOS water","volume":"2 3","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"PLOS water","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1371/journal.pwat.0000259","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
In this study, we investigated the effectiveness of machine learning techniques in predicting water potability based on water quality attributes. Initially, we applied seven classification-based methods directly to the original dataset, yielding varying accuracy scores. Notably, the Support Vector Machine (SVM) achieved the highest accuracy of 69%, while other methods such as XGBoost, k-Nearest Neighbors, Gaussian Naive Bayes, and Random Forest demonstrated competitive performance with scores ranging from 62% to 68%. Subsequently, we employed Principal Component Analysis (PCA) to reduce the dataset’s dimensionality to six principal components, followed by reapplication of the machine learning techniques. The results showed an increase in accuracy across all classifiers, increasing to nearly 100%. This study provides insights into the impact of dimensionality reduction on predictive accuracy and underscores the importance of selecting appropriate techniques for water potability prediction.