Merdin Shamal Salih, Rowaida Khalil, Subhi R. M. Zeebaree, D. A. Zebari, L. M. Abdulrahman, Nasiba Mahdi
Diabetic Prediction based on Machine Learning Using PIMA Indian Dataset
Communications on Applied Nonlinear Analysis. Published 2024-07-17. DOI: 10.52783/cana.v31.1008 (https://doi.org/10.52783/cana.v31.1008)
Abstract
Diabetes mellitus is a chronic condition that disrupts the metabolism of carbohydrates, lipids, and proteins. Hyperglycemia, characterised by elevated blood sugar levels, is the primary distinguishing characteristic of all forms of diabetes. The prevalence of diabetes has increased significantly with contemporary lifestyles, so early-stage diagnosis is essential. Data pre-processing is a crucial step when constructing classification models. The Pima Indian Diabetes dataset, available in the University of California Irvine (UCI) repository, is a challenging dataset with a higher proportion of missing values (48%) than comparable datasets. To improve the accuracy of the classification model, several rounds of data pre-processing are conducted on the Pima Diabetes dataset. The proposed approach consists of two stages: outlier removal and imputation in the first stage, and normalisation in the second. For the features, we applied principal component analysis (PCA). Finally, to classify the PIMA dataset, we used several classifiers, including Support Vector Machine (SVM), Random Forest (RF), Naïve Bayes (NB), and Decision Tree (DT). Testing showed a maximum accuracy of 89.86% when 80% of the data was used for training. This was achieved by integrating the feature selection technique with the classifier.
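The two pre-processing stages described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which outlier rule, imputation method, or normalisation was used, so the 1.5 x IQR fences, median imputation, and min-max scaling below are stand-ins. Treating zeros as missing follows common practice for this dataset. The function names and the toy rows are hypothetical. PCA and the classifiers are not shown.

```python
from statistics import median, quantiles


def mark_missing(rows, zero_is_missing):
    """Replace 0 with None in columns where a zero is assumed to mean 'missing'."""
    return [
        [None if (j in zero_is_missing and v == 0) else v for j, v in enumerate(row)]
        for row in rows
    ]


def iqr_bounds(values, k=1.5):
    """Return the lower and upper fences q1 - k*IQR and q3 + k*IQR (assumed rule)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def remove_outliers(rows, k=1.5):
    """Stage 1a: drop rows with any observed value outside its column's fences."""
    bounds = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        bounds.append(iqr_bounds(observed, k))
    return [
        r for r in rows
        if all(v is None or lo <= v <= hi for v, (lo, hi) in zip(r, bounds))
    ]


def impute_median(rows):
    """Stage 1b: fill each missing value with its column median (assumed method)."""
    n_cols = len(rows[0])
    medians = [median(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    return [[medians[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def min_max_scale(rows):
    """Stage 2: rescale every column to [0, 1] (assumed form of normalisation)."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(r, lows, highs)]
        for r in rows
    ]


def preprocess(rows, zero_is_missing):
    """Run both stages in the order given in the abstract."""
    rows = mark_missing(rows, zero_is_missing)
    rows = remove_outliers(rows)
    rows = impute_median(rows)
    return min_max_scale(rows)


# Hypothetical toy rows with two features (glucose, BMI); not real dataset records.
toy = [
    [85, 26.6], [89, 28.1], [0, 23.3], [137, 43.1], [116, 25.6],
    [78, 31.0], [115, 0], [110, 30.5], [600, 27.0],
]
cleaned = preprocess(toy, zero_is_missing={0, 1})
```

In an actual experiment the fences, medians, and scaling ranges would normally be computed on the 80% training split only and then applied to the held-out 20%, so that test data does not leak into the pre-processing statistics.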