Feature Selection and Prediction Model for Type 2 Diabetes in the Chinese Population with Machine Learning
Jiaqi Hou, Yongsheng Sang, Yuping Liu, Li Lu
Proceedings of the 4th International Conference on Computer Science and Application Engineering, 2020-10-20
DOI: 10.1145/3424978.3425085 (https://doi.org/10.1145/3424978.3425085)
Citations: 6
Abstract
Diabetes is a chronic disease characterized by hyperglycemia. With the incidence of the disease rising in recent years, diabetes is affecting more and more families. In 2017 alone, it caused 5 million deaths and cost $850 billion in global healthcare. In this paper, we propose a method to predict the prevalence of diabetes based on a selected set of features from physical examination data. We used Fisher's score, recursive feature elimination (RFE) and a decision tree to select features. Random forest, logistic regression, SVM and MLP were used to predict the prevalence of diabetes. EA and Fisher's score helped us to reduce the dimensionality, and random forest classified diabetes accurately. Our results show that the highest accuracy (0.987) is achieved by random forest with 85 features. The prediction accuracy using Fisher's score with 19 features also reached 0.986. We finally selected 5 features with our method to form a new dataset for diabetes prediction: fasting plasma glucose, HbA1c, HDL, total cholesterol level and hypertension. On this dataset, the accuracy, precision, sensitivity, F1 score, MCC and AUC were 0.977, 0.968, 0.812, 0.883, 0.875, and 0.905, respectively. These results show that our method can be used to select features for a diabetes classifier and improve its performance, which will help clinicians identify diabetes quickly.
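To make the feature-ranking and evaluation steps concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common definition of Fisher's score (between-class scatter divided by within-class scatter, per feature), since the abstract does not state which variant was used. The function names, the toy values and the confusion-matrix counts are all hypothetical. The last line checks that the reported F1 score is consistent with the reported precision and sensitivity.

```python
from math import sqrt


def fisher_score(values, labels):
    """Fisher's score of a single feature (assumed standard definition):
    sum_k n_k * (mean_k - mean)^2 / sum_k n_k * var_k."""
    overall_mean = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for cls in set(labels):
        group = [v for v, y in zip(values, labels) if y == cls]
        mean = sum(group) / len(group)
        var = sum((v - mean) ** 2 for v in group) / len(group)
        between += len(group) * (mean - overall_mean) ** 2
        within += len(group) * var
    return between / within if within > 0 else float("inf")


def f1_from(precision, sensitivity):
    """F1 is the harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, sensitivity, F1 and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1_from(precision, sensitivity),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical toy data: 1 = diabetic, 0 = non-diabetic.
labels = [0, 0, 0, 1, 1, 1]
glucose = [5.0, 5.2, 4.8, 7.5, 8.1, 7.8]   # separates the classes well
noise = [1.0, 2.0, 3.0, 1.5, 2.5, 2.0]     # same mean in both classes

# Rank features by Fisher's score, highest first.
scores = {"glucose": fisher_score(glucose, labels),
          "noise": fisher_score(noise, labels)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)

# Hypothetical confusion-matrix counts, for illustration only.
print(classification_metrics(tp=8, fp=2, tn=85, fn=5))

# Consistency check on the reported values: precision 0.968 and
# sensitivity 0.812 give F1 of about 0.883, matching the abstract.
print(round(f1_from(0.968, 0.812), 3))
```

In a pipeline like the one described, the top-ranked features would then be passed to a classifier such as random forest. A sensitivity (0.812) noticeably below precision (0.968) is the pattern one would expect if diabetic cases are the minority class, which is also why MCC and F1 are reported alongside accuracy.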