Dual-stage explainable ensemble learning model for diabetes diagnosis
Ibrahim A. Elgendy, Mohamed Hosny, Mousa Ahmad Albashrawi, Shrooq Alsenan
Journal: Expert Systems with Applications, volume 274, Article 126899 (impact factor 7.5; JCR Q1, Computer Science, Artificial Intelligence)
Publication type: Journal Article
Publication date: 2025-02-22
DOI: 10.1016/j.eswa.2025.126899
URL: https://www.sciencedirect.com/science/article/pii/S0957417425005214
Citations: 0
Abstract
Early diagnosis of diabetes is crucial for effective management and prevention of complications. However, traditional diagnostic methods are often constrained by the complexity of clinical datasets. To this end, this study proposes a novel explainable machine learning (ML) framework to enhance diabetes prediction. Specifically, the methodology detects outliers using the local outlier factor and reconstructs the data through a sparse autoencoder. Subsequently, multiple imputation strategies are employed to address missing or erroneous data, while the synthetic minority oversampling technique (SMOTE) is applied to mitigate class imbalance. Afterward, a stacking ensemble of seven base ML models is developed for classification, and the outputs of these base models are aggregated using four meta models. To enhance interpretability, two layers of model explainability are implemented: feature importance analysis identifies the significance of the input variables, and Shapley additive explanations (SHAP) assess the contribution of each base model to the meta model predictions. The results show that replacing missing data with zeros or mean values noticeably decreases accuracy compared with K-nearest neighbor imputation or removing the affected samples. Notably, hypertension and kidney failure are pivotal features in the diabetes diagnosis process. Among the base models, the Extra Trees model had the most significant impact on the meta model decisions. The stacking multi-layer perceptron model achieved the highest accuracy, 92.54%, for diabetes detection, surpassing standalone ML techniques. This approach enhances diagnostic precision and provides transparency in model predictions, which is essential for clinical applications.
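The two preprocessing ideas named in the abstract, K-nearest neighbor imputation and synthetic minority oversampling, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function names (nan_distance, knn_impute, smote), the toy rows, and the choice of k are all assumptions made for illustration, and the paper's actual features and settings are not given in the abstract.

```python
import math
import random


def nan_distance(a, b):
    """Euclidean distance over features observed in both rows, rescaled to
    the full feature count; infinite if the rows share no observed feature."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))


def knn_impute(rows, k=2):
    """Replace each missing value (None) with the mean of that feature over
    the k nearest rows in which the feature is observed."""
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: every other row that has feature j observed.
            donors = sorted(
                (nan_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            if nearest:  # leave the gap if no usable donor exists
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples, each a random point on the
    segment between a minority sample and one of its k nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (q - b) for b, q in zip(base, partner)])
    return synthetic


# Hypothetical toy data: three similar rows, one distant row, one missing value.
rows = [
    [1.0, 2.0],
    [1.2, None],
    [0.9, 2.2],
    [8.0, 9.0],
]
imputed = knn_impute(rows, k=2)

# Treat the first three rows as the minority class and oversample them.
minority = imputed[:3]
synthetic = smote(minority, n_new=3, k=2)
```

In this toy example the missing value is filled from its two closest rows (about 2.1), whereas a column mean would be pulled to about 4.4 by the distant row. That is one intuition, under these assumptions, for why mean or zero filling can hurt accuracy relative to neighbor-based imputation.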
About the journal:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.