A practical framework for early detection of diabetes using ensemble machine learning models

Qusay Saihood, Emrullah Sonuç
Turkish J. Electr. Eng. Comput. Sci., volume 41 1, pages 722-738. Published 2023-07-01. DOI: 10.55730/1300-0632.4013 (https://doi.org/10.55730/1300-0632.4013)
Citations: 1

Abstract

The diagnosis of diabetes, a prevalent global health condition, is crucial for preventing severe complications. In recent years, there has been a growing effort to develop intelligent diagnostic systems for diabetes using machine learning (ML) algorithms. Despite these efforts, achieving high accuracy with such systems remains a significant challenge. Recent advances in ensemble ML methods offer promising opportunities for early detection of diabetes, as they are known to be faster and more cost-effective than traditional approaches. This study therefore proposes a practical three-stage framework for diagnosing diabetes. The first stage, data preprocessing, covers handling missing values, identifying outliers, balancing the data, normalizing the data, and selecting relevant features. In the second stage, the hyperparameters of the ML algorithms are tuned using grid search to improve their performance. In the final stage, the framework employs ensemble techniques (bagging, boosting, and stacking) to combine multiple ML algorithms and further enhance their predictive capability. The open-access Pima Indians Diabetes Database was used to test the performance of the proposed models. The experimental results indicate that ensemble methods are superior to individual ML models in diagnosing diabetes. Stacking achieved the best accuracy among the ensemble methods, with the stacked random forest (RF) and support vector machine (SVM) model attaining an accuracy of 97.50%. Among the bagging methods, the RF model yielded the highest accuracy (97.20%), while among the boosting methods, the eXtreme Gradient Boosting (XGB) model achieved the highest accuracy (97.10%). Moreover, a comparison confirms that the proposed framework outperforms other ML models. The study demonstrates that ensemble methods are crucial for accurate diabetes diagnosis, enabling early detection through efficient preprocessing and calibrated models.
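
The abstract names the preprocessing tasks but not the specific techniques used for them. As a minimal sketch of three of those tasks (handling missing values, identifying outliers, normalizing), the Python below uses median imputation, the 1.5 x IQR rule, and min-max scaling. These three choices, the function names, and the sample readings are all assumptions made for illustration and are not taken from the paper. The sketch also assumes that a zero in a physiological measurement stands for a missing value, a convention commonly applied to the Pima Indians data.

```python
from statistics import median, quantiles


def impute_zeros_with_median(values):
    """Replace zeros (assumed to encode 'missing') with the median of the non-zero values."""
    observed = [v for v in values if v != 0]
    fill = median(observed)
    return [fill if v == 0 else v for v in values]


def iqr_outlier_flags(values, k=1.5):
    """Return True for each value outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical glucose-like readings; the two zeros stand for missing measurements.
glucose = [148, 85, 0, 89, 137, 116, 78, 0, 197, 125]

filled = impute_zeros_with_median(glucose)  # step 1: fill missing values
flags = iqr_outlier_flags(filled)           # step 2: flag candidate outliers
scaled = min_max_normalize(filled)          # step 3: normalize to [0, 1]

for raw, value, is_outlier, norm in zip(glucose, filled, flags, scaled):
    print(f"{raw:>4} -> {value:>6} outlier={is_outlier!s:<5} scaled={norm:.3f}")
```

In this sketch outliers are only flagged, not removed, because the abstract speaks of identifying outliers without saying how they are treated afterwards. In a real evaluation, the imputation median and the scaling bounds would be computed on the training split only and then applied to the test split, so that no test information leaks into preprocessing.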