Performance Assessment of Machine Learning Based Models for Diabetes Prediction

R. Deo, S. Panigrahi

Published in: 2019 IEEE Healthcare Innovations and Point of Care Technologies (HI-POCT)
Publication date: 2019-11-01
DOI: 10.1109/HI-POCT45284.2019.8962811 (https://doi.org/10.1109/HI-POCT45284.2019.8962811)
Citations: 6

Abstract

Diabetes is a major chronic disease that affects all age groups, and its prevalence is increasing worldwide. Certain factors increase an individual's chance of developing diabetes. Prediction-based modeling has previously been used to support a prevention-based approach to diabetes, and these prediction models have predominantly relied on regression and feature elimination. This paper presents a machine learning-based approach to predicting the occurrence of diabetes in an individual from specific lifestyle and demographic factors. A publicly available dataset, continuous NHANES, was used. Statistical techniques were applied to compensate for the small data size caused by missing data and for class imbalance: the synthetic minority over-sampling technique (SMOTE), with distances computed using Gower's distance, was used to address class imbalance, and principal component analysis (PCA) was used for feature extraction. Predictive models were developed in MATLAB on a dataset of 140 samples and 11 predictor variables (converted to eight principal components). The output variable had two classes: diabetic and not diabetic. The data were split into 98 samples for training and 42 samples for testing. Two machine learning models, bagged trees and a linear support vector machine (SVM), were developed, and two validation techniques, 5-fold cross-validation and holdout validation, were assessed. The highest accuracy of 91% (90.82% on test data) was obtained by the linear SVM model under both 5-fold cross-validation and holdout validation (AUC of 0.908 in both cases).
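
The abstract names SMOTE combined with Gower's distance but gives no implementation details, and the authors worked in MATLAB. The following is only a minimal Python sketch of how the two ideas can fit together, not the authors' method. Gower's distance averages a per-feature dissimilarity: the absolute difference divided by the feature range for numeric features, and a 0/1 mismatch for categorical ones. The SMOTE step then interpolates between a minority sample and one of its nearest minority neighbours. The function names, the toy records (age, BMI, smoker flag), the choice of k, and the rule for categorical features (copy the value from either parent at random) are all assumptions made for illustration.

```python
import random


def numeric_ranges(rows, is_categorical):
    """Range (max - min) of each numeric column; 0.0 placeholder for categorical ones."""
    ranges = []
    for j, cat in enumerate(is_categorical):
        if cat:
            ranges.append(0.0)
        else:
            column = [row[j] for row in rows]
            ranges.append(max(column) - min(column))
    return ranges


def gower_distance(a, b, is_categorical, ranges):
    """Gower's distance: mean per-feature dissimilarity for mixed-type records."""
    total = 0.0
    for x, y, cat, rng in zip(a, b, is_categorical, ranges):
        if cat:
            total += 0.0 if x == y else 1.0   # simple mismatch for categories
        elif rng > 0:
            total += abs(x - y) / rng         # range-normalised absolute difference
        # a constant numeric column (rng == 0) contributes nothing
    return total / len(a)


def smote_gower(minority, is_categorical, n_new, k=3, seed=0):
    """Create n_new synthetic minority rows by SMOTE-style interpolation.

    Assumptions (not from the paper): ranges are computed on the minority
    rows only, and categorical values are copied from one parent at random.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    ranges = numeric_ranges(minority, is_categorical)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: gower_distance(base, row, is_categorical, ranges))
        neighbour = rng.choice(others[:k])    # one of the k nearest neighbours
        u = rng.random()                      # interpolation weight in [0, 1)
        new_row = []
        for x, y, cat in zip(base, neighbour, is_categorical):
            new_row.append(rng.choice([x, y]) if cat else x + u * (y - x))
        synthetic.append(tuple(new_row))
    return synthetic


# Made-up minority-class records: (age, BMI, smoker). Not NHANES data.
minority = [(45, 27.5, "yes"), (52, 31.0, "no"), (60, 29.0, "yes"), (48, 33.5, "yes")]
is_cat = [False, False, True]
new_rows = smote_gower(minority, is_cat, n_new=3)
```

Because each synthetic numeric value is a convex combination of two existing values, every generated row stays inside the range spanned by the original minority samples. In a real pipeline the feature ranges would more plausibly be computed on the full training set, and oversampling would be applied to the training portion only so that synthetic rows do not leak into the test data.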