The Effect of Data Missingness on Machine Learning Predictions of Uncontrolled Diabetes Using All of Us Data

Zain Jabbar, Peter Washington
BioMedInformatics, published 2024-03-06. DOI: 10.3390/biomedinformatics4010043 (https://doi.org/10.3390/biomedinformatics4010043)

Abstract

Electronic Health Records (EHR) provide a vast amount of patient data relevant to predicting clinical outcomes. However, the inherent presence of missing values makes it difficult to build performant machine learning models. This paper investigates the effect of various imputation methods on the National Institutes of Health’s All of Us dataset, which contains a high degree of data missingness. We apply several imputation techniques, including mean substitution, constant filling, and multiple imputation, to the same dataset for the task of diabetes prediction. We find that imputing values leads to heteroskedastic performance of machine learning models as data missingness increases. That is, the more missing test values a patient has, the higher the variance in the diabetes model's AUROC, F1, precision, recall, and accuracy scores. This finding exposes a critical challenge in using EHR data for predictive modeling and points to the need for future research on methodologies that mitigate the effects of missing data and heteroskedasticity in EHR-based predictive models.
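To make two of the imputation strategies named in the abstract concrete, the following is a minimal sketch using only the Python standard library. It is not the authors' pipeline: the column names, the toy values, and the helper names (`impute_mean`, `impute_constant`, `missing_count`, `variance_by_missingness`) are all hypothetical and chosen for illustration. Multiple imputation is not shown. The last helper illustrates one way the heteroskedasticity described above could be checked, by grouping model scores by the number of missing values and comparing their variance.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Toy lab-test table: rows are patients, None marks a missing measurement.
# Column names and values are invented for illustration only.
COLUMNS = ["hba1c", "glucose", "bmi"]
ROWS = [
    [7.0, 150.0, None],
    [None, 110.0, 31.0],
    [9.0, None, None],
    [6.0, 100.0, 25.0],
]


def column_means(rows):
    """Mean of the observed (non-missing) values in each column.

    Raises statistics.StatisticsError if a column has no observed values.
    """
    n_cols = len(rows[0])
    return [mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]


def impute_mean(rows):
    """Mean substitution: replace each missing value with its column mean."""
    means = column_means(rows)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def impute_constant(rows, fill=0.0):
    """Constant filling: replace each missing value with a fixed constant."""
    return [[fill if v is None else v for v in r] for r in rows]


def missing_count(rows):
    """Number of missing measurements per patient (row)."""
    return [sum(v is None for v in r) for r in rows]


def variance_by_missingness(missing, scores):
    """Group scores by missing-value count and return the population
    variance of the scores within each group."""
    groups = defaultdict(list)
    for m, s in zip(missing, scores):
        groups[m].append(s)
    return {m: pvariance(vals) for m, vals in sorted(groups.items())}
```

In this sketch, `impute_mean(ROWS)` and `impute_constant(ROWS, fill=-1.0)` return new tables and leave `ROWS` unchanged, so both strategies can be compared on the same input. If the variances returned by `variance_by_missingness` grow with the missing-value count, that is the pattern the abstract calls heteroskedastic performance.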