The Effect of Data Missingness on Machine Learning Predictions of Uncontrolled Diabetes Using All of Us Data
Zain Jabbar, Peter Washington
BioMedInformatics, published 2024-03-06
DOI: https://doi.org/10.3390/biomedinformatics4010043
Abstract
Electronic Health Records (EHRs) provide a vast amount of patient data relevant to predicting clinical outcomes, but the inherent presence of missing values makes it difficult to build performant machine learning models. This paper investigates the effect of various imputation methods on the National Institutes of Health’s All of Us dataset, which has a high degree of data missingness. We apply several imputation techniques, including mean substitution, constant filling, and multiple imputation, to the same dataset for the task of diabetes prediction. We find that imputing values leads to heteroskedastic model performance as data missingness increases: the more test values a patient is missing, the higher the variance in a diabetes model's AUROC, F1, precision, recall, and accuracy scores. This is a critical challenge in using EHR data for predictive modeling, and it highlights the need for future research on methods that mitigate the effects of missing data and heteroskedasticity in EHR-based predictive models.
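As a rough illustration of two of the imputation techniques named above, the following minimal sketch applies mean substitution and constant filling to a small table in which missing test values are represented as `None`. It is not the authors' pipeline: the function names, the `None` convention, the default fill value, and the helper `variance_by_missingness` (which groups hypothetical metric scores by number of missing values to check for heteroskedasticity) are all assumptions made for illustration. Multiple imputation is not shown.

```python
from collections import defaultdict
from statistics import mean, pvariance


def impute_mean(rows):
    """Mean substitution: replace each None with its column's observed mean.

    Assumption: a column with no observed values falls back to 0.0.
    """
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(mean(observed) if observed else 0.0)
    return [
        [col_means[j] if v is None else v for j, v in enumerate(r)]
        for r in rows
    ]


def impute_constant(rows, fill=0.0):
    """Constant filling: replace each None with a fixed value (hypothetical default 0.0)."""
    return [[fill if v is None else v for v in r] for r in rows]


def variance_by_missingness(missing_counts, scores):
    """Group scores by number of missing values and return the variance per group.

    `scores` are hypothetical metric values (e.g. accuracy), one per evaluation,
    paired with the missing-value count they were measured at.
    """
    groups = defaultdict(list)
    for count, score in zip(missing_counts, scores):
        groups[count].append(score)
    return {k: pvariance(v) for k, v in sorted(groups.items())}


# Hypothetical toy data: three patients, two lab tests.
rows = [[1.0, None], [3.0, 4.0], [None, 6.0]]
mean_filled = impute_mean(rows)
const_filled = impute_constant(rows, fill=-1.0)
```

If the variance returned by `variance_by_missingness` grows with the missing-value count, that would be consistent with the heteroskedastic pattern the abstract describes, although the paper's own analysis may be set up differently.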