Electronic Health Records (EHRs) consist of digitally stored patient and population health data. Unfortunately, EHRs are often far from complete, and the absence of expected values in these records is referred to as missingness. Missingness in EHRs hinders the use of Machine Learning (ML) for data mining and for developing decision support applications, and it also limits EHRs’ reusability for retrospective clinical studies. Indeed, missingness adversely affects the accuracy and reliability of both ML models and clinical studies. Imputation is an effective approach to dealing with missing values and improving the reliability of ML models and clinical studies. However, previous imputation studies are spread across different healthcare datasets and are not universally applicable. In addition, there is a lack of studies examining the rationale for imputing healthcare datasets, and the quality of imputation methods is often assessed without considering medical interpretation. In this study, we therefore aim to characterize the impact of different imputation methods on accuracy for cardiac EHRs, from both an ML and a medical perspective. Two cardiac EHR datasets with missing values for cardiovascular diseases (CVDs) are used. Multiple imputation methods (mean, median, K-nearest neighbor, and several variants of iterative imputation) are considered. From an ML perspective, the post-imputation effects are assessed by quantifying the ML models’ capability to classify CVDs. For clinical comprehension, the distribution of clinically interesting variables is evaluated. Our study shows that the information contained in missingness and the magnitude of variable missingness are the key factors in selecting imputation methods for diverse EHR-based applications.
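The imputation methods named above (mean, median, K-nearest neighbor, and iterative imputation) can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the cardiac EHR datasets used in the study, and the specific imputer settings (e.g. `n_neighbors=5`) are illustrative assumptions rather than the study's configuration.

```python
# Sketch: the four families of imputation methods mentioned in the abstract,
# applied to a synthetic matrix with values missing at random.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # toy stand-in for EHR variables
mask = rng.random(X.shape) < 0.2       # ~20% of entries set to missing
X[mask] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(random_state=0),
}

for name, imputer in imputers.items():
    X_imputed = imputer.fit_transform(X)
    # Every method should leave no missing values behind.
    assert not np.isnan(X_imputed).any()
    print(f"{name}: imputed {mask.sum()} missing entries")
```

In a post-imputation evaluation like the one described above, each `X_imputed` would then be fed to a CVD classifier, and the imputed variable distributions inspected for clinical plausibility.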