Predicting Missing Values in Medical Data via XGBoost Regression.

Journal of Healthcare Informatics Research (IF 5.9, Q1, Computer Science). Pub Date: 2020-12-01. Epub Date: 2020-08-03. DOI: 10.1007/s41666-020-00077-1

Xinmeng Zhang, Chao Yan, Cheng Gao, Bradley A Malin, You Chen

Journal of Healthcare Informatics Research, volume 4, issue 4, pages 383-394. Publication type: Journal Article.

Citations: 37

Abstract

Purpose: Patients' laboratory test results are a notable resource for supporting clinical investigation and enhancing medical research. However, for a variety of reasons, this type of data often contains a non-trivial number of missing values. For example, physicians may neglect to order tests or document the results. Such missingness reduces the degree to which the data can be used to learn efficient and effective predictive models. To address this problem, various approaches have been developed to impute missing laboratory values; however, their performance has been limited. This is due, in part, to the fact that no existing approach effectively leverages the contextual information 1) within an individual laboratory test variable or 2) between laboratory test variables.

Method: We introduce an approach that combines an unsupervised prefilling strategy with a supervised machine learning method, extreme gradient boosting (XGBoost), to leverage both types of context for imputation. We evaluated the methodology through a series of experiments on approximately 8,200 patient records in the MIMIC-III dataset.
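The two-stage idea in the Method paragraph can be illustrated with a minimal sketch. The abstract does not say which prefilling strategy or feature layout the authors use, so everything below is an assumption made for illustration: `prefill` uses linear interpolation as a stand-in prefilling step, and `build_rows` (a hypothetical helper) turns a patient's lab series into supervised training rows that mix longitudinal context (the same variable at nearby times) with cross-sectional context (other variables at the same time).

```python
from typing import List, Optional, Tuple

Series = List[Optional[float]]  # one lab variable over time; None marks a missing value


def prefill(series: Series) -> List[float]:
    """Assumed prefilling step: linear interpolation between the nearest
    observed neighbours; leading and trailing gaps copy the nearest value."""
    observed = [i for i, v in enumerate(series) if v is not None]
    if not observed:
        raise ValueError("series has no observed values")
    filled = []
    for i, v in enumerate(series):
        if v is not None:
            filled.append(v)
            continue
        left = max((j for j in observed if j < i), default=None)
        right = min((j for j in observed if j > i), default=None)
        if left is None:
            filled.append(series[right])
        elif right is None:
            filled.append(series[left])
        else:
            w = (i - left) / (right - left)
            filled.append(series[left] + w * (series[right] - series[left]))
    return filled


def build_rows(patient: List[Series], target: int) -> Tuple[List[List[float]], List[float]]:
    """Hypothetical feature builder: patient[v][t] is variable v at time t.
    Each observed target value becomes one training row."""
    others = [prefill(s) for v, s in enumerate(patient) if v != target]
    series = patient[target]
    n_obs = sum(v is not None for v in series)
    X, y = [], []
    for t, value in enumerate(series):
        if value is None or n_obs < 2:
            continue
        # Mask the label before prefilling so it cannot leak into its own features.
        context = prefill(series[:t] + [None] + series[t + 1:])
        lo, hi = max(t - 1, 0), min(t + 1, len(series) - 1)
        longitudinal = [context[lo], context[t], context[hi]]
        cross_sectional = [other[t] for other in others]
        X.append(longitudinal + cross_sectional)
        y.append(value)
    return X, y


# Toy patient with two lab variables over four time points (made-up numbers).
patient = [
    [1.0, None, 3.0, 4.0],     # variable 0, the imputation target
    [10.0, 12.0, None, 16.0],  # variable 1
]
X, y = build_rows(patient, target=0)
```

Rows shaped like `X` and `y` could then be passed to a gradient-boosted regressor such as `xgboost.XGBRegressor().fit(X, y)`, with one model trained per laboratory variable. That pairing is a guess at the general shape of the pipeline, not a description of the authors' implementation.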

Result: The results demonstrate that the new model outperforms baseline and state-of-the-art models on 13 commonly collected laboratory test variables. In terms of the normalized root mean square deviation (nRMSD), our model improves imputation by over 20% on average.
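For readers unfamiliar with the metric, the sketch below shows one way nRMSD might be computed. The abstract does not give the formula, so the normalization used here is an assumption: each error is divided by the range of the true values in that patient's series before squaring and averaging. Reading the reported improvement as a relative reduction in nRMSD is also an assumption, and the numbers are made up.

```python
import math
from typing import List, Sequence, Tuple

# One record per (patient, variable) series:
# (true values, imputed values, indices that were masked and then imputed)
Record = Tuple[Sequence[float], Sequence[float], Sequence[int]]


def nrmsd(records: List[Record]) -> float:
    """Root mean square of range-normalized errors over all masked entries.
    The per-series range normalization is an assumed definition."""
    total, count = 0.0, 0
    for truth, imputed, masked in records:
        span = max(truth) - min(truth)
        if span == 0:
            continue  # a constant series cannot be range-normalized
        for t in masked:
            total += ((imputed[t] - truth[t]) / span) ** 2
            count += 1
    if count == 0:
        raise ValueError("no masked entries to score")
    return math.sqrt(total / count)


def relative_improvement(baseline: float, model: float) -> float:
    """Fractional reduction in error relative to a baseline (lower nRMSD is better)."""
    return (baseline - model) / baseline


records = [
    ([1.0, 2.0, 3.0, 5.0], [1.0, 3.0, 3.0, 5.0], [1]),  # error 1.0 over range 4.0
    ([10.0, 20.0], [15.0, 20.0], [0]),                  # error 5.0 over range 10.0
]
score = nrmsd(records)  # sqrt((0.25**2 + 0.5**2) / 2)
gain = relative_improvement(baseline=0.25, model=0.20)  # about 0.20, a 20% reduction
```

Normalizing by each series' own range keeps variables with very different units and scales comparable when their errors are pooled.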

Conclusion: Missing data imputation for temporal variables can be substantially improved by combining a prefilling strategy with a supervised training technique that leverages the longitudinal and cross-sectional context simultaneously.


Source journal

Journal of Healthcare Informatics Research (Computer Science: Computer Science Applications)

CiteScore: 13.60

Self-citation rate: 1.70%

About the journal: Journal of Healthcare Informatics Research serves as a publication venue for innovative technical contributions highlighting analytics, systems, and human factors research in healthcare informatics. The journal is concerned with the application of computer science principles, information science principles, information technology, and communication technology to address problems in healthcare and everyday wellness. It highlights the most cutting-edge technical contributions in computing-oriented healthcare informatics. The journal covers three major tracks: (1) analytics: data analytics, knowledge discovery, and predictive modeling; (2) systems: building healthcare informatics systems (e.g., architecture, framework, design, engineering, and application); (3) human factors: understanding users or context, interface design, health behavior, and user studies of healthcare informatics applications.

Topics include but are not limited to:
· healthcare software architecture, framework, design, and engineering
· electronic health records
· medical data mining
· predictive modeling
· medical information retrieval
· medical natural language processing
· healthcare information systems
· smart health and connected health
· social media analytics
· mobile healthcare
· medical signal processing
· human factors in healthcare
· usability studies in healthcare
· user-interface design for medical devices and healthcare software
· health service delivery
· health games
· security and privacy in healthcare
· medical recommender systems
· healthcare workflow management
· disease profiling and personalized treatment
· visualization of medical data
· intelligent medical devices and sensors
· RFID solutions for healthcare
· healthcare decision analytics and support systems
· epidemiological surveillance systems and intervention modeling
· consumer and clinician health information needs, seeking, sharing, and use
· semantic Web, linked data, and ontology
· collaboration technologies for healthcare
· assistive and adaptive ubiquitous computing technologies
· statistics and quality of medical data
· healthcare delivery in developing countries
· health systems modeling and simulation
· computer-aided diagnosis