Predicting Missing Values in Medical Data via XGBoost Regression.

Journal of Healthcare Informatics Research (IF 5.9, Q1, Computer Science) | Pub Date: 2020-12-01 | Epub Date: 2020-08-03 | DOI: 10.1007/s41666-020-00077-1
Xinmeng Zhang, Chao Yan, Cheng Gao, Bradley A Malin, You Chen
Volume 4, Issue 4, pp. 383-394. Publication type: Journal Article.
Citations: 37

Abstract

Purpose: The data in a patient's laboratory test results are a notable resource for supporting clinical investigation and enhancing medical research. However, for a variety of reasons, this type of data often contains a non-trivial number of missing values. For example, physicians may neglect to order tests or document the results. Such missingness reduces the degree to which the data can be used to learn efficient and effective predictive models. To address this problem, various approaches have been developed to impute missing laboratory values; however, their performance has been limited. This is due, in part, to the fact that no existing approach effectively leverages the contextual information (1) within an individual laboratory test variable or (2) between laboratory test variables.

Method: We introduce an approach that combines an unsupervised prefilling strategy with a supervised machine learning model, extreme gradient boosting (XGBoost), to leverage both types of context for imputation. We evaluated the methodology through a series of experiments on the records of approximately 8,200 patients in the MIMIC-III dataset.
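The two-stage idea in the Method paragraph can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which prefilling method or which features were used, so linear interpolation, the one-step time window, and the names `prefill_linear` and `build_training_rows` are all hypothetical choices made here. The sketch prefills each laboratory series without supervision, then builds feature rows that mix longitudinal context (the target variable's own neighbours in time) with cross-sectional context (the other variables at the same time step).

```python
from typing import List, Optional, Tuple

Series = List[Optional[float]]  # one lab variable over time; None = missing


def prefill_linear(series: Series, fallback: float = 0.0) -> List[float]:
    """Assumed prefilling step: linear interpolation between the nearest
    observed neighbours; gaps at either end copy the nearest observed value."""
    observed = [i for i, v in enumerate(series) if v is not None]
    if not observed:
        return [fallback] * len(series)
    filled: List[float] = []
    for i, v in enumerate(series):
        if v is not None:
            filled.append(float(v))
            continue
        left = max((j for j in observed if j < i), default=None)
        right = min((j for j in observed if j > i), default=None)
        if left is None:
            filled.append(float(series[right]))
        elif right is None:
            filled.append(float(series[left]))
        else:
            w = (i - left) / (right - left)
            filled.append(series[left] + w * (series[right] - series[left]))
    return filled


def build_training_rows(raw: List[Series], target: int) -> Tuple[List[List[float]], List[float]]:
    """raw[k][t] is lab variable k at time step t for one patient.
    Returns (X, y) with one row per time step where the target was observed."""
    others = [prefill_linear(s) for k, s in enumerate(raw) if k != target]
    n_steps = len(raw[target])
    X: List[List[float]] = []
    y: List[float] = []
    for t in range(n_steps):
        if raw[target][t] is None:
            continue
        # Hide the value being predicted before prefilling, so that it
        # cannot leak into its own features through interpolation.
        masked = list(raw[target])
        masked[t] = None
        ctx = prefill_linear(masked)
        # Longitudinal context: the target's prefilled neighbours in time.
        row = [ctx[max(t - 1, 0)], ctx[min(t + 1, n_steps - 1)]]
        # Cross-sectional context: the other variables at the same time step.
        row += [series[t] for series in others]
        X.append(row)
        y.append(float(raw[target][t]))
    return X, y


# Toy patient with two lab variables over four time steps (made-up numbers).
raw = [[1.0, None, 3.0, 4.0], [10.0, 20.0, None, 40.0]]
X, y = build_training_rows(raw, target=0)
```

A supervised regressor, for example `xgboost.XGBRegressor`, would then be fit on `(X, y)` pooled across patients and used to predict the target at time steps where it is missing, with features built the same way.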

Result: The results demonstrate that the new model outperforms baseline and state-of-the-art models on 13 commonly collected laboratory test variables. In terms of the normalized root mean square deviation (nRMSD), our model improves imputation by over 20%, on average.
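For readers unfamiliar with the metric, the following is a minimal sketch of one common way to compute nRMSD. The abstract does not state the exact normalization used in the paper, so dividing by the range of the actual values is an assumption here, and the function name `nrmsd` is illustrative.

```python
import math
from typing import Sequence


def nrmsd(actual: Sequence[float], imputed: Sequence[float]) -> float:
    """Root mean square deviation divided by the range of the actual values
    (assumed normalization; lower is better, 0.0 means a perfect match)."""
    if len(actual) == 0 or len(actual) != len(imputed):
        raise ValueError("inputs must be non-empty and of equal length")
    value_range = max(actual) - min(actual)
    if value_range == 0:
        raise ValueError("range of actual values is zero; nRMSD is undefined")
    mse = sum((a - b) ** 2 for a, b in zip(actual, imputed)) / len(actual)
    return math.sqrt(mse) / value_range
```

Because the error is scaled by the spread of the variable, scores for laboratory tests measured in different units can be compared or averaged.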

Conclusion: Missing data imputation on temporal variables can be substantially improved by combining a prefilling strategy with a supervised training technique, which together leverage the longitudinal and cross-sectional context simultaneously.
