Gaussian process regression and classification using International Classification of Disease codes as covariates

Stat | Pub Date: 2023-10-07 | DOI: 10.1002/sta4.618 | IF 0.7 | Region 4 (Mathematics) | Q3 (Statistics & Probability)

Sanvesh Srivastava, Zongyi Xu, Yunyi Li, W. Nick Street, Stephanie Gilbertson-White


Citations: 0

Abstract

In electronic health record (EHR) data analysis, nonparametric regression and classification using International Classification of Disease (ICD) codes as covariates remain understudied. Automated methods have been developed over the years for predicting biomedical responses using EHRs, but relatively less attention has been paid to developing patient similarity measures that use ICD codes and chronic conditions, where a chronic condition is defined as a set of ICD codes. We address this problem by first developing a string kernel function for measuring the similarity between a pair of primary chronic conditions, represented as subsets of ICD codes. Second, we extend this similarity measure to a family of covariance functions on subsets of chronic conditions. This family is used in developing Gaussian process (GP) priors for Bayesian nonparametric regression and classification using diagnoses and other demographic information as covariates. Markov chain Monte Carlo (MCMC) algorithms are used for posterior inference and predictions. The proposed methods are tuning-free, so they are ideal for automated prediction of biomedical responses depending on chronic conditions. We evaluate the practical performance of our method on EHR data collected from 1660 patients at the University of Iowa Hospitals and Clinics (UIHC) with six different primary cancer sites. Our method provides better sensitivity and specificity than its competitors in classifying different primary cancer sites and estimates the marginal associations between chronic conditions and primary cancer sites.
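To make the idea of a covariance function on sets of ICD codes concrete, here is a minimal sketch in Python. It is not the kernel or the inference scheme from the paper, whose exact forms the abstract does not give. As assumptions for illustration only, it uses a normalized shared-prefix kernel between ICD code strings, averages it over pairs of codes to compare two chronic conditions, averages again over pairs of conditions to compare two patients, and replaces MCMC with the closed-form GP regression posterior mean under a fixed noise variance. All function names, the toy conditions, and the response values are hypothetical.

```python
import numpy as np

def code_kernel(a, b):
    """Normalized shared-prefix kernel between two ICD code strings (assumed form)."""
    a, b = a.replace(".", ""), b.replace(".", "")
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    # Cosine-style normalization so that code_kernel(a, a) equals 1.
    return shared / np.sqrt(len(a) * len(b))

def condition_kernel(A, B):
    """Average code similarity between two conditions, each a set of ICD codes."""
    return float(np.mean([code_kernel(a, b) for a in A for b in B]))

def patient_kernel(P, Q):
    """Average condition similarity between two patients, each a list of conditions."""
    return float(np.mean([condition_kernel(A, B) for A in P for B in Q]))

def gram(patients_1, patients_2):
    """Matrix of patient_kernel values between two lists of patients."""
    return np.array([[patient_kernel(p, q) for q in patients_2] for p in patients_1])

def gp_posterior_mean(train, y, test, noise_var=0.1):
    """Closed-form GP regression posterior mean with a fixed, assumed noise variance."""
    K = gram(train, train) + noise_var * np.eye(len(train))
    K_star = gram(test, train)
    return K_star @ np.linalg.solve(K, y)

# Made-up conditions (sets of ICD-10 codes) and made-up responses.
diabetes = {"E11.9", "E11.65"}
hypertension = {"I10"}
ckd = {"N18.3", "N18.4"}

train = [[diabetes, hypertension], [hypertension], [ckd], [diabetes, ckd]]
y = np.array([1.2, 0.3, -0.5, 0.8])
test = [[diabetes], [hypertension, ckd]]

pred = gp_posterior_mean(train, y, test)
print(pred)
```

Averaging a positive semidefinite base kernel over pairs of set elements keeps the result positive semidefinite, so the Gram matrix can be used directly as a GP covariance. A fully Bayesian treatment along the lines described in the abstract would place priors on the kernel and noise parameters and sample them rather than fixing them.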




Source journal

Stat (Decision Sciences: Statistics, Probability and Uncertainty)

CiteScore: 1.10
Self-citation rate: 0.00%

Journal description: Stat is an innovative electronic journal for the rapid publication of novel and topical research results, publishing compact articles of the highest quality in all areas of statistical endeavour. Its purpose is to provide a means of rapid sharing of important new theoretical, methodological and applied research. Stat is a joint venture between the International Statistical Institute and Wiley-Blackwell. Stat is characterised by:
• Speed - a high-quality review process that aims to reach a decision within 20 days of submission.
• Concision - a maximum article length of 10 pages of text, not including references.
• Supporting materials - inclusion of electronic supporting materials including graphs, video, software, data and images.
• Scope - addresses all areas of statistics and interdisciplinary areas.

Stat is a scientific journal for the international community of statisticians and researchers and practitioners in allied quantitative disciplines.