Cristian-Sorin Bologa, V. Pankratz, M. Unruh, M. Roumelioti, V. Shah, S. K. Shaffi, Soraya Arzhan, John Cook, C. Argyropoulos
{"title":"大规模电子健康记录数据库的广义混合建模:什么是健康的血清钾?","authors":"Cristian-Sorin Bologa, V. Pankratz, M. Unruh, M. Roumelioti, V. Shah, S. K. Shaffi, Soraya Arzhan, John Cook, C. Argyropoulos","doi":"10.21203/rs.3.rs-245946/v1","DOIUrl":null,"url":null,"abstract":"Converting electronic health record (EHR) entries to useful clinical inferences requires one to address computational challenges due to the large number of repeated observations in individual patients. Unfortunately, the libraries of major statistical environments which implement Generalized Linear Mixed Models for such analyses have been shown to scale poorly in big datasets. The major computational bottleneck concerns the numerical evaluation of multivariable integrals, which even for the simplest EHR analyses may involve hundreds of thousands or millions of dimensions (one for each patient). The Laplace Approximation (LA) plays a major role in the development of the theory of GLMMs and it can approximate integrals in high dimensions with acceptable accuracy. We thus examined the scalability of Laplace based calculations for GLMMs. To do so we coded GLMMs in the R package TMB. TMB numerically optimizes complex likelihood expressions in a parallelizable manner by combining the LA with algorithmic differentiation (AD). We report on the feasibility of this approach to support clinical inferences in the HyperKalemia Benchmark Problem (HKBP). In the HKBP we associate potassium levels and their trajectories over time with survival in all patients in the Cerner Health Facts EHR database. Analyzing the HKBP requires the evaluation of an integral in over 10 million dimensions. The scale of this problem puts far beyond the reach of methodologies currently available. The major clinical inferences in this problem is the establishment of a population response curve that relates the potassium level with mortality, and an estimate of the variability of individual risk in the population. 
Based on our experience on the HKBP we conclude that the combination of the LA and AD offers a computationally efficient approach for the analysis of big repeated measures data with GLMMs.","PeriodicalId":409996,"journal":{"name":"arXiv: Applications","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Generalized Mixed Modeling in Massive Electronic Health Record Databases: What is a Healthy Serum Potassium?\",\"authors\":\"Cristian-Sorin Bologa, V. Pankratz, M. Unruh, M. Roumelioti, V. Shah, S. K. Shaffi, Soraya Arzhan, John Cook, C. Argyropoulos\",\"doi\":\"10.21203/rs.3.rs-245946/v1\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Converting electronic health record (EHR) entries to useful clinical inferences requires one to address computational challenges due to the large number of repeated observations in individual patients. Unfortunately, the libraries of major statistical environments which implement Generalized Linear Mixed Models for such analyses have been shown to scale poorly in big datasets. The major computational bottleneck concerns the numerical evaluation of multivariable integrals, which even for the simplest EHR analyses may involve hundreds of thousands or millions of dimensions (one for each patient). The Laplace Approximation (LA) plays a major role in the development of the theory of GLMMs and it can approximate integrals in high dimensions with acceptable accuracy. We thus examined the scalability of Laplace based calculations for GLMMs. To do so we coded GLMMs in the R package TMB. TMB numerically optimizes complex likelihood expressions in a parallelizable manner by combining the LA with algorithmic differentiation (AD). We report on the feasibility of this approach to support clinical inferences in the HyperKalemia Benchmark Problem (HKBP). 
In the HKBP we associate potassium levels and their trajectories over time with survival in all patients in the Cerner Health Facts EHR database. Analyzing the HKBP requires the evaluation of an integral in over 10 million dimensions. The scale of this problem puts far beyond the reach of methodologies currently available. The major clinical inferences in this problem is the establishment of a population response curve that relates the potassium level with mortality, and an estimate of the variability of individual risk in the population. Based on our experience on the HKBP we conclude that the combination of the LA and AD offers a computationally efficient approach for the analysis of big repeated measures data with GLMMs.\",\"PeriodicalId\":409996,\"journal\":{\"name\":\"arXiv: Applications\",\"volume\":\"5 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-10-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv: Applications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.21203/rs.3.rs-245946/v1\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv: Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21203/rs.3.rs-245946/v1","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2
摘要
将电子健康记录(EHR)条目转换为有用的临床推断需要解决由于个体患者中大量重复观察而导致的计算挑战。不幸的是,用于此类分析的实现广义线性混合模型的主要统计环境库在大数据集中的可扩展性很差。主要的计算瓶颈涉及多变量积分的数值计算,即使是最简单的EHR分析也可能涉及数十万或数百万个维度(每个患者一个)。拉普拉斯近似(LA)在glmm理论的发展中起着重要的作用,它可以以可接受的精度逼近高维积分。因此,我们检查了基于拉普拉斯计算的glmm的可扩展性。为此,我们在R包TMB中编码了glmm。TMB通过将LA与算法微分(AD)相结合,以并行化的方式对复杂似然表达式进行数值优化。我们报告了这种方法的可行性,以支持高钾血症基准问题(HKBP)的临床推论。在HKBP中,我们将Cerner Health Facts EHR数据库中所有患者的钾水平及其随时间的轨迹与生存率联系起来。分析HKBP需要计算超过1000万个维度的积分。这个问题的规模远远超出了现有方法的范围。这个问题的主要临床推论是建立了一个人群反应曲线,将钾水平与死亡率联系起来,并估计了人群中个体风险的可变性。根据我们在HKBP上的经验,我们得出结论,LA和AD的结合为使用glmm分析大重复测量数据提供了一种计算效率高的方法。
Generalized Mixed Modeling in Massive Electronic Health Record Databases: What is a Healthy Serum Potassium?
Converting electronic health record (EHR) entries into useful clinical inferences requires one to address computational challenges arising from the large number of repeated observations in individual patients. Unfortunately, the libraries of major statistical environments that implement Generalized Linear Mixed Models (GLMMs) for such analyses have been shown to scale poorly on big datasets. The major computational bottleneck concerns the numerical evaluation of multivariable integrals, which even for the simplest EHR analyses may involve hundreds of thousands or millions of dimensions (one for each patient). The Laplace Approximation (LA) plays a major role in the development of the theory of GLMMs, and it can approximate integrals in high dimensions with acceptable accuracy. We thus examined the scalability of Laplace-based calculations for GLMMs. To do so, we coded GLMMs in the R package TMB. TMB numerically optimizes complex likelihood expressions in a parallelizable manner by combining the LA with algorithmic differentiation (AD). We report on the feasibility of this approach to support clinical inferences in the HyperKalemia Benchmark Problem (HKBP). In the HKBP we associate potassium levels, and their trajectories over time, with survival in all patients in the Cerner Health Facts EHR database. Analyzing the HKBP requires the evaluation of an integral in over 10 million dimensions. The scale of this problem puts it far beyond the reach of currently available methodologies. The major clinical inferences in this problem are the establishment of a population response curve relating the potassium level to mortality, and an estimate of the variability of individual risk in the population. Based on our experience with the HKBP, we conclude that the combination of the LA and AD offers a computationally efficient approach for the analysis of big repeated-measures data with GLMMs.
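The core computational idea described in the abstract — replacing each patient's intractable random-effect integral with a Gaussian approximation centered at its mode — can be illustrated outside of TMB. Below is a minimal sketch (not the authors' implementation) of the Laplace Approximation for a single patient's random-intercept integral, assuming a hypothetical Poisson GLMM with a fixed linear predictor `mu` and random-intercept standard deviation `sigma`; it checks the approximation against direct numerical quadrature.

```python
# Minimal sketch of the Laplace Approximation (LA) for one patient's
# random-intercept integral in a Poisson GLMM. All model choices here
# (Poisson response, log link, values of mu and sigma) are illustrative
# assumptions, not the authors' HKBP model.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.integrate import quad

def neg_log_joint(b, y, mu, sigma):
    # -log[ p(y | b) * N(b; 0, sigma^2) ] for Poisson counts y with
    # log-link linear predictor mu + b (random intercept b).
    # The constant log(y!) term is dropped; it cancels in the comparison.
    eta = mu + b
    loglik = np.sum(y * eta - np.exp(eta))
    logprior = -0.5 * b**2 / sigma**2 - 0.5 * np.log(2 * np.pi * sigma**2)
    return -(loglik + logprior)

def laplace_marginal(y, mu, sigma):
    # Step 1: find the mode b_hat of the integrand (inner optimization).
    res = minimize_scalar(neg_log_joint, args=(y, mu, sigma))
    b_hat = res.x
    # Step 2: curvature (second derivative) at the mode, here by finite
    # differences; TMB would obtain it exactly via algorithmic differentiation.
    h = 1e-4
    hess = (neg_log_joint(b_hat + h, y, mu, sigma)
            - 2.0 * neg_log_joint(b_hat, y, mu, sigma)
            + neg_log_joint(b_hat - h, y, mu, sigma)) / h**2
    # Step 3: LA: integral ~= exp(-f(b_hat)) * sqrt(2*pi / f''(b_hat)).
    return np.exp(-res.fun) * np.sqrt(2.0 * np.pi / hess)

y = np.array([2, 3, 1, 4])          # one patient's repeated counts (made up)
mu, sigma = 1.0, 0.5
la = laplace_marginal(y, mu, sigma)
exact, _ = quad(lambda b: np.exp(-neg_log_joint(b, y, mu, sigma)), -10, 10)
print(la, exact)  # the two values agree to within a few percent
```

In a GLMM the marginal likelihood is a product of such integrals, one per patient; TMB evaluates all of them jointly, using AD for the inner optimization and Hessian so that the calculation stays feasible even when, as in the HKBP, the integral has millions of dimensions.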