
Latest publications in arXiv: Applications

Finding archetypal patterns for binary questionnaires
Pub Date: 2020-02-28 DOI: 10.2436/20.8080.02.94
Ismael Cabero, I. Epifanio
Archetypal analysis is an exploratory tool that explains a set of observations as mixtures of pure (extreme) patterns. If the patterns are actual observations of the sample, we refer to them as archetypoids. For the first time, we propose to use archetypoid analysis for binary observations. This tool can contribute to the understanding of a binary data set, as in the multivariate case. We illustrate the advantages of the proposed methodology in a simulation study and two applications, one exploring objects (rows) and the other exploring items (columns). One is related to determining student skill set profiles and the other to describing item response functions.
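To make the idea concrete, the following minimal sketch (not the authors' algorithm; helper names such as mixture_weights are purely illustrative) scores one candidate set of archetypoids on a toy binary matrix by expressing every row as an approximate convex mixture of the selected rows. A full archetypoid analysis would additionally search over candidate sets, and the paper's fit criterion for binary data may differ.

    import numpy as np
    from scipy.optimize import nnls

    def mixture_weights(x, Z, penalty=200.0):
        # approximate argmin_a ||x - a Z||^2 subject to a >= 0 and sum(a) = 1,
        # using the standard "large penalty row" trick with non-negative least squares
        A = np.vstack([Z.T, penalty * np.ones(Z.shape[0])])
        b = np.append(x, penalty)
        a, _ = nnls(A, b)
        return a

    def rss_for_archetypoids(X, idx):
        # residual sum of squares when the rows X[idx] play the role of archetypoids
        Z = X[idx]
        return sum(np.sum((x - mixture_weights(x, Z) @ Z) ** 2) for x in X)

    X = (np.random.default_rng(0).random((100, 12)) < 0.3).astype(float)  # toy binary questionnaire
    print(rss_for_archetypoids(X, [0, 1, 2]))  # score one candidate archetypoid set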
Citations: 8
Estimating high-resolution Red Sea surface temperature hotspots, using a low-rank semiparametric spatial model
Pub Date: 2019-12-11 DOI: 10.1214/20-aoas1418
A. Hazra, Raphael Huser
In this work, we estimate extreme sea surface temperature (SST) hotspots, i.e., high threshold exceedance regions, for the Red Sea, a vital region of high biodiversity. We analyze high-resolution satellite-derived SST data comprising daily measurements at 16703 grid cells across the Red Sea over the period 1985-2015. We propose a semiparametric Bayesian spatial mixed-effects linear model with a flexible mean structure to capture spatially-varying trend and seasonality, while the residual spatial variability is modeled through a Dirichlet process mixture (DPM) of low-rank spatial Student-$t$ processes (LTPs). By specifying cluster-specific parameters for each LTP mixture component, the bulk of the SST residuals influence tail inference and hotspot estimation only moderately. Our proposed model has a nonstationary mean, covariance and tail dependence, and posterior inference can be drawn efficiently through Gibbs sampling. In our application, we show that the proposed method outperforms some natural parametric and semiparametric alternatives. Moreover, we show how hotspots can be identified, and we estimate extreme SST hotspots for the whole Red Sea, projected for the year 2100. The estimated 95% credible region for joint high threshold exceedances includes large areas covering major endangered coral reefs in the southern Red Sea.
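As a rough illustration of why Student-$t$ processes are attractive for exceedance modeling, the toy sketch below (not the paper's low-rank DPM construction; the kernel, range and degrees of freedom are arbitrary choices) draws one spatial Student-$t$ field by rescaling a Gaussian-process draw with a single gamma mixing variable, which is what induces the joint tail dependence that a Gaussian field lacks.

    import numpy as np

    def student_t_field(coords, nu=5.0, range_=0.5, seed=None):
        # one spatial Student-t draw: Gaussian process realization divided by
        # the square root of a gamma(nu/2, scale=2/nu) mixing variable
        rng = np.random.default_rng(seed)
        d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        K = np.exp(-(d / range_) ** 2) + 1e-8 * np.eye(len(coords))
        z = rng.multivariate_normal(np.zeros(len(coords)), K)
        g = rng.gamma(nu / 2.0, 2.0 / nu)
        return z / np.sqrt(g)

    coords = np.random.default_rng(1).uniform(size=(200, 2))
    field = student_t_field(coords, seed=2)
    print((field > 2.0).mean())  # fraction of sites jointly exceeding a high threshold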
Citations: 26
Spatial Weibull Regression with Multivariate Log Gamma Process and Its Applications to China Earthquake Economic Loss
Pub Date: 2019-12-08 DOI: 10.4310/21-SII672
Hou‐Cheng Yang, Lijiang Geng, Yishu Xue, Guanyu Hu
Bayesian spatial modeling of heavy-tailed distributions has become increasingly popular in various areas of science in recent decades. We propose a Weibull regression model with spatial random effects for analyzing extreme economic loss. Model estimation is facilitated by a computationally efficient Bayesian sampling algorithm utilizing the multivariate log-gamma distribution. Simulation studies are carried out to demonstrate better empirical performance of the proposed model than the generalized linear mixed effects model. Earthquake data obtained from the Yunnan Seismological Bureau, China, are analyzed. The logarithm of the pseudo-marginal likelihood is computed to select the optimal model, and value-at-risk, expected shortfall, and tail-value-at-risk based on the posterior predictive distribution of the optimal model are calculated under different confidence levels.
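The core likelihood of such a model is compact. The sketch below is purely illustrative (synthetic data, no priors and no MCMC, which is where the paper's multivariate log-gamma construction actually enters): a Weibull log-likelihood with a spatial random effect w added to the log-scale of each observation.

    import numpy as np

    def weibull_loglik(y, X, beta, w, shape_k):
        # y: losses; X: covariates; w: spatial random effect for each observation
        log_scale = X @ beta + w
        lam = np.exp(log_scale)
        return np.sum(np.log(shape_k) - np.log(lam)
                      + (shape_k - 1) * (np.log(y) - np.log(lam))
                      - (y / lam) ** shape_k)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))
    beta = np.array([0.5, -0.2])
    w = rng.normal(scale=0.3, size=50)                 # stand-in for the spatial effect
    y = rng.weibull(1.5, size=50) * np.exp(X @ beta + w)
    print(weibull_loglik(y, X, beta, w, shape_k=1.5))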
Citations: 1
Modeling Variables with a Detection Limit using a Truncated Normal Distribution with Censoring
Pub Date: 2019-11-25 DOI: 10.21203/rs.2.20251/v1
Justin R. Williams, Hyung-Woo Kim, C. Crespi
Background: When data are collected subject to a detection limit, observations below the detection limit may be considered censored. In addition, the domain of such observations may be restricted; for example, values may be required to be non-negative. Methods: We propose a regression method for censored observations that also accounts for domain restriction. The method finds maximum likelihood estimates assuming an underlying truncated normal distribution. Results: We show that our method, tcensReg, outperforms other methods commonly used for data with detection limits, such as Tobit regression and single imputation of the detection limit or half the detection limit, with respect to bias and mean squared error under a range of simulation settings. We apply our method to analyze vision quality data collected from ophthalmology clinical trials comparing different types of intraocular lenses implanted during cataract surgery. All methods tested returned similar conclusions for non-inferiority testing, but estimates from the tcensReg method suggest that there may be greater mean differences and overall variability. Conclusions: In the presence of detection limits, our new method tcensReg provides a way to incorporate known domain restrictions when modeling limited dependent variables, which substantially improves inferences.
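A minimal version of the likelihood behind this approach is easy to write down. The sketch below assumes left truncation at zero, a single known detection limit and no covariates (the tcensReg regression method described above is considerably more general) and maximizes the censored truncated-normal log-likelihood on simulated data.

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import minimize

    def negloglik(theta, y, censored, dl, a=0.0):
        # underlying Y ~ Normal(mu, sigma^2) truncated to (a, infinity);
        # values at or below the detection limit dl are only known to be censored
        mu, log_sigma = theta
        sigma = np.exp(log_sigma)
        denom = norm.sf(a, mu, sigma)                              # P(Y > a)
        ll_obs = norm.logpdf(y[~censored], mu, sigma) - np.log(denom)
        p_cens = (norm.cdf(dl, mu, sigma) - norm.cdf(a, mu, sigma)) / denom
        return -(ll_obs.sum() + censored.sum() * np.log(p_cens))

    rng = np.random.default_rng(0)
    z = rng.normal(1.0, 1.0, size=5000)
    z = z[z > 0.0][:2000]                                          # truncated-at-zero sample
    dl = 0.5
    censored = z <= dl
    y = np.where(censored, dl, z)
    fit = minimize(negloglik, x0=[0.0, 0.0], args=(y, censored, dl))
    print(fit.x[0], np.exp(fit.x[1]))                              # recovered mu and sigma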
Citations: 0
Mixture survival models methodology: an application to cancer immunotherapy assessment in clinical trials
Pub Date: 2019-11-21 DOI: 10.21203/rs.2.18321/v1
L. Sanchez, Patricia Lorenzo-Luaces, C. Fonte, A. Lage
Progress in immunotherapy revolutionized the treatment landscape for advanced lung cancer, raising survival expectations beyond those that were historically anticipated with this disease. In the present study, we describe methods for fitting mixture parametric models of two populations for survival analysis in the presence of long-term survivors. A methodology is proposed in five steps: first, use a multimodality test to decide the number of subpopulations to consider in the model; second, fit simple parametric survival models and mixture distribution models; third, estimate the parameters; fourth, select the model that best fits the data; and finally, test hypotheses to compare the effectiveness of immunotherapies in the context of randomized clinical trials. The methodology is illustrated with data from a clinical trial that evaluates the effectiveness of the therapeutic vaccine CIMAvaxEGF versus the best supportive care for the treatment of advanced lung cancer. The mixture survival model allows estimating the presence of a subpopulation of long-term survivors, estimated at 44% for vaccinated patients. The differences between the treated and control groups were significant in both subpopulations (short-term survival subpopulation: p = 0.001; long-term survival subpopulation: p = 0.0002). For cancer therapies where a proportion of patients achieves long-term control of the disease, the heterogeneity of the population must be taken into account. Mixture parametric models may be more suitable than standard models for detecting the effectiveness of immunotherapies.
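The two-population mixture survival likelihood described here is simple to state. The sketch below is a toy illustration with both components taken to be Weibull for concreteness and with right censoring handled in the usual way; the component distributions and numbers used in the actual trial analysis may differ.

    import numpy as np

    def weibull_surv(t, k, lam):
        return np.exp(-(t / lam) ** k)

    def weibull_pdf(t, k, lam):
        return (k / lam) * (t / lam) ** (k - 1) * weibull_surv(t, k, lam)

    def mixture_loglik(t, event, pi, k1, lam1, k2, lam2):
        # pi: proportion of long-term survivors; event = 1 for an observed death,
        # 0 for right-censored follow-up
        f = pi * weibull_pdf(t, k1, lam1) + (1 - pi) * weibull_pdf(t, k2, lam2)
        S = pi * weibull_surv(t, k1, lam1) + (1 - pi) * weibull_surv(t, k2, lam2)
        return np.sum(event * np.log(f) + (1 - event) * np.log(S))

    rng = np.random.default_rng(0)
    long_term = rng.random(300) < 0.44                        # toy 44% long-survivor fraction
    t_true = np.where(long_term, rng.weibull(1.2, 300) * 60, rng.weibull(1.5, 300) * 10)
    c = rng.uniform(0, 48, 300)                               # administrative censoring times
    t_obs, event = np.minimum(t_true, c), (t_true <= c).astype(int)
    print(mixture_loglik(t_obs, event, 0.44, 1.2, 60, 1.5, 10))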
Citations: 0
The effect of geographic sampling on evaluation of extreme precipitation in high-resolution climate models
Pub Date: 2019-11-12 DOI: 10.5194/ascmo-6-115-2020
M. Risser, M. Wehner
Traditional approaches for comparing global climate models and observational data products typically fail to account for the geographic location of the underlying weather station data. For modern high-resolution models, this is an oversight since there are likely grid cells where the physical output of a climate model is compared with a statistically interpolated quantity instead of actual measurements of the climate system. In this paper, we quantify the impact of geographic sampling on the relative performance of high resolution climate models' representation of precipitation extremes in Boreal winter (DJF) over the contiguous United States (CONUS), comparing model output from five early submissions to the HighResMIP subproject of the CMIP6 experiment. We find that properly accounting for the geographic sampling of weather stations can significantly change the assessment of model performance. Across the models considered, failing to account for sampling impacts the different metrics (extreme bias, spatial pattern correlation, and spatial variability) in different ways (both increasing and decreasing). We argue that the geographic sampling of weather stations should be accounted for in order to yield a more straightforward and appropriate comparison between models and observational data sets, particularly for high resolution models. While we focus on the CONUS in this paper, our results have important implications for other global land regions where the sampling problem is more severe.
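The sampling effect being quantified can be illustrated in a few lines. The snippet below is purely synthetic (a toy field that is wetter toward one edge of the domain and a hypothetical station network that is denser in the drier half, not the CONUS data): the same extreme-value summary computed over all grid cells versus only over cells containing stations gives different answers, which is the discrepancy the paper argues must be accounted for.

    import numpy as np

    rng = np.random.default_rng(0)
    scale = np.linspace(5.0, 25.0, 60)[:, None]               # field is wetter toward one edge
    field = rng.gamma(2.0, scale, size=(60, 120))             # toy winter precipitation field
    density = np.where(scale < 15.0, 0.08, 0.02)              # stations cluster in the drier half
    station_mask = rng.random(field.shape) < density

    full_grid_p95 = np.percentile(field, 95)                  # evaluate on every grid cell
    station_p95 = np.percentile(field[station_mask], 95)      # evaluate only where stations exist
    print(full_grid_p95, station_p95)                         # geographic sampling shifts the metric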
Citations: 2
Generalized Mixed Modeling in Massive Electronic Health Record Databases: What is a Healthy Serum Potassium?
Pub Date: 2019-10-17 DOI: 10.21203/rs.3.rs-245946/v1
Cristian-Sorin Bologa, V. Pankratz, M. Unruh, M. Roumelioti, V. Shah, S. K. Shaffi, Soraya Arzhan, John Cook, C. Argyropoulos
Converting electronic health record (EHR) entries to useful clinical inferences requires one to address computational challenges due to the large number of repeated observations in individual patients. Unfortunately, the libraries of major statistical environments which implement Generalized Linear Mixed Models (GLMMs) for such analyses have been shown to scale poorly in big datasets. The major computational bottleneck concerns the numerical evaluation of multivariable integrals, which even for the simplest EHR analyses may involve hundreds of thousands or millions of dimensions (one for each patient). The Laplace Approximation (LA) plays a major role in the development of the theory of GLMMs and it can approximate integrals in high dimensions with acceptable accuracy. We thus examined the scalability of Laplace-based calculations for GLMMs. To do so we coded GLMMs in the R package TMB. TMB numerically optimizes complex likelihood expressions in a parallelizable manner by combining the LA with algorithmic differentiation (AD). We report on the feasibility of this approach to support clinical inferences in the HyperKalemia Benchmark Problem (HKBP). In the HKBP we associate potassium levels and their trajectories over time with survival in all patients in the Cerner Health Facts EHR database. Analyzing the HKBP requires the evaluation of an integral in over 10 million dimensions. The scale of this problem puts it far beyond the reach of methodologies currently available. The major clinical inferences in this problem are the establishment of a population response curve that relates the potassium level to mortality and an estimate of the variability of individual risk in the population. Based on our experience with the HKBP we conclude that the combination of the LA and AD offers a computationally efficient approach for the analysis of big repeated measures data with GLMMs.
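The Laplace approximation at the heart of this approach is easy to state for a single cluster. The sketch below is a toy one-dimensional version for a random-intercept logistic model with a finite-difference Hessian, not the TMB/automatic-differentiation machinery the paper relies on: the integral over the cluster effect u is replaced by a Gaussian approximation around its mode.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def neg_joint(u, y, eta_fixed, tau):
        # -log p(y, u): Bernoulli log-likelihood plus a Normal(0, tau^2) prior on u
        eta = eta_fixed + u
        return (np.sum(np.log1p(np.exp(eta)) - y * eta)
                + 0.5 * u ** 2 / tau ** 2 + 0.5 * np.log(2 * np.pi * tau ** 2))

    def laplace_marginal_loglik(y, eta_fixed, tau, h=1e-4):
        # log of integral exp(-f(u)) du ~ -f(u_hat) + 0.5*log(2*pi) - 0.5*log f''(u_hat)
        res = minimize_scalar(neg_joint, args=(y, eta_fixed, tau))
        u_hat, f_hat = res.x, res.fun
        hess = (neg_joint(u_hat + h, y, eta_fixed, tau)
                - 2 * f_hat + neg_joint(u_hat - h, y, eta_fixed, tau)) / h ** 2
        return -f_hat + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(hess)

    y = np.array([1, 0, 1, 1, 0])
    eta_fixed = np.repeat(0.2, 5)            # fixed-effect part of the linear predictor
    print(laplace_marginal_loglik(y, eta_fixed, tau=1.0))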
Citations: 2
Nonparametric Bayesian multiarmed bandits for single-cell experiment design
Pub Date: 2019-10-11 DOI: 10.1214/20-aoas1370
F. Camerlenghi, Bianca Dumitrascu, F. Ferrari, B. Engelhardt, S. Favaro
The problem of maximizing cell type discovery under budget constraints is a fundamental challenge in the collection and the analysis of single-cell RNA-sequencing (scRNA-seq) data. In this paper, we introduce a simple, computationally efficient, and scalable Bayesian nonparametric sequential approach to optimize the budget allocation when designing a large scale collection of scRNA-seq data for the purpose of, but not limited to, creating cell atlases. Our approach relies on i) a hierarchical Pitman-Yor prior that recapitulates biological assumptions regarding cellular differentiation, and ii) a Thompson sampling multi-armed bandit strategy that balances exploitation and exploration to prioritize experiments across a sequence of trials. Posterior inference is performed using a sequential Monte Carlo approach, which allows us to fully exploit the sequential nature of our species sampling problem. We empirically show that our approach outperforms state-of-the-art methods and achieves near-Oracle performance on simulated and real data alike. HPY-TS code is available at this https URL.
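The Thompson sampling loop itself is short. The sketch below uses a simple Beta-Bernoulli surrogate instead of the paper's hierarchical Pitman-Yor prior, with fixed hypothetical discovery probabilities per tissue: in each round one posterior draw is taken per arm, the arm with the largest draw is sequenced next, and its posterior is updated according to whether a previously unseen cell type was found.

    import numpy as np

    rng = np.random.default_rng(0)
    true_discovery_prob = np.array([0.30, 0.10, 0.05])  # hypothetical tissues, held fixed in this toy
    alpha = np.ones(3)
    beta = np.ones(3)                                    # Beta(1, 1) posterior for each arm

    for _ in range(500):
        arm = int(np.argmax(rng.beta(alpha, beta)))      # one posterior draw per arm
        reward = float(rng.random() < true_discovery_prob[arm])  # new cell type discovered?
        alpha[arm] += reward
        beta[arm] += 1.0 - reward

    print(alpha / (alpha + beta))                        # posterior mean discovery rate per arm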
Citations: 7
Late 19th century navigational uncertainties and their influence on sea surface temperature estimates
Pub Date: 2019-10-10 DOI: 10.1214/20-AOAS1367
Chenguang Dai, Duo Chan, P. Huybers, Natesh L. Pillai
Accurate estimates of historical changes in sea surface temperatures (SSTs) and their uncertainties are important for documenting and understanding historical changes in climate. A source of uncertainty that has not previously been quantified in historical SST estimates stems from position errors. A Bayesian inference framework is proposed for quantifying errors in reported positions and their implications for SST estimates. The analysis framework is applied to data from ICOADS3.0 in 1885, a time when astronomical and chronometer estimation of position was common, but predating the use of radio signals. Focus is upon a subset of 943 ship tracks from ICOADS3.0 that report their position every two hours to a precision of 0.01° longitude and latitude. These data are interpreted as positions determined by dead reckoning that are periodically updated by celestial correction techniques. The 90% posterior probability intervals for two-hourly dead reckoning uncertainties are (9.90%, 11.10%) for ship speed and (10.43°, 11.92°) for ship heading, leading to position uncertainties that average 0.29° (32 km on the equator) in longitude and 0.20° (22 km) in latitude. Reported ship tracks also contain systematic position uncertainties relating to precursor dead-reckoning positions not being updated after obtaining celestial position estimates, indicating that more accurate positions can be provided for SST observations. Finally, we translate position errors into SST uncertainties by sampling an ensemble of SSTs from MURSST data set. Evolving technology for determining ship position, heterogeneous reporting and archiving of position information, and seasonal and spatial changes in navigational uncertainty and SST gradients together imply that accounting for positional error in SST estimates over the span of the instrumental record will require substantial additional effort.
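The mechanics of dead-reckoning error growth can be shown with a small forward simulation. The snippet below is not the paper's Bayesian inversion; it simply perturbs speed by roughly 10% and heading by roughly 11 degrees on each two-hour leg, values in line with the posterior intervals quoted above, and accumulates the legs over a day to show how position uncertainty of a few tenths of a degree builds up between celestial fixes.

    import numpy as np

    def dead_reckon(speeds_kn, headings_deg, rng, speed_sd=0.10, heading_sd=11.0):
        # accumulate two-hour legs with multiplicative speed error (~10%) and
        # additive heading error (~11 degrees), ignoring latitude convergence
        lon = lat = 0.0
        for v, h in zip(speeds_kn, headings_deg):
            v_err = v * (1 + rng.normal(0, speed_sd))
            h_err = np.deg2rad(h + rng.normal(0, heading_sd))
            dist_deg = v_err * 2 / 60.0          # 2 h at v knots; 60 nautical miles per degree
            lon += dist_deg * np.sin(h_err)
            lat += dist_deg * np.cos(h_err)
        return lon, lat

    rng = np.random.default_rng(0)
    ends = np.array([dead_reckon([8] * 12, [45] * 12, rng) for _ in range(1000)])
    print(ends.std(axis=0))                      # longitude/latitude spread after a day of legs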
Citations: 2
Covariance Matrix Estimation under Total Positivity for Portfolio Selection
Pub Date: 2019-09-10 DOI: 10.1093/jjfinec/nbaa018
Raj Agrawal, Uma Roy, Caroline Uhler
Selecting the optimal Markowitz portfolio depends on estimating the covariance matrix of the returns of $N$ assets from $T$ periods of historical data. Problematically, $N$ is typically of the same order as $T$, which makes the sample covariance matrix estimator perform poorly, both empirically and theoretically. While various other general-purpose covariance matrix estimators have been introduced in the financial economics and statistics literature for dealing with the high dimensionality of this problem, here we propose an estimator that exploits the fact that assets are typically positively dependent. This is achieved by imposing that the joint distribution of returns be multivariate totally positive of order 2 ($\text{MTP}_2$). This constraint on the covariance matrix not only enforces positive dependence among the assets, but also regularizes the covariance matrix, leading to desirable statistical properties such as sparsity. Based on stock-market data spanning over thirty years, we show that estimating the covariance matrix under $\text{MTP}_2$ outperforms previous state-of-the-art methods, including shrinkage estimators and factor models.
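For Gaussian returns, the $\text{MTP}_2$ constraint is equivalent to requiring the precision (inverse covariance) matrix to be an M-matrix, i.e., to have non-positive off-diagonal entries. The sketch below is a crude projected-gradient demo of the resulting constrained maximum-likelihood problem on a toy three-asset example, not the algorithm used in the paper; the small step size keeps the iterate positive definite in this toy setting.

    import numpy as np

    def mtp2_precision_mle(S, n_iter=3000, step=0.005):
        # maximize log det K - trace(S K) subject to K[i, j] <= 0 for i != j
        p = S.shape[0]
        K = np.linalg.inv(np.diag(np.diag(S)))           # feasible diagonal starting point
        off = ~np.eye(p, dtype=bool)
        for _ in range(n_iter):
            grad = np.linalg.inv(K) - S                  # gradient of the objective
            K = K + step * grad
            K[off] = np.minimum(K[off], 0.0)             # project onto the MTP2 constraint
            K = (K + K.T) / 2
        return K

    rng = np.random.default_rng(0)
    Sigma = np.array([[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]])
    X = rng.multivariate_normal(np.zeros(3), Sigma, size=400)
    S = np.cov(X, rowvar=False)
    print(np.round(mtp2_precision_mle(S), 3))            # estimated precision, off-diagonals <= 0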
Citations: 31