Finding archetypal patterns for binary questionnaires. Ismael Cabero, I. Epifanio. arXiv: Applications, 2020-02-28. DOI: 10.2436/20.8080.02.94
Archetypal analysis is an exploratory tool that explains a set of observations as mixtures of pure (extreme) patterns. If the patterns are actual observations of the sample, we refer to them as archetypoids. For the first time, we propose to use archetypoid analysis for binary observations. This tool can contribute to the understanding of a binary data set, as in the multivariate case. We illustrate the advantages of the proposed methodology in a simulation study and in two applications, one exploring objects (rows) and the other exploring items (columns): the first determines student skill-set profiles and the second describes item response functions.
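For readers unfamiliar with the underlying idea, the sketch below expresses each row of a toy binary questionnaire as a convex mixture of a few rows chosen as candidate archetypoids and compares candidate sets by fit. The squared-error loss, the hand-picked candidate triples, and all names are illustrative assumptions; the paper's method for binary data uses its own loss and search strategy.

```python
import numpy as np
from scipy.optimize import minimize

def mixture_weights(x, Z):
    """Best convex combination of the rows of Z approximating the observation x."""
    k = Z.shape[0]
    res = minimize(lambda a: np.sum((x - a @ Z) ** 2),
                   np.full(k, 1.0 / k),
                   bounds=[(0.0, 1.0)] * k,
                   constraints=({"type": "eq", "fun": lambda a: a.sum() - 1.0},),
                   method="SLSQP")
    return res.x

def fit_error(X, idx):
    """Total squared error when the rows X[idx] play the role of archetypoids."""
    Z = X[list(idx)]
    return sum(np.sum((x - mixture_weights(x, Z) @ Z) ** 2) for x in X)

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(40, 6)).astype(float)   # toy binary questionnaire: 40 respondents, 6 items
candidates = [(0, 1, 2), (3, 7, 11), (5, 19, 33)]    # a few hand-picked triples of rows as candidate archetypoids
best = min(candidates, key=lambda idx: fit_error(X, idx))
print("best candidate archetypoid rows:", best)
```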
Estimating high-resolution Red Sea surface temperature hotspots, using a low-rank semiparametric spatial model. A. Hazra, Raphael Huser. arXiv: Applications, 2019-12-11. DOI: 10.1214/20-aoas1418
In this work, we estimate extreme sea surface temperature (SST) hotspots, i.e., high threshold exceedance regions, for the Red Sea, a vital region of high biodiversity. We analyze high-resolution satellite-derived SST data comprising daily measurements at 16,703 grid cells across the Red Sea over the period 1985-2015. We propose a semiparametric Bayesian spatial mixed-effects linear model with a flexible mean structure to capture spatially varying trend and seasonality, while the residual spatial variability is modeled through a Dirichlet process mixture (DPM) of low-rank spatial Student-$t$ processes (LTPs). By specifying cluster-specific parameters for each LTP mixture component, the bulk of the SST residuals influences tail inference and hotspot estimation only moderately. Our proposed model has a nonstationary mean, covariance and tail dependence, and posterior inference can be drawn efficiently through Gibbs sampling. In our application, we show that the proposed method outperforms some natural parametric and semiparametric alternatives. Moreover, we show how hotspots can be identified, and we estimate extreme SST hotspots for the whole Red Sea, projected for the year 2100. The estimated 95% credible region for joint high threshold exceedances includes large areas covering major endangered coral reefs in the southern Red Sea.
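As a rough illustration of the final hotspot step only (not the DPM-of-LTPs model itself), the sketch below flags grid cells whose Monte Carlo exceedance probability, computed from posterior predictive draws, reaches a chosen cutoff; the draws, threshold, and 0.95 cutoff are made-up stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
cell_mean = rng.uniform(29.0, 36.0, size=500)               # toy long-term mean SST per grid cell (deg C)
draws = cell_mean + rng.standard_t(df=5, size=(2000, 500))  # toy posterior predictive draws: 2000 samples x 500 cells
u = 33.0                                                    # illustrative high SST threshold
exceed_prob = (draws > u).mean(axis=0)                      # Monte Carlo estimate of P(SST > u) for each cell
hotspot_cells = np.flatnonzero(exceed_prob >= 0.95)         # cells declared part of the hotspot
print(f"{hotspot_cells.size} of {draws.shape[1]} cells flagged as hotspot")
```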
Spatial Weibull Regression with Multivariate Log Gamma Process and Its Applications to China Earthquake Economic Loss. Hou-Cheng Yang, Lijiang Geng, Yishu Xue, Guanyu Hu. arXiv: Applications, 2019-12-08. DOI: 10.4310/21-SII672
Bayesian spatial modeling of heavy-tailed distributions has become increasingly popular in various areas of science in recent decades. We propose a Weibull regression model with spatial random effects for analyzing extreme economic loss. Model estimation is facilitated by a computationally efficient Bayesian sampling algorithm utilizing the multivariate log-gamma distribution. Simulation studies demonstrate that the proposed model empirically outperforms the generalized linear mixed-effects model. Earthquake data obtained from the Yunnan Seismological Bureau, China, are analyzed. Logarithms of the pseudo-marginal likelihood values are computed to select the optimal model, and value-at-risk, expected shortfall, and tail-value-at-risk based on the posterior predictive distribution of the optimal model are calculated at different confidence levels.
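The tail risk measures named in the abstract can be read off posterior predictive draws of loss by simple Monte Carlo; the sketch below does this for toy Weibull draws that merely stand in for the model's actual posterior predictive distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
loss = 50.0 * rng.weibull(a=0.8, size=100_000)    # toy heavy-tailed predictive losses (Weibull shape 0.8, scale 50)

def risk_measures(loss, level):
    var = np.quantile(loss, level)                # value-at-risk at the given confidence level
    es = loss[loss > var].mean()                  # expected shortfall; equals tail-value-at-risk for continuous losses
    return var, es

for level in (0.90, 0.95, 0.99):
    var, es = risk_measures(loss, level)
    print(f"level {level}: VaR = {var:.1f}, ES/TVaR = {es:.1f}")
```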
Modeling Variables with a Detection Limit using a Truncated Normal Distribution with Censoring. Justin R. Williams, Hyung-Woo Kim, C. Crespi. arXiv: Applications, 2019-11-25. DOI: 10.21203/rs.2.20251/v1
Background: When data are collected subject to a detection limit, observations below the detection limit may be considered censored. In addition, the domain of such observations may be restricted; for example, values may be required to be non-negative. Methods: We propose a regression method for censored observations that also accounts for domain restriction. The method finds maximum likelihood estimates assuming an underlying truncated normal distribution. Results: We show that our method, tcensReg, outperforms other methods commonly used for data with detection limits, such as Tobit regression and single imputation of the detection limit or half the detection limit, with respect to bias and mean squared error under a range of simulation settings. We apply our method to analyze vision quality data collected from ophthalmology clinical trials comparing different types of intraocular lenses implanted during cataract surgery. All methods tested returned similar conclusions for non-inferiority testing, but estimates from the tcensReg method suggest that there may be greater mean differences and overall variability. Conclusions: In the presence of detection limits, our new method tcensReg provides a way to incorporate known domain restrictions when modeling limited dependent variables, substantially improving inferences.
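A minimal sketch of the likelihood the abstract describes, assuming an intercept-only model: a normal distribution truncated below at a domain bound a, with values below a detection limit dl reported only as censored. This is not the tcensReg package itself; the function and data names are invented for illustration.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def negloglik(theta, y, censored, a, dl):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    log_trunc = np.log1p(-norm.cdf((a - mu) / sigma))              # log P(Y > a): truncation normaliser
    ll_obs = (norm.logpdf(y[~censored], mu, sigma) - log_trunc).sum()
    p_cens = norm.cdf((dl - mu) / sigma) - norm.cdf((a - mu) / sigma)  # P(a < Y < dl), before truncation
    ll_cen = censored.sum() * (np.log(max(p_cens, 1e-300)) - log_trunc)
    return -(ll_obs + ll_cen)

# toy data: normal(2, 1.5) restricted to be positive, with a detection limit at 1.0
rng = np.random.default_rng(3)
raw = rng.normal(2.0, 1.5, size=5000)
y = raw[raw > 0.0]                                   # domain restriction: non-negative values only
censored = y < 1.0
y_obs = np.where(censored, 1.0, y)                   # censored values are recorded at the detection limit
fit = minimize(negloglik, x0=[1.0, 0.0], args=(y_obs, censored, 0.0, 1.0), method="Nelder-Mead")
print("estimated mean:", fit.x[0], "estimated sd:", np.exp(fit.x[1]))
```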
Mixture survival models methodology: an application to cancer immunotherapy assessment in clinical trials. L. Sanchez, Patricia Lorenzo-Luaces, C. Fonte, A. Lage. arXiv: Applications, 2019-11-21. DOI: 10.21203/rs.2.18321/v1
Progress in immunotherapy has revolutionized the treatment landscape for advanced lung cancer, raising survival expectations beyond those historically anticipated with this disease. In the present study, we describe methods for fitting two-population parametric mixture models for survival analysis in the presence of long survivors. The methodology proceeds in several steps: first, a multimodality test is used to decide the number of subpopulations to include in the model; second, simple parametric survival models and mixture distribution models are fitted, their parameters estimated, and the model that best fits the data selected; finally, hypotheses are tested to compare the effectiveness of immunotherapies in the context of randomized clinical trials. The methodology is illustrated with data from a clinical trial that evaluates the effectiveness of the therapeutic vaccine CIMAvaxEGF versus best supportive care for the treatment of advanced lung cancer. The mixture survival model estimates a subpopulation of long survivors comprising 44% of vaccinated patients. The differences between the treated and control groups were significant in both subpopulations (short-term survival subpopulation: p = 0.001; long-term survival subpopulation: p = 0.0002). For cancer therapies where a proportion of patients achieves long-term control of the disease, the heterogeneity of the population must be taken into account. Mixture parametric models may be more suitable than standard models for detecting the effectiveness of immunotherapies.
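A hedged sketch of a two-component parametric mixture survival model with right censoring, using exponential components for brevity (the paper considers richer parametric families): the overall survival function is S(t) = p*S1(t) + (1 - p)*S2(t), fitted by maximum likelihood. All data and starting values are toy choices.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def neg_loglik(theta, t, event):
    p = expit(theta[0])                        # mixing weight, mapped into (0, 1)
    lam1, lam2 = np.exp(theta[1:])             # hazards of the two subpopulations
    surv = p * np.exp(-lam1 * t) + (1 - p) * np.exp(-lam2 * t)
    dens = p * lam1 * np.exp(-lam1 * t) + (1 - p) * lam2 * np.exp(-lam2 * t)
    return -np.sum(np.where(event == 1, np.log(dens), np.log(surv)))  # events use the density, censored use survival

rng = np.random.default_rng(4)
long_survivor = rng.random(400) < 0.4                              # toy truth: 40% long-term survivors
t_true = np.where(long_survivor, rng.exponential(60, 400), rng.exponential(8, 400))
cens = rng.exponential(50, 400)
t, event = np.minimum(t_true, cens), (t_true <= cens).astype(int)  # right-censored follow-up

fit = minimize(neg_loglik, x0=[0.0, np.log(0.1), np.log(0.02)], args=(t, event), method="Nelder-Mead")
p_hat, lam_hat = expit(fit.x[0]), np.exp(fit.x[1:3])
print("mixing weight:", p_hat, "component hazards:", lam_hat)      # the smaller hazard marks the long-survivor group
```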
The effect of geographic sampling on evaluation of extreme precipitation in high-resolution climate models. M. Risser, M. Wehner. arXiv: Applications, 2019-11-12. DOI: 10.5194/ascmo-6-115-2020
Traditional approaches for comparing global climate models and observational data products typically fail to account for the geographic location of the underlying weather station data. For modern high-resolution models, this is an oversight, since there are likely grid cells where the physical output of a climate model is compared with a statistically interpolated quantity instead of actual measurements of the climate system. In this paper, we quantify the impact of geographic sampling on the relative performance of high-resolution climate models' representation of precipitation extremes in boreal winter (DJF) over the contiguous United States (CONUS), comparing model output from five early submissions to the HighResMIP subproject of the CMIP6 experiment. We find that properly accounting for the geographic sampling of weather stations can significantly change the assessment of model performance. Across the models considered, failing to account for sampling affects the different metrics (extreme bias, spatial pattern correlation, and spatial variability) in different ways (both increasing and decreasing them). We argue that the geographic sampling of weather stations should be accounted for in order to yield a more straightforward and appropriate comparison between models and observational data sets, particularly for high-resolution models. While we focus on the CONUS in this paper, our results have important implications for other global land regions where the sampling problem is more severe.
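The core computation can be illustrated in a few lines: evaluate a model-versus-"observations" metric once over every grid cell and once over only the cells that contain weather stations. The fields, station placement, and bias metric below are made up purely to show the mechanics, not to reproduce the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(5)
truth = rng.gamma(shape=2.0, scale=10.0, size=(50, 100))               # toy "observed" extreme-precipitation field
model = truth * rng.normal(1.1, 0.15, size=truth.shape)                # toy model output with a wet bias
station_prob = np.where(truth < np.quantile(truth, 0.5), 0.15, 0.02)   # stations cluster in the drier cells
has_station = rng.random(truth.shape) < station_prob

bias_all = (model - truth).mean()                    # naive evaluation: every grid cell counts equally
bias_sampled = (model - truth)[has_station].mean()   # evaluation restricted to cells that hold a station
print(f"bias over all cells: {bias_all:.2f}   bias over station cells: {bias_sampled:.2f}")
```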
Generalized Mixed Modeling in Massive Electronic Health Record Databases: What is a Healthy Serum Potassium? Cristian-Sorin Bologa, V. Pankratz, M. Unruh, M. Roumelioti, V. Shah, S. K. Shaffi, Soraya Arzhan, John Cook, C. Argyropoulos. arXiv: Applications, 2019-10-17. DOI: 10.21203/rs.3.rs-245946/v1
Converting electronic health record (EHR) entries into useful clinical inferences requires addressing computational challenges due to the large number of repeated observations on individual patients. Unfortunately, the libraries of major statistical environments that implement generalized linear mixed models (GLMMs) for such analyses have been shown to scale poorly on big datasets. The major computational bottleneck concerns the numerical evaluation of multivariable integrals, which even for the simplest EHR analyses may involve hundreds of thousands or millions of dimensions (one for each patient). The Laplace approximation (LA) plays a major role in the development of the theory of GLMMs, and it can approximate integrals in high dimensions with acceptable accuracy. We thus examined the scalability of Laplace-based calculations for GLMMs. To do so, we coded the GLMMs in the R package TMB, which numerically optimizes complex likelihood expressions in a parallelizable manner by combining the LA with algorithmic differentiation (AD). We report on the feasibility of this approach to support clinical inferences in the Hyperkalemia Benchmark Problem (HKBP), in which we associate potassium levels and their trajectories over time with survival for all patients in the Cerner Health Facts EHR database. Analyzing the HKBP requires the evaluation of an integral in over 10 million dimensions, a scale far beyond the reach of currently available methodologies. The major clinical inferences in this problem are the establishment of a population response curve relating potassium level to mortality and an estimate of the variability of individual risk in the population. Based on our experience with the HKBP, we conclude that the combination of the LA and AD offers a computationally efficient approach for the analysis of big repeated-measures data with GLMMs.
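To make the role of the Laplace approximation concrete, the sketch below approximates the per-subject integral of a random-intercept logistic model, replacing the integral of exp(h(b)) over b by exp(h(b*)) * sqrt(2*pi / (-h''(b*))) at the mode b*. It is illustrative Python with invented names and toy data, not the TMB/R implementation used in the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def laplace_subject_loglik(y, x, beta, sigma_b):
    """Laplace-approximate log of the integral over the subject's random intercept b."""
    def negh(b):
        eta = beta[0] + beta[1] * x + b
        loglik = np.sum(y * eta - np.log1p(np.exp(eta)))                 # Bernoulli log-likelihood of the repeated outcomes
        logprior = -0.5 * (b / sigma_b) ** 2 - np.log(sigma_b * np.sqrt(2 * np.pi))
        return -(loglik + logprior)                                      # negh = -h, so minimising finds the mode of h
    opt = minimize_scalar(negh)
    b_star, h_star = opt.x, -opt.fun
    eps = 1e-4                                                           # numerical second derivative of h at the mode
    h2 = -(negh(b_star + eps) - 2 * negh(b_star) + negh(b_star - eps)) / eps**2
    return h_star + 0.5 * np.log(2 * np.pi / -h2)

rng = np.random.default_rng(6)
x = rng.normal(size=20)                                                  # toy covariate for one subject's 20 visits
b_true = 0.7 * rng.normal()
y = (rng.random(20) < 1 / (1 + np.exp(-(0.2 + 0.5 * x + b_true)))).astype(float)
print(laplace_subject_loglik(y, x, beta=(0.2, 0.5), sigma_b=0.7))
```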
Nonparametric Bayesian multiarmed bandits for single-cell experiment design. F. Camerlenghi, Bianca Dumitrascu, F. Ferrari, B. Engelhardt, S. Favaro. arXiv: Applications, 2019-10-11. DOI: 10.1214/20-aoas1370
The problem of maximizing cell type discovery under budget constraints is a fundamental challenge in the collection and analysis of single-cell RNA-sequencing (scRNA-seq) data. In this paper, we introduce a simple, computationally efficient, and scalable Bayesian nonparametric sequential approach to optimize budget allocation when designing a large-scale collection of scRNA-seq data for the purpose of, but not limited to, creating cell atlases. Our approach relies on (i) a hierarchical Pitman-Yor prior that recapitulates biological assumptions regarding cellular differentiation, and (ii) a Thompson sampling multi-armed bandit strategy that balances exploitation and exploration to prioritize experiments across a sequence of trials. Posterior inference is performed using a sequential Monte Carlo approach, which allows us to fully exploit the sequential nature of our species sampling problem. We empirically show that our approach outperforms state-of-the-art methods and achieves near-oracle performance on simulated and real data alike. HPY-TS code is available at this https URL.
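A simplified sketch of the experiment-design loop: each arm (e.g., a tissue) yields cells whose chance of revealing a new type follows a Pitman-Yor predictive rule, and a Thompson sampling bandit spends the sequencing budget where discovery looks most likely. The Beta-Bernoulli posterior below is a stand-in for the paper's hierarchical Pitman-Yor prior with sequential Monte Carlo, and all numbers are toy values.

```python
import numpy as np

rng = np.random.default_rng(7)
arms = [dict(d=0.3, theta=5.0, n=0, types=0) for _ in range(3)]   # toy Pitman-Yor parameters per arm
alpha, beta = np.ones(3), np.ones(3)                              # Beta(1, 1) posterior on "next cell is a new type"
discovered = 0

for t in range(500):                                              # total sequencing budget (cells)
    j = int(np.argmax(rng.beta(alpha, beta)))                     # Thompson draw picks the arm to sequence next
    a = arms[j]
    p_new = (a["theta"] + a["d"] * a["types"]) / (a["theta"] + a["n"])  # PY probability of a new type on the next draw
    new = rng.random() < p_new                                    # simulate sequencing one cell from that arm
    a["n"] += 1; a["types"] += int(new); discovered += int(new)
    alpha[j] += new; beta[j] += 1 - new                           # update the chosen arm's posterior
print("cell types discovered within budget:", discovered)
```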
Late 19th century navigational uncertainties and their influence on sea surface temperature estimates. Chenguang Dai, Duo Chan, P. Huybers, Natesh L. Pillai. arXiv: Applications, 2019-10-10. DOI: 10.1214/20-AOAS1367
Accurate estimates of historical changes in sea surface temperatures (SSTs) and their uncertainties are important for documenting and understanding historical changes in climate. A source of uncertainty that has not previously been quantified in historical SST estimates stems from position errors. A Bayesian inference framework is proposed for quantifying errors in reported positions and their implications for SST estimates. The analysis framework is applied to data from ICOADS3.0 in 1885, a time when astronomical and chronometer estimation of position was common, but predating the use of radio signals. The focus is on a subset of 943 ship tracks from ICOADS3.0 that report their position every two hours to a precision of 0.01° longitude and latitude. These data are interpreted as positions determined by dead reckoning that are periodically updated by celestial correction techniques. The 90% posterior probability intervals for two-hourly dead reckoning uncertainties are (9.90%, 11.10%) for ship speed and (10.43°, 11.92°) for ship heading, leading to position uncertainties that average 0.29° (32 km on the equator) in longitude and 0.20° (22 km) in latitude. Reported ship tracks also contain systematic position uncertainties relating to precursor dead-reckoning positions not being updated after obtaining celestial position estimates, indicating that more accurate positions can be provided for SST observations. Finally, we translate position errors into SST uncertainties by sampling an ensemble of SSTs from the MURSST data set. Evolving technology for determining ship position, heterogeneous reporting and archiving of position information, and seasonal and spatial changes in navigational uncertainty and SST gradients together imply that accounting for positional error in SST estimates over the span of the instrumental record will require substantial additional effort.
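The sketch below shows how two-hourly dead reckoning accumulates position error when logged speed and heading carry noise on roughly the scales reported above (about 10.5% of speed and 11° of heading); the track geometry, step count, and names are invented for illustration and the paper's Bayesian inference is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(8)
n_steps, dt = 12, 2.0                         # one day of two-hourly position updates (hours)
speed, heading = 8.0, np.deg2rad(60.0)        # true speed (knots) and true course

true = np.cumsum(np.column_stack([speed * dt * np.sin(heading) * np.ones(n_steps),
                                  speed * dt * np.cos(heading) * np.ones(n_steps)]), axis=0)
sp_noisy = speed * (1 + 0.105 * rng.standard_normal(n_steps))         # ~10.5% multiplicative speed error
hd_noisy = heading + np.deg2rad(11.0) * rng.standard_normal(n_steps)  # ~11 degree heading error
reck = np.cumsum(np.column_stack([sp_noisy * dt * np.sin(hd_noisy),
                                  sp_noisy * dt * np.cos(hd_noisy)]), axis=0)

drift_nm = np.hypot(*(reck - true).T)         # accumulated position error, in nautical miles, at each step
print("drift after each 2-hour step (nm):", np.round(drift_nm, 1))
```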
Covariance Matrix Estimation under Total Positivity for Portfolio Selection. Raj Agrawal, Uma Roy, Caroline Uhler. arXiv: Applications, 2019-09-10. DOI: 10.1093/jjfinec/nbaa018
Selecting the optimal Markowitz portfolio depends on estimating the covariance matrix of the returns of $N$ assets from $T$ periods of historical data. Problematically, $N$ is typically of the same order as $T$, which makes the sample covariance matrix estimator perform poorly, both empirically and theoretically. While various other general-purpose covariance matrix estimators have been introduced in the financial economics and statistics literature to deal with the high dimensionality of this problem, here we propose an estimator that exploits the fact that assets are typically positively dependent. This is achieved by imposing that the joint distribution of returns be multivariate totally positive of order 2 ($\text{MTP}_2$). This constraint on the covariance matrix not only enforces positive dependence among the assets, but also regularizes the covariance matrix, leading to desirable statistical properties such as sparsity. Based on stock-market data spanning over thirty years, we show that estimating the covariance matrix under $\text{MTP}_2$ outperforms previous state-of-the-art methods, including shrinkage estimators and factor models.
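For context on the downstream portfolio step, the sketch below computes global minimum-variance Markowitz weights, w = Sigma^{-1} 1 / (1' Sigma^{-1} 1), from a covariance estimate. The sample covariance of toy returns stands in for the MTP2-constrained estimator studied in the paper, whose constrained optimization is not reproduced here.

```python
import numpy as np

def min_variance_weights(sigma):
    """Global minimum-variance portfolio weights for a given covariance estimate."""
    ones = np.ones(sigma.shape[0])
    w = np.linalg.solve(sigma, ones)          # Sigma^{-1} 1 without forming the explicit inverse
    return w / w.sum()

rng = np.random.default_rng(9)
cov_true = 0.5 * np.eye(5) + 0.5                                  # toy positively dependent assets
returns = rng.multivariate_normal(np.zeros(5), cov_true, size=60)  # T = 60 periods, N = 5 assets
sigma_hat = np.cov(returns, rowvar=False)                          # sample covariance; an MTP2 estimate would be swapped in here
w = min_variance_weights(sigma_hat)
print("portfolio weights:", np.round(w, 3), "sum:", w.sum())
```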