A novel method is proposed for computing the exact posterior mean and covariance of the random effects given the response in a generalized linear mixed model (GLMM) when the response is not normally distributed. The research solves a long-standing problem in Bayesian statistics: it is well known that the posterior distribution of the random effects given the response in such a GLMM contains intractable integrals. Previous methods rely on Monte Carlo simulation of the posterior distribution and therefore do not provide the exact posterior mean and covariance of the random effects. The special integral computation (SIC) method is proposed to overcome this difficulty. The SIC method does not use the posterior distribution at all; instead, it formulates an optimization problem whose solution yields the desired quantities, so computing the posterior distribution becomes unnecessary. The proposed SIC thus avoids the main difficulty in Bayesian analysis when intractable integrals appear in the posterior distribution.
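To see where the intractability arises (a standard illustration, not specific to the SIC method), consider a Poisson GLMM with a single Gaussian random effect $u \sim N(0, \sigma^2)$:

\[
p(u \mid y) \;=\; \frac{f(y \mid u)\,\phi(u;\, 0, \sigma^2)}{\displaystyle\int_{-\infty}^{\infty} f(y \mid u)\,\phi(u;\, 0, \sigma^2)\, du},
\qquad
f(y \mid u) \;=\; \prod_{i=1}^{n} \frac{\lambda_i^{y_i} e^{-\lambda_i}}{y_i!},
\quad
\lambda_i = \exp(x_i^\top \beta + u).
\]

The normalizing integral in the denominator has no closed form, so $E(u \mid y)$ and $\mathrm{Var}(u \mid y)$ cannot be obtained by direct integration; Monte Carlo methods only approximate them, whereas the SIC method is claimed to deliver them exactly without evaluating this posterior.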
{"title":"Exact Posterior Mean and Covariance for Generalized Linear Mixed Models","authors":"Tonglin Zhang","doi":"arxiv-2409.09310","DOIUrl":"https://doi.org/arxiv-2409.09310","url":null,"abstract":"A novel method is proposed for the exact posterior mean and covariance of the\u0000random effects given the response in a generalized linear mixed model (GLMM)\u0000when the response does not follow normal. The research solves a long-standing\u0000problem in Bayesian statistics when an intractable integral appears in the\u0000posterior distribution. It is well-known that the posterior distribution of the\u0000random effects given the response in a GLMM when the response does not follow\u0000normal contains intractable integrals. Previous methods rely on Monte Carlo\u0000simulations for the posterior distributions. They do not provide the exact\u0000posterior mean and covariance of the random effects given the response. The\u0000special integral computation (SIC) method is proposed to overcome the\u0000difficulty. The SIC method does not use the posterior distribution in the\u0000computation. It devises an optimization problem to reach the task. An advantage\u0000is that the computation of the posterior distribution is unnecessary. The\u0000proposed SIC avoids the main difficulty in Bayesian analysis when intractable\u0000integrals appear in the posterior distribution.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In spatial econometrics, methodologies that model the spatial dependence of observed variables and yield more precise predictions of both the mean and the variability of the response variable are valuable tools for territorial planning and public policy. This paper proposes a new methodology that jointly models the mean and the variance. It also allows the spatial dependence of the dependent variable to be modeled as a function of covariates and accommodates semiparametric effects in both models. The algorithms developed are based on generalized additive models that allow the inclusion of non-parametric terms in both the mean and the variance while maintaining the traditional theoretical framework of spatial regression. The theoretical development of the estimation of this model is carried out, and the estimators are shown to have desirable statistical properties. A simulation study verifies that the proposed method has a remarkable predictive capacity in terms of mean square error and shows a notable improvement in the estimation of the spatial autoregressive parameter compared to other traditional methods and some recent developments. The model is also tested on data used to construct a hedonic price model for the city of Bogotá, the main result being its ability to model the variability of housing prices and the richness of the resulting analysis.
{"title":"Joint spatial modeling of mean and non-homogeneous variance combining semiparametric SAR and GAMLSS models for hedonic prices","authors":"J. D. Toloza-Delgado, O. O. Melo, N. A. Cruz","doi":"arxiv-2409.08912","DOIUrl":"https://doi.org/arxiv-2409.08912","url":null,"abstract":"In the context of spatial econometrics, it is very useful to have\u0000methodologies that allow modeling the spatial dependence of the observed\u0000variables and obtaining more precise predictions of both the mean and the\u0000variability of the response variable, something very useful in territorial\u0000planning and public policies. This paper proposes a new methodology that\u0000jointly models the mean and the variance. Also, it allows to model the spatial\u0000dependence of the dependent variable as a function of covariates and to model\u0000the semiparametric effects in both models. The algorithms developed are based\u0000on generalized additive models that allow the inclusion of non-parametric terms\u0000in both the mean and the variance, maintaining the traditional theoretical\u0000framework of spatial regression. The theoretical developments of the estimation\u0000of this model are carried out, obtaining desirable statistical properties in\u0000the estimators. A simulation study is developed to verify that the proposed\u0000method has a remarkable predictive capacity in terms of the mean square error\u0000and shows a notable improvement in the estimation of the spatial autoregressive\u0000parameter, compared to other traditional methods and some recent developments.\u0000The model is also tested on data from the construction of a hedonic price model\u0000for the city of Bogota, highlighting as the main result the ability to model\u0000the variability of housing prices, and the wealth in the analysis obtained.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Roman Hornung (Institute for Medical Information Processing, Biometry and Epidemiology, LMU Munich, Munich, Germany; Munich Center for Machine Learning), Alexander Hapfelmeier (Institute of AI and Informatics in Medicine, TUM School of Medicine and Health, Technical University of Munich, Munich, Germany)
In prediction tasks with multi-class outcomes, identifying covariates specifically associated with one or more outcome classes can be important. Conventional variable importance measures (VIMs) from random forests (RFs), like permutation and Gini importance, focus on overall predictive performance or node purity, without differentiating between the classes. Therefore, they can be expected to fail to distinguish class-associated covariates from covariates that only distinguish between groups of classes. We introduce a VIM called multi-class VIM, tailored for identifying exclusively class-associated covariates, via a novel RF variant called multi forests (MuFs). The trees in MuFs use both multi-way and binary splitting. The multi-way splits generate child nodes for each class, using a split criterion that evaluates how well these nodes represent their respective classes. This setup forms the basis of the multi-class VIM, which measures the discriminatory ability of the splits performed in the respective covariates with regard to this split criterion. Alongside the multi-class VIM, we introduce a second VIM, the discriminatory VIM. This measure, based on the binary splits, assesses the strength of the general influence of the covariates, irrespective of their class-associatedness. Simulation studies demonstrate that the multi-class VIM specifically ranks class-associated covariates highly, unlike conventional VIMs which also rank other types of covariates highly. Analyses of 121 datasets reveal that MuFs often have slightly lower predictive performance compared to conventional RFs. This is, however, not a limiting factor given the algorithm's primary purpose of calculating the multi-class VIM.
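As a toy illustration of a multi-way, class-wise split (a simplified stand-in for the MuFs criterion, whose exact form the abstract does not specify), one can order the classes by their mean covariate value, cut at the midpoints between adjacent class means, and score the split by how well each child node captures "its" class. The function name and scoring rule below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def multiway_split_score(x, y, n_classes):
    """Toy multi-way split with one child node per class.

    Classes are ordered by their mean value of covariate x; cutpoints are
    midpoints between adjacent class means. The score is the average, over
    classes, of the fraction of each class's samples landing in 'its' node --
    a crude proxy for how well the nodes represent their respective classes.
    """
    class_means = np.array([x[y == k].mean() for k in range(n_classes)])
    order = np.argsort(class_means)              # rank -> class
    cuts = (class_means[order][:-1] + class_means[order][1:]) / 2.0
    node = np.searchsorted(cuts, x)              # node index of each sample
    rank_of_class = np.empty(n_classes, dtype=int)
    rank_of_class[order] = np.arange(n_classes)  # class -> its node's rank
    hit = node == rank_of_class[y]
    return float(np.mean([hit[y == k].mean() for k in range(n_classes)]))

rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=300)
x_assoc = y + rng.normal(scale=0.5, size=300)    # class-associated covariate
x_noise = rng.normal(size=300)                   # uninformative covariate
print(multiway_split_score(x_assoc, y, 3))       # high score
print(multiway_split_score(x_noise, y, 3))       # near 1/3
```

A class-associated covariate separates all classes and scores high; a covariate that merely separates groups of classes, or none, scores low, which is the distinction the multi-class VIM is designed to exploit.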
{"title":"Multi forests: Variable importance for multi-class outcomes","authors":"Roman HornungInstitute for Medical Information Processing, Biometry and Epidemiology, LMU Munich, Munich, GermanyMunich Center for Machine Learning, Alexander HapfelmeierInstitute of AI and Informatics in Medicine, TUM School of Medicine and Health, Technical University of Munich, Munich, Germany","doi":"arxiv-2409.08925","DOIUrl":"https://doi.org/arxiv-2409.08925","url":null,"abstract":"In prediction tasks with multi-class outcomes, identifying covariates\u0000specifically associated with one or more outcome classes can be important.\u0000Conventional variable importance measures (VIMs) from random forests (RFs),\u0000like permutation and Gini importance, focus on overall predictive performance\u0000or node purity, without differentiating between the classes. Therefore, they\u0000can be expected to fail to distinguish class-associated covariates from\u0000covariates that only distinguish between groups of classes. We introduce a VIM\u0000called multi-class VIM, tailored for identifying exclusively class-associated\u0000covariates, via a novel RF variant called multi forests (MuFs). The trees in\u0000MuFs use both multi-way and binary splitting. The multi-way splits generate\u0000child nodes for each class, using a split criterion that evaluates how well\u0000these nodes represent their respective classes. This setup forms the basis of\u0000the multi-class VIM, which measures the discriminatory ability of the splits\u0000performed in the respective covariates with regard to this split criterion.\u0000Alongside the multi-class VIM, we introduce a second VIM, the discriminatory\u0000VIM. This measure, based on the binary splits, assesses the strength of the\u0000general influence of the covariates, irrespective of their\u0000class-associatedness. Simulation studies demonstrate that the multi-class VIM\u0000specifically ranks class-associated covariates highly, unlike conventional VIMs\u0000which also rank other types of covariates highly. Analyses of 121 datasets\u0000reveal that MuFs often have slightly lower predictive performance compared to\u0000conventional RFs. This is, however, not a limiting factor given the algorithm's\u0000primary purpose of calculating the multi-class VIM.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Significant events such as volcanic eruptions can have global and long-lasting impacts on climate. These global impacts, however, are not uniform across space and time. Understanding how the Mt. Pinatubo eruption affected global and regional climate is of great interest for predicting the climate impacts of similar events. We propose a Bayesian framework to simultaneously detect and estimate spatially-varying temporal changepoints for regional climate impacts. Our approach takes into account the diffusing nature of the changes caused by the volcanic eruption and leverages spatial correlation. We illustrate our method on simulated datasets and compare it with an existing changepoint detection method. Finally, we apply our method to monthly stratospheric aerosol optical depth and surface temperature data from 1985 to 1995 to detect and estimate changepoints following the 1991 Mt. Pinatubo eruption.
{"title":"Tracing the impacts of Mount Pinatubo eruption on global climate using spatially-varying changepoint detection","authors":"Samantha Shi-Jun, Lyndsay Shand, Bo Li","doi":"arxiv-2409.08908","DOIUrl":"https://doi.org/arxiv-2409.08908","url":null,"abstract":"Significant events such as volcanic eruptions can have global and long\u0000lasting impacts on climate. These global impacts, however, are not uniform\u0000across space and time. Understanding how the Mt. Pinatubo eruption affects\u0000global and regional climate is of great interest for predicting impact on\u0000climate due to similar events. We propose a Bayesian framework to\u0000simultaneously detect and estimate spatially-varying temporal changepoints for\u0000regional climate impacts. Our approach takes into account the diffusing nature\u0000of the changes caused by the volcanic eruption and leverages spatial\u0000correlation. We illustrate our method on simulated datasets and compare it with\u0000an existing changepoint detection method. Finally, we apply our method on\u0000monthly stratospheric aerosol optical depth and surface temperature data from\u00001985 to 1995 to detect and estimate changepoints following the 1991 Mt.\u0000Pinatubo eruption.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"74 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Martin Bubel, Jochen Schmid, Maximilian Carmesin, Volodymyr Kozachynskyi, Erik Esche, Michael Bortz
Calibrating model parameters to measured data by minimizing loss functions is an important step in obtaining realistic predictions from model-based approaches, e.g., for process optimization. This is applicable to both knowledge-driven and data-driven model setups. Due to measurement errors, the calibrated model parameters also carry uncertainty. In this contribution, we use cubature formulas based on sparse grids to calculate the variance of the regression results. The number of cubature points is close to the theoretical minimum required for a given level of exactness. We present exact benchmark results, which we also compare to other cubatures. This scheme is then applied to estimate the prediction uncertainty of the NRTL model, calibrated to observations from different experimental designs.
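The sketch below illustrates the underlying idea — approximating the variance of a model prediction under Gaussian parameter uncertainty by a weighted sum over cubature points — using a dense Gauss-Hermite tensor grid rather than the authors' sparse-grid construction (which needs far fewer points in higher dimensions). The toy model and all numbers are illustrative assumptions:

```python
import numpy as np

def prediction_variance(model, theta_mean, theta_cov, n_nodes=5):
    """Variance of model(theta) for theta ~ N(theta_mean, theta_cov),
    via a tensor-product Gauss-Hermite cubature."""
    d = len(theta_mean)
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    L = np.linalg.cholesky(theta_cov)
    grids = np.meshgrid(*([nodes] * d), indexing="ij")
    pts = np.stack([g.ravel() for g in grids], axis=1)       # (n_nodes**d, d)
    w = np.prod(np.meshgrid(*([weights] * d), indexing="ij"), axis=0).ravel()
    w = w / np.pi ** (d / 2)                                 # normalize weights
    thetas = theta_mean + np.sqrt(2.0) * pts @ L.T           # change of variables
    vals = np.array([model(t) for t in thetas])
    mean = np.sum(w * vals)
    return np.sum(w * (vals - mean) ** 2)

# toy nonlinear model with two calibrated parameters (made-up numbers)
model = lambda th: np.exp(-th[0]) + th[1] ** 2
print(prediction_variance(model, np.array([1.0, 0.5]), np.diag([0.04, 0.01])))
```

With 5 nodes per dimension this uses 5^d points; the appeal of the sparse grids in the paper is that they achieve a given level of exactness with close to the theoretical minimum number of points.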
{"title":"Cubature-based uncertainty estimation for nonlinear regression models","authors":"Martin Bubel, Jochen Schmid, Maximilian Carmesin, Volodymyr Kozachynskyi, Erik Esche, Michael Bortz","doi":"arxiv-2409.08756","DOIUrl":"https://doi.org/arxiv-2409.08756","url":null,"abstract":"Calibrating model parameters to measured data by minimizing loss functions is\u0000an important step in obtaining realistic predictions from model-based\u0000approaches, e.g., for process optimization. This is applicable to both\u0000knowledge-driven and data-driven model setups. Due to measurement errors, the\u0000calibrated model parameters also carry uncertainty. In this contribution, we\u0000use cubature formulas based on sparse grids to calculate the variance of the\u0000regression results. The number of cubature points is close to the theoretical\u0000minimum required for a given level of exactness. We present exact benchmark\u0000results, which we also compare to other cubatures. This scheme is then applied\u0000to estimate the prediction uncertainty of the NRTL model, calibrated to\u0000observations from different experimental designs.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In many temporal datasets, the parameters of the underlying distribution may change abruptly at unknown times. Detecting these changepoints is crucial for numerous applications. While this problem has been extensively studied for linear data, there has been remarkably less research on bivariate angular data. For the first time, we address the changepoint problem for the mean direction of toroidal and spherical data, which are types of bivariate angular data. By leveraging the intrinsic geometry of a curved torus, we introduce the concept of the "square" of an angle. This leads us to define the "curved dispersion matrix" for bivariate angular random variables, analogous to the dispersion matrix for bivariate linear random variables. Using this analogous measure of the "Mahalanobis distance," we develop two new non-parametric tests to identify changes in the mean direction parameters for toroidal and spherical distributions. We derive the limiting distributions of the test statistics and evaluate their power surface and contours through extensive simulations. We also apply the proposed methods to detect changes in mean direction for hourly wind-wave direction measurements and the path of the cyclonic storm "Biporjoy," which occurred between 6 and 19 June 2023 over the Arabian Sea, off the western coast of India.
{"title":"Angular Co-variance using intrinsic geometry of torus: Non-parametric change points detection in meteorological data","authors":"Surojit Biswas, Buddhananda Banerjee, Arnab Kumar Laha","doi":"arxiv-2409.08838","DOIUrl":"https://doi.org/arxiv-2409.08838","url":null,"abstract":"In many temporal datasets, the parameters of the underlying distribution may\u0000change abruptly at unknown times. Detecting these changepoints is crucial for\u0000numerous applications. While this problem has been extensively studied for\u0000linear data, there has been remarkably less research on bivariate angular data.\u0000For the first time, we address the changepoint problem for the mean direction\u0000of toroidal and spherical data, which are types of bivariate angular data. By\u0000leveraging the intrinsic geometry of a curved torus, we introduce the concept\u0000of the ``square'' of an angle. This leads us to define the ``curved dispersion\u0000matrix'' for bivariate angular random variables, analogous to the dispersion\u0000matrix for bivariate linear random variables. Using this analogous measure of\u0000the ``Mahalanobis distance,'' we develop two new non-parametric tests to\u0000identify changes in the mean direction parameters for toroidal and spherical\u0000distributions. We derive the limiting distributions of the test statistics and\u0000evaluate their power surface and contours through extensive simulations. We\u0000also apply the proposed methods to detect changes in mean direction for hourly\u0000wind-wave direction measurements and the path of the cyclonic storm\u0000``Biporjoy,'' which occurred between 6th and 19th June 2023 over the Arabian\u0000Sea, western coast of India.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We show that for any family of distributions with support on [0,1] with strictly monotonic cumulative distribution function (CDF) that has no jumps and is quantile-identifiable (i.e., any two distinct quantiles identify the distribution), knowing the first moment and c-statistic is enough to identify the distribution. The derivations motivate numerical algorithms for mapping a given pair of expected value and c-statistic to the parameters of specified two-parameter distributions for probabilities. We implemented these algorithms in R and in a simulation study evaluated their numerical accuracy for common families of distributions for risks (beta, logit-normal, and probit-normal). An area of application for these developments is in risk prediction modeling (e.g., sample size calculations and Value of Information analysis), where one might need to estimate the parameters of the distribution of predicted risks from the reported summary statistics.
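The abstract does not give the algorithms' details; the sketch below shows one way such a mapping could work for the beta family, using a Monte Carlo estimate of the c-statistic (the probability that a random case has a higher predicted risk than a random non-case) and a scalar root search over the concentration parameter. Function names, sample sizes, and brackets are illustrative assumptions, not the authors' R implementation:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import beta as beta_dist

def c_statistic(a, b, n=200_000, seed=1):
    """Monte Carlo c-statistic of a Beta(a, b) risk distribution:
    P(risk of a random case > risk of a random non-case)."""
    p = np.sort(beta_dist.rvs(a, b, size=n, random_state=seed))
    m = p.mean()
    # E[p_i (1 - p_j) 1{p_i > p_j}] over pairs, via cumulative sums
    cum_noncase = np.concatenate(([0.0], np.cumsum(1.0 - p)[:-1]))
    return np.sum(p * cum_noncase) / n**2 / (m * (1.0 - m))

def beta_from_mean_and_c(m, c_target):
    """Beta parameters with mean m matching a target c-statistic.
    The mean pins down a = m*k, b = (1-m)*k; larger concentration k makes
    the distribution more peaked and pushes c toward 0.5, so c is monotone
    in k and a one-dimensional root search suffices."""
    f = lambda log_k: c_statistic(m * np.exp(log_k),
                                  (1 - m) * np.exp(log_k)) - c_target
    k = np.exp(brentq(f, np.log(0.01), np.log(1000.0), xtol=1e-4))
    return m * k, (1 - m) * k

a, b = beta_from_mean_and_c(0.2, 0.75)
print(a, b, c_statistic(a, b))   # recovers c close to 0.75
```

Fixing the random seed makes the objective deterministic in k, which keeps the root search well behaved despite the Monte Carlo estimation.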
{"title":"Identification of distributions for risks based on the first moment and c-statistic","authors":"Mohsen Sadatsafavi, Tae Yoon Lee, John Petkau","doi":"arxiv-2409.09178","DOIUrl":"https://doi.org/arxiv-2409.09178","url":null,"abstract":"We show that for any family of distributions with support on [0,1] with\u0000strictly monotonic cumulative distribution function (CDF) that has no jumps and\u0000is quantile-identifiable (i.e., any two distinct quantiles identify the\u0000distribution), knowing the first moment and c-statistic is enough to identify\u0000the distribution. The derivations motivate numerical algorithms for mapping a\u0000given pair of expected value and c-statistic to the parameters of specified\u0000two-parameter distributions for probabilities. We implemented these algorithms\u0000in R and in a simulation study evaluated their numerical accuracy for common\u0000families of distributions for risks (beta, logit-normal, and probit-normal). An\u0000area of application for these developments is in risk prediction modeling\u0000(e.g., sample size calculations and Value of Information analysis), where one\u0000might need to estimate the parameters of the distribution of predicted risks\u0000from the reported summary statistics.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tobias Kley, Yuhan Philip Liu, Hongyuan Cao, Wei Biao Wu
This paper considers the problem of testing for and estimating a change point when signals after the change point can be highly irregular, departing from the existing literature, which assumes signals after the change point to be piecewise constant or smoothly varying. A two-step approach is proposed to effectively estimate the location of the change point. The first step is a preliminary estimation of the change point, from which we obtain unknown parameters needed for the second step. In the second step we use a new procedure to determine the position of the change point. We show that, under suitable conditions, the desirable $\mathcal{O}_P(1)$ rate of convergence of the estimated change point can be obtained. We apply our method to analyze the Baidu search index of COVID-19 related symptoms and find 8 December 2019 to be the starting date of the COVID-19 pandemic.
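The abstract does not spell out either step; as a generic illustration of the kind of preliminary estimate used in a first step (not the authors' procedure), a standard CUSUM statistic locates a mean shift even when the post-change signal is irregular:

```python
import numpy as np

def cusum_changepoint(x):
    """Preliminary change point estimate: argmax over k of the CUSUM
    statistic sqrt(k(n-k)/n) * |mean(x[:k]) - mean(x[k:])|."""
    n = len(x)
    k = np.arange(1, n)
    csum = np.cumsum(x)
    mean_left = csum[:-1] / k
    mean_right = (csum[-1] - csum[:-1]) / (n - k)
    stat = np.sqrt(k * (n - k) / n) * np.abs(mean_left - mean_right)
    return int(np.argmax(stat)) + 1    # number of pre-change observations

rng = np.random.default_rng(42)
# irregular post-change signal: noisy, oscillating level after t = 120
pre = rng.normal(0.0, 1.0, 120)
post = 1.5 + 0.8 * np.sin(np.arange(80) / 7.0) + rng.normal(0.0, 1.0, 80)
print(cusum_changepoint(np.concatenate([pre, post])))   # near 120
```

In the paper's framework such a pilot estimate supplies the unknown parameters that the second-stage procedure then uses to pin down the change point location at the $\mathcal{O}_P(1)$ rate.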
{"title":"Change point analysis with irregular signals","authors":"Tobias Kley, Yuhan Philip Liu, Hongyuan Cao, Wei Biao Wu","doi":"arxiv-2409.08863","DOIUrl":"https://doi.org/arxiv-2409.08863","url":null,"abstract":"This paper considers the problem of testing and estimation of change point\u0000where signals after the change point can be highly irregular, which departs\u0000from the existing literature that assumes signals after the change point to be\u0000piece-wise constant or vary smoothly. A two-step approach is proposed to\u0000effectively estimate the location of the change point. The first step consists\u0000of a preliminary estimation of the change point that allows us to obtain\u0000unknown parameters for the second step. In the second step we use a new\u0000procedure to determine the position of the change point. We show that, under\u0000suitable conditions, the desirable $mathcal{O}_P(1)$ rate of convergence of\u0000the estimated change point can be obtained. We apply our method to analyze the\u0000Baidu search index of COVID-19 related symptoms and find 8~December 2019 to be\u0000the starting date of the COVID-19 pandemic.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kendrick Li, George C. Linderman, Xu Shi, Eric J. Tchetgen Tchetgen
Unmeasured confounding is one of the major concerns in causal inference from observational data. Proximal causal inference (PCI) is an emerging methodological framework to detect and potentially account for confounding bias by carefully leveraging a pair of negative control exposure (NCE) and outcome (NCO) variables, also known as treatment and outcome confounding proxies. Although regression-based PCI is well developed for binary and continuous outcomes, analogous PCI regression methods for right-censored time-to-event outcomes are currently lacking. In this paper, we propose a novel two-stage regression PCI approach for right-censored survival data under an additive hazard structural model. We provide theoretical justification for the proposed approach tailored to different types of NCOs, including continuous, count, and right-censored time-to-event variables. We illustrate the approach with an evaluation of the effectiveness of right heart catheterization among critically ill patients using data from the SUPPORT study. Our method is implemented in the open-access R package 'pci2s'.
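For intuition, here is a minimal linear-outcome analog of two-stage proximal inference — regress the NCO on the treatment and the NCE, then regress the outcome on the treatment and the stage-1 fitted values. This is not the paper's additive-hazards method for censored survival data (implemented in their R package 'pci2s'); the simulated data and coefficients are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
u = rng.normal(size=n)                        # unmeasured confounder
z = u + rng.normal(size=n)                    # NCE (treatment confounding proxy)
w = u + rng.normal(size=n)                    # NCO (outcome confounding proxy)
a = 0.8 * u + 0.5 * z + rng.normal(size=n)    # treatment, confounded by u
y = 1.0 * a + 2.0 * u + rng.normal(size=n)    # outcome; true effect = 1.0

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
# Naive regression of y on a is biased upward by u
print("naive:", ols(np.column_stack([ones, a]), y)[1])
# Stage 1: regress the NCO w on treatment a and the NCE z
S1 = np.column_stack([ones, a, z])
w_hat = S1 @ ols(S1, w)
# Stage 2: regress y on a and the stage-1 fitted values; the coefficient
# on a recovers the causal effect because w_hat stands in for u
print("proximal 2-stage:", ols(np.column_stack([ones, a, w_hat]), y)[1])
```

The stage-1 fitted values capture the part of the unmeasured confounder that the proxies reveal, so the stage-2 treatment coefficient is approximately 1.0 while the naive estimate is markedly biased.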
{"title":"Regression-based proximal causal inference for right-censored time-to-event data","authors":"Kendrick Li, George C. Linderman, Xu Shi, Eric J. Tchetgen Tchetgen","doi":"arxiv-2409.08924","DOIUrl":"https://doi.org/arxiv-2409.08924","url":null,"abstract":"Unmeasured confounding is one of the major concerns in causal inference from\u0000observational data. Proximal causal inference (PCI) is an emerging\u0000methodological framework to detect and potentially account for confounding bias\u0000by carefully leveraging a pair of negative control exposure (NCE) and outcome\u0000(NCO) variables, also known as treatment and outcome confounding proxies.\u0000Although regression-based PCI is well developed for binary and continuous\u0000outcomes, analogous PCI regression methods for right-censored time-to-event\u0000outcomes are currently lacking. In this paper, we propose a novel two-stage\u0000regression PCI approach for right-censored survival data under an additive\u0000hazard structural model. We provide theoretical justification for the proposed\u0000approach tailored to different types of NCOs, including continuous, count, and\u0000right-censored time-to-event variables. We illustrate the approach with an\u0000evaluation of the effectiveness of right heart catheterization among critically\u0000ill patients using data from the SUPPORT study. Our method is implemented in\u0000the open-access R package 'pci2s'.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Paola Vesco, David Randahl, Håvard Hegre, Stina Högbladh, Mert Can Yilmaz
Event datasets, including those provided by the Uppsala Conflict Data Program (UCDP), are based on reports from the media and international organizations and are likely to suffer from reporting bias. Since the UCDP has strict inclusion criteria, it most likely under-estimates conflict-related deaths, but we do not know by how much. Here, we provide a generalizable, cross-national measure of uncertainty around UCDP-reported fatalities that is more robust and realistic than UCDP's documented low and high estimates, and we make available a dataset and R package accounting for the measurement uncertainty. We use a structured expert elicitation combined with statistical modelling to derive a distribution of the plausible number of fatalities given the number of battle-related deaths and the type of violence documented by the UCDP. The results can help scholars understand the extent of bias affecting their empirical analyses of organized violence and contribute to improving the accuracy of conflict forecasting systems.
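A stripped-down sketch of the general idea — turn elicited quantiles of an underreporting ratio into a distribution of plausible totals — is shown below. The lognormal choice, the elicited quantiles, and the reported count are all invented for illustration and are not the paper's elicitation protocol or model:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical elicitation: experts' median and 90th percentile of the
# ratio (true deaths / UCDP-reported deaths) -- numbers are made up.
ratio_median, ratio_p90 = 1.4, 2.5
mu = np.log(ratio_median)
sigma = (np.log(ratio_p90) - mu) / norm.ppf(0.90)

def plausible_deaths(reported, n_draws=10_000, seed=0):
    """Draws from the implied distribution of plausible total fatalities,
    treating the underreporting ratio as lognormal(mu, sigma)."""
    rng = np.random.default_rng(seed)
    return reported * rng.lognormal(mu, sigma, size=n_draws)

draws = plausible_deaths(1200)
print(np.percentile(draws, [5, 50, 95]).round())   # uncertainty interval
```

In the paper, the corresponding distribution additionally depends on the number of battle-related deaths and the type of violence, with the ratio distribution informed by the structured expert elicitation.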
{"title":"The underreported death toll of wars: a probabilistic reassessment from a structured expert elicitation","authors":"Paola Vesco, David Randahl, Håvard Hegre, Stina Högbladh, Mert Can Yilmaz","doi":"arxiv-2409.08779","DOIUrl":"https://doi.org/arxiv-2409.08779","url":null,"abstract":"Event datasets including those provided by Uppsala Conflict Data Program\u0000(UCDP) are based on reports from the media and international organizations, and\u0000are likely to suffer from reporting bias. Since the UCDP has strict inclusion\u0000criteria, they most likely under-estimate conflict-related deaths, but we do\u0000not know by how much. Here, we provide a generalizable, cross-national measure\u0000of uncertainty around UCDP reported fatalities that is more robust and\u0000realistic than UCDP's documented low and high estimates, and make available a\u0000dataset and R package accounting for the measurement uncertainty. We use a\u0000structured expert elicitation combined with statistical modelling to derive a\u0000distribution of plausible number of fatalities given the number of\u0000battle-related deaths and the type of violence documented by the UCDP. The\u0000results can help scholars understand the extent of bias affecting their\u0000empirical analyses of organized violence and contribute to improve the accuracy\u0000of conflict forecasting systems.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}