Hidden Markov models for multivariate panel data
Pub Date: 2024-09-18 | DOI: 10.1007/s11222-024-10462-0
Mackenzie R. Neal, Alexa A. Sochaniwsky, Paul D. McNicholas
While advances continue to be made in model-based clustering, challenges persist in modeling various data types such as panel data. Multivariate panel data present difficulties for clustering algorithms because they are often plagued by missing observations and dropout, which complicate estimation. This research presents a family of hidden Markov models that accommodates these features of panel data. A modified expectation–maximization algorithm capable of handling data that are missing not at random, as well as dropout, is presented and used to perform model estimation.
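The abstract itself gives no code; as a rough, minimal sketch of one ingredient of such an EM scheme, the snippet below evaluates a Gaussian hidden Markov model log-likelihood while marginalizing missing coordinates out of each emission density (effectively a missing-at-random treatment, simpler than the missing-not-at-random and dropout mechanisms handled in the paper). All names (`forward_loglik`, the argument layout) are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def forward_loglik(X, pi, A, means, covs):
    """Log-likelihood of a Gaussian HMM, marginalizing NaN (missing) coordinates.

    X     : (T, p) observations, NaN where missing
    pi    : (K,) initial state probabilities
    A     : (K, K) transition matrix
    means : (K, p) state-specific means
    covs  : (K, p, p) state-specific covariances
    """
    T, p = X.shape
    K = len(pi)
    log_alpha = np.zeros(K)
    for t in range(T):
        obs = ~np.isnan(X[t])                  # observed coordinates at time t
        log_b = np.zeros(K)                    # fully missing rows keep log_b = 0
        if obs.any():                          # emission uses the observed block only
            for k in range(K):
                log_b[k] = multivariate_normal.logpdf(
                    X[t, obs], means[k, obs], covs[k][np.ix_(obs, obs)])
        if t == 0:
            log_alpha = np.log(pi) + log_b
        else:
            # forward recursion in log space: sum over previous states
            log_alpha = log_b + np.logaddexp.reduce(
                log_alpha[:, None] + np.log(A), axis=0)
    return np.logaddexp.reduce(log_alpha)
```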
{"title":"Hidden Markov models for multivariate panel data","authors":"Mackenzie R. Neal, Alexa A. Sochaniwsky, Paul D. McNicholas","doi":"10.1007/s11222-024-10462-0","DOIUrl":"https://doi.org/10.1007/s11222-024-10462-0","url":null,"abstract":"<p>While advances continue to be made in model-based clustering, challenges persist in modeling various data types such as panel data. Multivariate panel data present difficulties for clustering algorithms because they are often plagued by missing data and dropouts, presenting issues for estimation algorithms. This research presents a family of hidden Markov models that compensate for the issues that arise in panel data. A modified expectation–maximization algorithm capable of handling missing not at random data and dropout is presented and used to perform model estimation.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accelerated failure time models with error-prone response and nonlinear covariates
Pub Date: 2024-09-18 | DOI: 10.1007/s11222-024-10491-9
Li-Pang Chen
As a specific application of survival analysis, a central interest in medical studies is the analysis of patients' survival times for a specific cancer. Typically, gene expressions are treated as covariates to characterize the survival time. Within the survival analysis framework, the accelerated failure time model in parametric form is perhaps a common approach. However, the effects of gene expressions are possibly nonlinear, and both the survival time and the censoring status are subject to measurement error. In this paper, we aim to tackle these complex features simultaneously. We first correct for measurement error in the survival time and censoring status and use the corrected values to develop a corrected Buckley–James estimator. After that, we use a boosting algorithm with cubic spline estimation to iteratively recover the nonlinear relationship between covariates and survival time. Theoretically, we justify the validity of the measurement error correction and the estimation procedure. Numerical studies show that the proposed method improves estimation performance and is able to capture informative covariates. The methodology is applied to breast cancer data provided by the Netherlands Cancer Institute.
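As an illustration of the boosting-with-cubic-splines ingredient only (not the measurement error correction or the corrected Buckley–James step described above), the hedged sketch below runs componentwise L2-boosting with smoothing-spline base learners on a fully observed response. Function names and tuning defaults are hypothetical, and each covariate column is assumed to have no tied values.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def boost_splines(X, y, n_iter=100, nu=0.1, s=None):
    """Componentwise L2-boosting with cubic smoothing-spline base learners."""
    n, p = X.shape
    fit = np.full(n, y.mean())           # start from the constant model
    learners = []                        # (column index, fitted spline) per step
    for _ in range(n_iter):
        r = y - fit                      # current residuals
        best = None
        for j in range(p):
            order = np.argsort(X[:, j])  # spline fit needs increasing x (no ties assumed)
            spl = UnivariateSpline(X[order, j], r[order], k=3, s=s)
            rss = np.sum((r - spl(X[:, j])) ** 2)
            if best is None or rss < best[0]:
                best = (rss, j, spl)
        _, j, spl = best
        fit += nu * spl(X[:, j])         # damped step along the best base learner
        learners.append((j, spl))
    return fit, learners
```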
{"title":"Accelerated failure time models with error-prone response and nonlinear covariates","authors":"Li-Pang Chen","doi":"10.1007/s11222-024-10491-9","DOIUrl":"https://doi.org/10.1007/s11222-024-10491-9","url":null,"abstract":"<p>As a specific application of survival analysis, one of main interests in medical studies aims to analyze the patients’ survival time of a specific cancer. Typically, gene expressions are treated as covariates to characterize the survival time. In the framework of survival analysis, the accelerated failure time model in the parametric form is perhaps a common approach. However, gene expressions are possibly nonlinear and the survival time as well as censoring status are subject to measurement error. In this paper, we aim to tackle those complex features simultaneously. We first correct for measurement error in survival time and censoring status, and use them to develop a corrected Buckley–James estimator. After that, we use the boosting algorithm with the cubic spline estimation method to iteratively recover nonlinear relationship between covariates and survival time. Theoretically, we justify the validity of measurement error correction and estimation procedure. Numerical studies show that the proposed method improves the performance of estimation and is able to capture informative covariates. The methodology is primarily used to analyze the breast cancer data provided by the Netherlands Cancer Institute for research.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sequential model identification with reversible jump ensemble data assimilation method
Pub Date: 2024-09-18 | DOI: 10.1007/s11222-024-10499-1
Yue Huan, Hai Xiang Lin
In data assimilation (DA) schemes, the form of the processes in the evolution model is pre-determined, apart from some parameters to be estimated. In some applications, such as contaminant solute transport models and gas reservoir models, the modes in the equations of the evolution model cannot be predetermined from the outset and may change over time. We propose a sequential DA framework, the Reversible Jump Ensemble Filter (RJEnF), to identify the governing modes of the evolution model over time. The main idea is to introduce the reversible jump Markov chain Monte Carlo (RJMCMC) method into the DA scheme to handle the situation where the modes of the evolution model are unknown and the dimension of the parameter space changes. Our framework allows us to identify the modes in the evolution model and their changes, as well as to estimate the parameters and states of the dynamic system. Numerical experiments show that our framework can effectively identify the underlying evolution models and increase the predictive accuracy of DA methods.
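The sketch below shows a generic stochastic ensemble Kalman filter analysis step, one plausible building block of such an ensemble filter; it does not include the reversible jump moves that let RJEnF change the model dimension. The function name and argument layout are hypothetical.

```python
import numpy as np

def enkf_update(ensemble, H, y_obs, R, rng=None):
    """Stochastic EnKF analysis step.

    ensemble : (N, d) forecast ensemble of states/parameters
    H        : (m, d) linear observation operator
    y_obs    : (m,) observation vector
    R        : (m, m) observation-error covariance
    """
    rng = np.random.default_rng() if rng is None else rng
    N, d = ensemble.shape
    Xf = ensemble - ensemble.mean(axis=0)              # forecast anomalies
    Pf = Xf.T @ Xf / (N - 1)                           # sample covariance
    S = H @ Pf @ H.T + R                               # innovation covariance
    K = np.linalg.solve(S, H @ Pf).T                   # Kalman gain Pf H^T S^{-1}
    perturbed = y_obs + rng.multivariate_normal(np.zeros(len(y_obs)), R, size=N)
    innovations = perturbed - ensemble @ H.T           # perturbed-observation innovations
    return ensemble + innovations @ K.T                # analysis ensemble
```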
{"title":"Sequential model identification with reversible jump ensemble data assimilation method","authors":"Yue Huan, Hai Xiang Lin","doi":"10.1007/s11222-024-10499-1","DOIUrl":"https://doi.org/10.1007/s11222-024-10499-1","url":null,"abstract":"<p>In data assimilation (DA) schemes, the form representing the processes in the evolution models are pre-determined except some parameters to be estimated. In some applications, such as the contaminant solute transport model and the gas reservoir model, the modes in the equations within the evolution model cannot be predetermined from the outset and may change with the time. We propose a framework of sequential DA method named Reversible Jump Ensemble Filter (RJEnF) to identify the governing modes of the evolution model over time. The main idea is to introduce the Reversible Jump Markov Chain Monte Carlo (RJMCMC) method to the DA schemes to fit the situation where the modes of the evolution model are unknown and the dimension of the parameters is changing. Our framework allows us to identify the modes in the evolution model and their changes, as well as estimate the parameters and states of the dynamic system. Numerical experiments are conducted and the results show that our framework can effectively identify the underlying evolution models and increase the predictive accuracy of DA methods.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shrinkage for extreme partial least-squares
Pub Date: 2024-09-17 | DOI: 10.1007/s11222-024-10490-w
Julyan Arbel, Stéphane Girard, Hadrien Lorenzo
This work focuses on dimension-reduction techniques for modelling conditional extreme values. Specifically, we investigate the idea that extreme values of a response variable can be explained by nonlinear functions of linear projections of an input random vector. In this context, we examine the estimation of the projection directions, as approached by the extreme partial least squares (EPLS) method, an adaptation of the original partial least squares (PLS) method tailored to the extreme-value framework. Further, a novel interpretation of the EPLS directions as maximum likelihood estimators is introduced, utilizing the von Mises–Fisher distribution applied to hyperballs. The dimension-reduction process is enhanced through the Bayesian paradigm, enabling the incorporation of prior information into the estimation of the projection direction. The maximum a posteriori estimator is derived in two specific cases, elucidating it as a regularization or shrinkage of the EPLS estimator. We also establish its asymptotic behavior as the sample size approaches infinity. A simulation study is conducted to assess the practical utility of the proposed method; it clearly demonstrates effectiveness even with moderate sample sizes in high-dimensional settings. Furthermore, we provide an illustrative example of the method's applicability using French farm income data, highlighting its efficacy in real-world scenarios.
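The exact form of the EPLS estimator is not given in this abstract; under the assumption that the first direction is proportional to the empirical covariance between the covariates and the response computed over exceedances of a high threshold, a very rough sketch of the direction estimate and of a shrinkage toward a prior direction might look as follows (the names and the convex-combination form of the shrinkage are illustrative guesses, not the paper's estimator).

```python
import numpy as np

def epls_direction(X, y, tau=0.95):
    """First EPLS-type direction: covariance of X and y over exceedances of y."""
    t = np.quantile(y, tau)
    mask = y > t
    Xe, ye = X[mask], y[mask]
    v = (Xe - Xe.mean(axis=0)).T @ (ye - ye.mean())   # empirical covariance vector
    return v / np.linalg.norm(v)

def shrunk_direction(v_hat, v_prior, lam):
    """Shrink the data-driven direction toward a prior unit direction (lam in [0, 1])."""
    w = (1 - lam) * v_hat + lam * v_prior
    return w / np.linalg.norm(w)
```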
{"title":"Shrinkage for extreme partial least-squares","authors":"Julyan Arbel, Stéphane Girard, Hadrien Lorenzo","doi":"10.1007/s11222-024-10490-w","DOIUrl":"https://doi.org/10.1007/s11222-024-10490-w","url":null,"abstract":"<p>This work focuses on dimension-reduction techniques for modelling conditional extreme values. Specifically, we investigate the idea that extreme values of a response variable can be explained by nonlinear functions derived from linear projections of an input random vector. In this context, the estimation of projection directions is examined, as approached by the extreme partial least squares (EPLS) method—an adaptation of the original partial least squares (PLS) method tailored to the extreme-value framework. Further, a novel interpretation of EPLS directions as maximum likelihood estimators is introduced, utilizing the von Mises–Fisher distribution applied to hyperballs. The dimension reduction process is enhanced through the Bayesian paradigm, enabling the incorporation of prior information into the projection direction estimation. The maximum a posteriori estimator is derived in two specific cases, elucidating it as a regularization or shrinkage of the EPLS estimator. We also establish its asymptotic behavior as the sample size approaches infinity. A simulation data study is conducted in order to assess the practical utility of our proposed method. This clearly demonstrates its effectiveness even in moderate data problems within high-dimensional settings. Furthermore, we provide an illustrative example of the method’s applicability using French farm income data, highlighting its efficacy in real-world scenarios.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nonconvex Dantzig selector and its parallel computing algorithm
Pub Date: 2024-09-16 | DOI: 10.1007/s11222-024-10492-8
Jiawei Wen, Songshan Yang, Delin Zhao
The Dantzig selector is a popular $\ell_1$-type variable selection method widely used across various research fields. However, $\ell_1$-type methods may not perform well for variable selection unless complex irrepresentable conditions are satisfied. In this article, we introduce a nonconvex Dantzig selector for ultrahigh-dimensional linear models. We begin by demonstrating that the oracle estimator serves as a local optimum of the nonconvex Dantzig selector. In addition, we propose a one-step local linear approximation estimator, called the Dantzig-LLA estimator, for the nonconvex Dantzig selector and establish its strong oracle property. The proposed regularization method avoids the restrictive conditions imposed by $\ell_1$ regularization methods to guarantee model selection consistency. Furthermore, we propose an efficient and parallelizable computing algorithm based on feature splitting to address the computational challenges associated with the nonconvex Dantzig selector in high-dimensional settings. A comprehensive numerical study evaluates the performance of the nonconvex Dantzig selector and the computational efficiency of the feature-splitting algorithm. The results demonstrate that the Dantzig selector with a nonconvex penalty outperforms the $\ell_1$ penalty-based selector, and the feature-splitting algorithm performs well in high-dimensional settings where linear programming solvers may fail. Finally, we generalize the nonconvex Dantzig selector to more general loss functions.
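As a small concrete illustration of the one-step LLA idea: starting from an initial estimate, a nonconvex penalty such as SCAD is linearized, and its derivative evaluated at the initial coefficients supplies the weights of a weighted (adaptive) Dantzig-type problem. The sketch below computes those SCAD-derivative weights; the surrounding linear program and the paper's feature-splitting scheme are not shown, and the function name is hypothetical.

```python
import numpy as np

def scad_derivative(beta_init, lam, a=3.7):
    """p'_lambda(|beta|) for the SCAD penalty; these values serve as the
    per-coefficient weights of the weighted Dantzig problem in a one-step
    LLA update (weight ~ 0 for large initial coefficients, lam for small ones)."""
    b = np.abs(beta_init)
    return np.where(b <= lam, lam, np.maximum(a * lam - b, 0.0) / (a - 1))
```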
{"title":"Nonconvex Dantzig selector and its parallel computing algorithm","authors":"Jiawei Wen, Songshan Yang, Delin Zhao","doi":"10.1007/s11222-024-10492-8","DOIUrl":"https://doi.org/10.1007/s11222-024-10492-8","url":null,"abstract":"<p>The Dantzig selector is a popular <span>(ell _1)</span>-type variable selection method widely used across various research fields. However, <span>(ell _1)</span>-type methods may not perform well for variable selection without complex irrepresentable conditions. In this article, we introduce a nonconvex Dantzig selector for ultrahigh-dimensional linear models. We begin by demonstrating that the oracle estimator serves as a local optimum for the nonconvex Dantzig selector. In addition, we propose a one-step local linear approximation estimator, called the Dantzig-LLA estimator, for the nonconvex Dantzig selector, and establish its strong oracle property. The proposed regularization method avoids the restrictive conditions imposed by <span>(ell _1)</span> regularization methods to guarantee the model selection consistency. Furthermore, we propose an efficient and parallelizable computing algorithm based on feature-splitting to address the computational challenges associated with the nonconvex Dantzig selector in high-dimensional settings. A comprehensive numerical study is conducted to evaluate the performance of the nonconvex Dantzig selector and the computing efficiency of the feature-splitting algorithm. The results demonstrate that the Dantzig selector with nonconvex penalty outperforms the <span>(ell _1)</span> penalty-based selector, and the feature-splitting algorithm performs well in high-dimensional settings where linear programming solver may fail. Finally, we generalize the concept of nonconvex Dantzig selector to deal with more general loss functions.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust singular value decomposition with application to video surveillance background modelling
Pub Date: 2024-09-11 | DOI: 10.1007/s11222-024-10493-7
Subhrajyoty Roy, Abhik Ghosh, Ayanendranath Basu
The traditional method of computing the singular value decomposition (SVD) of a data matrix is based on the least squares principle and is, therefore, very sensitive to the presence of outliers. Hence, inferences based on the classical SVD are severely degraded across applications in the presence of data contamination. In particular, background modelling of video surveillance data in the presence of camera tampering cannot be reliably solved by the classical SVD. In this paper, we propose a novel robust singular value decomposition technique based on the popular minimum density power divergence estimator. We establish theoretical properties of the proposed estimator, such as convergence, equivariance and consistency, under the high-dimensional regime where both the row and column dimensions of the data matrix approach infinity. We also propose a fast and scalable algorithm based on alternating weighted regression to obtain the estimate. Within the scope of our fairly extensive simulation studies, our method performs better than existing robust SVD algorithms. Finally, we present an application of the proposed method to the video surveillance background modelling problem.
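To make the alternating-weighted-regression idea concrete, here is a hedged rank-one sketch in which residuals are exponentially down-weighted in the spirit of the density power divergence; the exact weighting, scale estimation, and multi-rank extension used in the paper will differ, and all names are illustrative.

```python
import numpy as np

def robust_rank_one(X, alpha=0.5, n_iter=100):
    """Rank-one approximation X ~ a v^T via alternating weighted regression.
    Residuals are down-weighted by exp(-alpha * r^2 / (2 sigma^2)), in the
    spirit of minimum density power divergence estimation (illustrative)."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    a, v = S[0] * U[:, 0], Vt[0]                 # classical SVD as a warm start
    for _ in range(n_iter):
        r = X - np.outer(a, v)
        sigma = 1.4826 * np.median(np.abs(r - np.median(r))) + 1e-12  # MAD scale
        w = np.exp(-alpha * (r / sigma) ** 2 / 2.0)
        a = (w * X) @ v / ((w * v**2).sum(axis=1) + 1e-12)      # weighted row-score update
        v = (w * X).T @ a / ((w.T * a**2).sum(axis=1) + 1e-12)  # weighted column-score update
    u = a / np.linalg.norm(a)
    return np.linalg.norm(a) * np.linalg.norm(v), u, v / np.linalg.norm(v)
```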
{"title":"Robust singular value decomposition with application to video surveillance background modelling","authors":"Subhrajyoty Roy, Abhik Ghosh, Ayanendranath Basu","doi":"10.1007/s11222-024-10493-7","DOIUrl":"https://doi.org/10.1007/s11222-024-10493-7","url":null,"abstract":"<p>The traditional method of computing singular value decomposition (SVD) of a data matrix is based on the least squares principle and is, therefore, very sensitive to the presence of outliers. Hence, the resulting inferences across different applications using the classical SVD are extremely degraded in the presence of data contamination. In particular, background modelling of video surveillance data in the presence of camera tampering cannot be reliably solved by the classical SVD. In this paper, we propose a novel robust singular value decomposition technique based on the popular minimum density power divergence estimator. We have established the theoretical properties of the proposed estimator such as convergence, equivariance and consistency under the high-dimensional regime where both the row and column dimensions of the data matrix approach infinity. We also propose a fast and scalable algorithm based on alternating weighted regression to obtain the estimate. Within the scope of our fairly extensive simulation studies, our method performs better than existing robust SVD algorithms. Finally, we present an application of the proposed method on the video surveillance background modelling problem.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142185384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimal confidence interval for the difference between proportions
Pub Date: 2024-09-02 | DOI: 10.1007/s11222-024-10485-7
Almog Peer, David Azriel
Estimating the probability parameter of the binomial distribution is a basic problem, which appears in almost all introductory statistics courses and arises frequently in various studies. In some cases, the parameter of interest is a difference between two probabilities, and the current work studies the construction of confidence intervals for this parameter when the sample size is small. Our goal is to find the shortest confidence intervals under the constraint that the coverage probability is at least as large as a predetermined level. For the two-sample case, there is no known algorithm that achieves this goal, but different heuristic procedures have been suggested, and the present work aims at finding optimal confidence intervals. In the one-sample case, there is a known algorithm, presented by Blyth and Still (J Am Stat Assoc 78(381):108–116, 1983), that finds optimal confidence intervals. It is based on solving small, local optimization problems and then using an inversion step to find the globally optimal solution. We show that this approach fails in the two-sample case and therefore, in order to find optimal confidence intervals, one needs to solve a global optimization problem rather than small, local ones, which is computationally much harder. We present and discuss the suitable global optimization problem. Using the Gurobi package, we find near-optimal solutions when the sample sizes are smaller than 15, and we compare these solutions to some existing methods, both approximate and exact. We find that the improvement in length over the best competitor varies between 1.5% and 5% for different parameters of the problem. Therefore, we recommend the use of the new confidence intervals when both sample sizes are smaller than 15. Tables of the confidence intervals are given in the Excel file at this link (https://technionmail-my.sharepoint.com/:f:/g/personal/ap_campus_technion_ac_il/El-213Kms51BhQxR8MmQJCYBDfIsvtrK9mQIey1sZnZWIQ?e=hxGunl).
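For context, checking the coverage constraint for a candidate table of intervals is straightforward: at any fixed pair of probabilities the exact coverage is a double sum of binomial probabilities. A minimal sketch is given below (hypothetical function name; the interval table `ci` is whatever procedure one wishes to evaluate); the hard part addressed by the paper is the search over interval tables for minimal length subject to this constraint holding at every parameter value.

```python
import numpy as np
from scipy.stats import binom

def coverage(ci, n1, n2, p1, p2):
    """Exact coverage probability at (p1, p2) of a confidence-interval table.

    ci : dict mapping (x1, x2) -> (lower, upper) for the difference p1 - p2
    """
    delta = p1 - p2
    cov = 0.0
    for x1 in range(n1 + 1):
        for x2 in range(n2 + 1):
            lo, hi = ci[(x1, x2)]
            if lo <= delta <= hi:
                cov += binom.pmf(x1, n1, p1) * binom.pmf(x2, n2, p2)
    return cov
```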
{"title":"Optimal confidence interval for the difference between proportions","authors":"Almog Peer, David Azriel","doi":"10.1007/s11222-024-10485-7","DOIUrl":"https://doi.org/10.1007/s11222-024-10485-7","url":null,"abstract":"<p>Estimating the probability of the binomial distribution is a basic problem, which appears in almost all introductory statistics courses and is performed frequently in various studies. In some cases, the parameter of interest is a difference between two probabilities, and the current work studies the construction of confidence intervals for this parameter when the sample size is small. Our goal is to find the shortest confidence intervals under the constraint of coverage probability being at least as large as a predetermined level. For the two-sample case, there is no known algorithm that achieves this goal, but different heuristics procedures have been suggested, and the present work aims at finding optimal confidence intervals. In the one-sample case, there is a known algorithm that finds optimal confidence intervals presented by Blyth and Still (J Am Stat Assoc 78(381):108–116, 1983). It is based on solving small and local optimization problems and then using an inversion step to find the global optimum solution. We show that this approach fails in the two-sample case and therefore, in order to find optimal confidence intervals, one needs to solve a global optimization problem, rather than small and local ones, which is computationally much harder. We present and discuss the suitable global optimization problem. Using the Gurobi package we find near-optimal solutions when the sample sizes are smaller than 15, and we compare these solutions to some existing methods, both approximate and exact. We find that the improvement in terms of lengths with respect to the best competitor varies between 1.5 and 5% for different parameters of the problem. Therefore, we recommend the use of the new confidence intervals when both sample sizes are smaller than 15. Tables of the confidence intervals are given in the Excel file in this link (https://technionmail-my.sharepoint.com/:f:/g/personal/ap_campus_technion_ac_il/El-213Kms51BhQxR8MmQJCYBDfIsvtrK9mQIey1sZnZWIQ?e=hxGunl).</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A comprehensive comparison of goodness-of-fit tests for logistic regression models
Pub Date: 2024-08-30 | DOI: 10.1007/s11222-024-10487-5
Huiling Liu, Xinmin Li, Feifei Chen, Wolfgang Härdle, Hua Liang
We introduce a projection-based test for assessing logistic regression models using the empirical residual marked empirical process and suggest a model-based bootstrap procedure to calculate critical values. We comprehensively compare this test and Stute and Zhu's test with several commonly used goodness-of-fit (GoF) tests for logistic regression models: the Hosmer–Lemeshow test, modified Hosmer–Lemeshow test, Osius–Rojek test, and Stukel test, in terms of type I error control and power performance at small ($n=50$), moderate ($n=100$), and large ($n=500$) sample sizes. We assess the power performance for two commonly encountered situations: nonlinear and interaction departures from the null hypothesis. All tests except the modified Hosmer–Lemeshow test and Osius–Rojek test have the correct size at all sample sizes. The power of the projection-based test consistently outperforms its competitors. We apply these tests to analyze an AIDS dataset and a cancer dataset. For the former, all tests except the projection-based test fail to reject a simple linear function in the logit, which has been shown in the literature to be deficient. For the latter dataset, the Hosmer–Lemeshow test, modified Hosmer–Lemeshow test, and Osius–Rojek test fail to detect the quadratic form in the logit, which is detected by the Stukel test, Stute and Zhu's test, and the projection-based test.
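For reference, the Hosmer–Lemeshow statistic mentioned above groups observations by quantiles of the fitted probabilities and compares observed with expected event counts, referring the result to a chi-squared distribution with g − 2 degrees of freedom. A minimal sketch follows (hypothetical names; a binary 0/1 response as a NumPy array is assumed).

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, g=10):
    """Hosmer-Lemeshow GoF statistic: group by quantiles of fitted probability,
    compare observed and expected event counts, refer to chi^2 with g-2 df."""
    order = np.argsort(p_hat)
    groups = np.array_split(order, g)          # roughly equal-sized risk groups
    stat = 0.0
    for idx in groups:
        obs = y[idx].sum()                     # observed events in the group
        exp = p_hat[idx].sum()                 # expected events in the group
        n_g = len(idx)
        stat += (obs - exp) ** 2 / (exp * (1 - exp / n_g) + 1e-12)
    pval = chi2.sf(stat, df=g - 2)
    return stat, pval
```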
{"title":"A comprehensive comparison of goodness-of-fit tests for logistic regression models","authors":"Huiling Liu, Xinmin Li, Feifei Chen, Wolfgang Härdle, Hua Liang","doi":"10.1007/s11222-024-10487-5","DOIUrl":"https://doi.org/10.1007/s11222-024-10487-5","url":null,"abstract":"<p>We introduce a projection-based test for assessing logistic regression models using the empirical residual marked empirical process and suggest a model-based bootstrap procedure to calculate critical values. We comprehensively compare this test and Stute and Zhu’s test with several commonly used goodness-of-fit (GoF) tests: the Hosmer–Lemeshow test, modified Hosmer–Lemeshow test, Osius–Rojek test, and Stukel test for logistic regression models in terms of type I error control and power performance in small (<span>(n=50)</span>), moderate (<span>(n=100)</span>), and large (<span>(n=500)</span>) sample sizes. We assess the power performance for two commonly encountered situations: nonlinear and interaction departures from the null hypothesis. All tests except the modified Hosmer–Lemeshow test and Osius–Rojek test have the correct size in all sample sizes. The power performance of the projection based test consistently outperforms its competitors. We apply these tests to analyze an AIDS dataset and a cancer dataset. For the former, all tests except the projection-based test do not reject a simple linear function in the logit, which has been illustrated to be deficient in the literature. For the latter dataset, the Hosmer–Lemeshow test, modified Hosmer–Lemeshow test, and Osius–Rojek test fail to detect the quadratic form in the logit, which was detected by the Stukel test, Stute and Zhu’s test, and the projection-based test.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142185473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
New forest-based approaches for sufficient dimension reduction
Pub Date: 2024-08-30 | DOI: 10.1007/s11222-024-10482-w
Shuang Dai, Ping Wu, Zhou Yu
Sufficient dimension reduction (SDR) primarily aims to reduce the dimensionality of high-dimensional predictor variables while retaining essential information about the responses. Traditional SDR methods typically employ kernel weighting functions, which unfortunately makes them susceptible to the curse of dimensionality. To address this issue, in this paper we propose novel forest-based approaches for SDR that utilize a locally adaptive kernel generated by Mondrian forests. Overall, our work takes the perspective of the Mondrian forest as an adaptive weighted kernel technique for SDR problems. In the central mean subspace model, by integrating the methods of Xia et al. (J R Stat Soc Ser B (Stat Methodol) 64(3):363–410, 2002. https://doi.org/10.1111/1467-9868.03411) with Mondrian forest weights, we propose the forest-based outer product of gradients estimation (mf-OPG) and the forest-based minimum average variance estimation (mf-MAVE). Moreover, we substitute the kernels used in nonparametric density function estimation (Xia in Ann Stat 35(6):2654–2690, 2007. https://doi.org/10.1214/009053607000000352), targeting the central subspace, with Mondrian forest weights. These techniques are referred to as mf-dOPG and mf-dMAVE, respectively. Under regularity conditions, we establish the asymptotic properties of our forest-based estimators, as well as the convergence of the associated algorithms. Through simulation studies and analysis of fully observable data, we demonstrate substantial improvements in the computational efficiency and predictive accuracy of our proposals compared with their traditional counterparts.
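To illustrate the OPG building block that mf-OPG adapts, the sketch below estimates local gradients by kernel-weighted local linear regression and eigen-decomposes their average outer product; a Gaussian kernel with a rough rule-of-thumb bandwidth stands in for the Mondrian-forest weights that are the paper's actual contribution, and all names are hypothetical.

```python
import numpy as np

def opg_directions(X, y, d, h=None):
    """Outer product of gradients (OPG): local linear fits give gradient
    estimates whose average outer product is eigen-decomposed."""
    n, p = X.shape
    h = h if h is not None else n ** (-1.0 / (p + 4))  # rough bandwidth; assumes standardized X
    M = np.zeros((p, p))
    for i in range(n):
        diff = X - X[i]
        w = np.exp(-np.sum(diff ** 2, axis=1) / (2 * h ** 2))   # Gaussian kernel weights
        Z = np.hstack([np.ones((n, 1)), diff])                  # local linear design
        W = Z * w[:, None]
        coef = np.linalg.lstsq(W.T @ Z, W.T @ y, rcond=None)[0] # weighted least squares
        g = coef[1:]                                            # estimated gradient at X[i]
        M += np.outer(g, g) / n
    eigval, eigvec = np.linalg.eigh(M)
    return eigvec[:, ::-1][:, :d]                               # leading d directions
```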
{"title":"New forest-based approaches for sufficient dimension reduction","authors":"Shuang Dai, Ping Wu, Zhou Yu","doi":"10.1007/s11222-024-10482-w","DOIUrl":"https://doi.org/10.1007/s11222-024-10482-w","url":null,"abstract":"<p>Sufficient dimension reduction (SDR) primarily aims to reduce the dimensionality of high-dimensional predictor variables while retaining essential information about the responses. Traditional SDR methods typically employ kernel weighting functions, which unfortunately makes them susceptible to the curse of dimensionality. To address this issue, we in this paper propose novel forest-based approaches for SDR that utilize a locally adaptive kernel generated by Mondrian forests. Overall, our work takes the perspective of Mondrian forest as an adaptive weighted kernel technique for SDR problems. In the central mean subspace model, by integrating the methods from Xia et al. (J R Stat Soc Ser B (Stat Methodol) 64(3):363–410, 2002. https://doi.org/10.1111/1467-9868.03411) with Mondrian forest weights, we suggest the forest-based outer product of gradients estimation (mf-OPG) and the forest-based minimum average variance estimation (mf-MAVE). Moreover, we substitute the kernels used in nonparametric density function estimations (Xia in Ann Stat 35(6):2654–2690, 2007. https://doi.org/10.1214/009053607000000352), targeting the central subspace, with Mondrian forest weights. These techniques are referred to as mf-dOPG and mf-dMAVE, respectively. Under regularity conditions, we establish the asymptotic properties of our forest-based estimators, as well as the convergence of the affiliated algorithms. Through simulation studies and analysis of fully observable data, we demonstrate substantial improvements in computational efficiency and predictive accuracy of our proposals compared with the traditional counterparts.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142185385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SB-ETAS: using simulation based inference for scalable, likelihood-free inference for the ETAS model of earthquake occurrences
Pub Date: 2024-08-29 | DOI: 10.1007/s11222-024-10486-6
Samuel Stockman, Daniel J. Lawson, Maximilian J. Werner
The rapid growth of earthquake catalogs, driven by machine learning-based phase picking and denser seismic networks, calls for the application of a broader range of models to determine whether the new data enhance earthquake forecasting capabilities. Additionally, this growth demands that existing forecasting models scale efficiently to handle the increased data volume. Approximate inference methods such as inlabru, which is based on the integrated nested Laplace approximation, offer improved computational efficiency and the ability to perform inference on more complex point-process models compared to traditional MCMC approaches. We present SB-ETAS: a simulation-based inference procedure for the epidemic-type aftershock sequence (ETAS) model. This approximate Bayesian method uses sequential neural posterior estimation (SNPE) to learn posterior distributions from simulations, rather than typical MCMC sampling using the likelihood. On synthetic earthquake catalogs, SB-ETAS provides better coverage of ETAS posterior distributions than inlabru. Furthermore, we demonstrate that using a simulation-based procedure for inference improves the scalability from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$. This makes it feasible to fit very large earthquake catalogs, such as one for Southern California dating back to 1981. SB-ETAS can find Bayesian estimates of ETAS parameters for this catalog in less than 10 h on a standard laptop, a task that would have taken over 2 weeks using MCMC. Beyond the standard ETAS model, this simulation-based framework allows earthquake modellers to define and infer parameters for much more complex models by removing the need to define a likelihood function.
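Simulation-based inference of this kind only requires the ability to simulate ETAS catalogs. A minimal temporal ETAS simulator built on the branching structure is sketched below (hypothetical names; a standard parameterization with productivity K·exp(α(m − m0)), an Omori law with p > 1, and exponential Gutenberg–Richter magnitudes is assumed, and the expected offspring count must stay below one for the recursion to terminate); the paper's simulator and the SNPE training loop are not reproduced.

```python
import numpy as np

def simulate_etas(mu, K, alpha, c, p_om, beta, m0, T, rng=None):
    """Simulate a temporal ETAS catalog on [0, T] via its branching structure.

    mu : background rate; K, alpha : productivity; c, p_om : Omori law (p_om > 1);
    beta : Gutenberg-Richter exponent for magnitudes above m0.
    Returns sorted arrays of event times and magnitudes.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_bg = rng.poisson(mu * T)                       # background events
    times = list(rng.uniform(0, T, n_bg))
    mags = list(m0 + rng.exponential(1.0 / beta, n_bg))
    queue = list(zip(times, mags))                   # events that can still spawn offspring
    while queue:
        t_par, m_par = queue.pop()
        n_off = rng.poisson(K * np.exp(alpha * (m_par - m0)))
        for _ in range(n_off):
            # Omori-law waiting time via inverse-CDF sampling
            u = rng.uniform()
            dt = c * ((1 - u) ** (1.0 / (1.0 - p_om)) - 1.0)
            t_child = t_par + dt
            if t_child > T:
                continue
            m_child = m0 + rng.exponential(1.0 / beta)
            times.append(t_child)
            mags.append(m_child)
            queue.append((t_child, m_child))
    order = np.argsort(times)
    return np.array(times)[order], np.array(mags)[order]
```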
{"title":"SB-ETAS: using simulation based inference for scalable, likelihood-free inference for the ETAS model of earthquake occurrences","authors":"Samuel Stockman, Daniel J. Lawson, Maximilian J. Werner","doi":"10.1007/s11222-024-10486-6","DOIUrl":"https://doi.org/10.1007/s11222-024-10486-6","url":null,"abstract":"<p>The rapid growth of earthquake catalogs, driven by machine learning-based phase picking and denser seismic networks, calls for the application of a broader range of models to determine whether the new data enhances earthquake forecasting capabilities. Additionally, this growth demands that existing forecasting models efficiently scale to handle the increased data volume. Approximate inference methods such as <span>inlabru</span>, which is based on the Integrated nested Laplace approximation, offer improved computational efficiencies and the ability to perform inference on more complex point-process models compared to traditional MCMC approaches. We present SB-ETAS: a simulation based inference procedure for the epidemic-type aftershock sequence (ETAS) model. This approximate Bayesian method uses sequential neural posterior estimation (SNPE) to learn posterior distributions from simulations, rather than typical MCMC sampling using the likelihood. On synthetic earthquake catalogs, SB-ETAS provides better coverage of ETAS posterior distributions compared with <span>inlabru</span>. Furthermore, we demonstrate that using a simulation based procedure for inference improves the scalability from <span>(mathcal {O}(n^2))</span> to <span>(mathcal {O}(nlog n))</span>. This makes it feasible to fit to very large earthquake catalogs, such as one for Southern California dating back to 1981. SB-ETAS can find Bayesian estimates of ETAS parameters for this catalog in less than 10 h on a standard laptop, a task that would have taken over 2 weeks using MCMC. Beyond the standard ETAS model, this simulation based framework allows earthquake modellers to define and infer parameters for much more complex models by removing the need to define a likelihood function.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142185402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}