
Latest articles in Statistics and Computing

Hidden Markov models for multivariate panel data
IF 2.2 | CAS Tier 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-18 | DOI: 10.1007/s11222-024-10462-0
Mackenzie R. Neal, Alexa A. Sochaniwsky, Paul D. McNicholas

While advances continue to be made in model-based clustering, challenges persist in modeling various data types, such as panel data. Multivariate panel data present difficulties for clustering algorithms because they are often plagued by missing data and dropout, which create issues for estimation algorithms. This research presents a family of hidden Markov models that compensate for these issues. A modified expectation–maximization algorithm capable of handling missing-not-at-random data and dropout is presented and used to perform model estimation.
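The paper's models are fitted with a modified EM algorithm; the likelihood evaluation inside any HMM E-step rests on the scaled forward recursion. A minimal sketch for a fully observed, discrete-emission chain (a toy setup of our own, not the paper's multivariate panel model with missingness):

```python
import numpy as np

def hmm_forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete-emission HMM via the scaled forward algorithm.

    obs : sequence of observation indices
    pi  : (K,) initial state probabilities
    A   : (K, K) transition matrix, A[i, j] = P(state j | state i)
    B   : (K, M) emission matrix, B[k, m] = P(obs m | state k)
    """
    alpha = pi * B[:, obs[0]]          # joint prob. of first obs and each state
    loglik = 0.0
    for t in range(1, len(obs)):
        c = alpha.sum()                # scaling constant, avoids underflow
        loglik += np.log(c)
        alpha = (alpha / c) @ A * B[:, obs[t]]
    loglik += np.log(alpha.sum())
    return loglik
```

The same recursion, with emission probabilities replaced by multivariate densities (and marginalized over missing coordinates), underlies the E-step of panel-data HMMs.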

Citations: 0
Accelerated failure time models with error-prone response and nonlinear covariates
IF 2.2 | CAS Tier 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-18 | DOI: 10.1007/s11222-024-10491-9
Li-Pang Chen

As a specific application of survival analysis, one of the main interests in medical studies is to analyze patients' survival times for a specific cancer. Typically, gene expressions are treated as covariates to characterize the survival time. In the framework of survival analysis, the accelerated failure time model in parametric form is perhaps a common approach. However, the effects of gene expressions are possibly nonlinear, and both the survival time and the censoring status are subject to measurement error. In this paper, we aim to tackle these complex features simultaneously. We first correct for measurement error in the survival time and censoring status, and use them to develop a corrected Buckley–James estimator. After that, we use the boosting algorithm with the cubic spline estimation method to iteratively recover the nonlinear relationship between covariates and survival time. Theoretically, we justify the validity of the measurement error correction and the estimation procedure. Numerical studies show that the proposed method improves estimation performance and is able to capture informative covariates. The methodology is primarily used to analyze the breast cancer data provided by the Netherlands Cancer Institute for research.
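The boosting-with-cubic-splines step can be illustrated on clean, uncensored data under squared loss: repeatedly fit a spline to the current residuals and take a small shrunken step toward it. A hedged sketch (the interpolating spline and learning rate are our illustrative choices; the paper applies the idea to corrected Buckley–James responses):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def boost_spline(x, y, n_steps=60, lr=0.1, s=0):
    """L2 gradient boosting with a cubic smoothing-spline base learner.

    At each step a spline is fitted to the current residuals and the fit
    is updated by a shrunken amount lr of that spline (s=0 gives an
    interpolating spline; shrinkage via lr supplies the regularization).
    """
    order = np.argsort(x)                      # UnivariateSpline needs increasing x
    fit = np.full_like(y, y.mean(), dtype=float)
    for _ in range(n_steps):
        resid = y - fit
        spl = UnivariateSpline(x[order], resid[order], k=3, s=s)
        fit = fit + lr * spl(x)                # shrunken update toward residual fit
    return fit
```

With an interpolating base learner the residuals shrink geometrically by a factor (1 - lr) per step, which makes the convergence easy to reason about in this toy setting.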

Citations: 0
Sequential model identification with reversible jump ensemble data assimilation method
IF 2.2 | CAS Tier 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-18 | DOI: 10.1007/s11222-024-10499-1
Yue Huan, Hai Xiang Lin

In data assimilation (DA) schemes, the forms representing the processes in the evolution model are pre-determined, except for some parameters to be estimated. In some applications, such as the contaminant solute transport model and the gas reservoir model, the modes in the equations within the evolution model cannot be predetermined from the outset and may change with time. We propose a sequential DA framework named the Reversible Jump Ensemble Filter (RJEnF) to identify the governing modes of the evolution model over time. The main idea is to introduce the Reversible Jump Markov Chain Monte Carlo (RJMCMC) method into the DA scheme to handle the situation where the modes of the evolution model are unknown and the dimension of the parameters is changing. Our framework allows us to identify the modes in the evolution model and their changes, as well as to estimate the parameters and states of the dynamic system. Numerical experiments are conducted, and the results show that our framework can effectively identify the underlying evolution models and increase the predictive accuracy of DA methods.
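RJEnF adds reversible-jump moves over model structure on top of ensemble data assimilation. The underlying (non-jump) ingredient is the stochastic ensemble Kalman filter analysis step, sketched below as background (a textbook EnKF update with perturbed observations, not the paper's RJEnF):

```python
import numpy as np

def enkf_update(ensemble, obs, H, R, rng):
    """Stochastic EnKF analysis step with perturbed observations.

    ensemble : (N, d) forecast ensemble of states
    obs      : (m,) observation vector
    H        : (m, d) linear observation operator
    R        : (m, m) observation-error covariance
    """
    N = ensemble.shape[0]
    X = ensemble - ensemble.mean(axis=0)          # state anomalies
    Y = X @ H.T                                   # anomalies in observation space
    Pyy = Y.T @ Y / (N - 1) + R                   # innovation covariance
    Pxy = X.T @ Y / (N - 1)                       # state-observation cross covariance
    K = Pxy @ np.linalg.inv(Pyy)                  # Kalman gain
    perturbed = obs + rng.multivariate_normal(np.zeros(len(obs)), R, size=N)
    return ensemble + (perturbed - ensemble @ H.T) @ K.T
```

In a one-dimensional Gaussian setting this update reproduces the usual Kalman posterior mean and shrinks the ensemble spread toward the observation.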

Citations: 0
Shrinkage for extreme partial least-squares
IF 2.2 | CAS Tier 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-17 | DOI: 10.1007/s11222-024-10490-w
Julyan Arbel, Stéphane Girard, Hadrien Lorenzo

This work focuses on dimension-reduction techniques for modelling conditional extreme values. Specifically, we investigate the idea that extreme values of a response variable can be explained by nonlinear functions derived from linear projections of an input random vector. In this context, the estimation of projection directions is examined, as approached by the extreme partial least squares (EPLS) method, an adaptation of the original partial least squares (PLS) method tailored to the extreme-value framework. Further, a novel interpretation of EPLS directions as maximum likelihood estimators is introduced, utilizing the von Mises–Fisher distribution applied to hyperballs. The dimension reduction process is enhanced through the Bayesian paradigm, enabling the incorporation of prior information into the projection direction estimation. The maximum a posteriori estimator is derived in two specific cases, elucidating it as a regularization or shrinkage of the EPLS estimator. We also establish its asymptotic behavior as the sample size approaches infinity. A simulation study is conducted to assess the practical utility of our proposed method. This clearly demonstrates its effectiveness even in moderate data problems within high-dimensional settings. Furthermore, we provide an illustrative example of the method's applicability using French farm income data, highlighting its efficacy in real-world scenarios.
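The shrinkage interpretation can be made concrete in a toy form: under a von Mises–Fisher likelihood for unit vectors with concentration kappa, and a vMF prior on the direction with concentration kappa0, the posterior mode is proportional to kappa times the data resultant plus kappa0 times the prior direction, i.e. the data direction shrunk toward the prior. A minimal sketch of that MAP direction (our illustration, not the paper's EPLS estimator):

```python
import numpy as np

def vmf_map_direction(X, mu0, kappa, kappa0):
    """MAP mean direction for unit vectors X (one per row) under a
    von Mises-Fisher likelihood with concentration kappa and a
    vMF(mu0, kappa0) prior on the direction.

    The log-posterior is linear in mu, so the mode on the unit sphere
    is the normalization of kappa * sum(X) + kappa0 * mu0.
    """
    post = kappa * X.sum(axis=0) + kappa0 * np.asarray(mu0, dtype=float)
    return post / np.linalg.norm(post)
```

Here kappa0 plays the role of the regularization strength: kappa0 = 0 recovers the data-only direction, while kappa0 large pulls the estimate onto the prior direction.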

Citations: 0
Nonconvex Dantzig selector and its parallel computing algorithm
IF 2.2 | CAS Tier 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-16 | DOI: 10.1007/s11222-024-10492-8
Jiawei Wen, Songshan Yang, Delin Zhao

The Dantzig selector is a popular ℓ1-type variable selection method widely used across various research fields. However, ℓ1-type methods may not perform well for variable selection in the absence of complex irrepresentable conditions. In this article, we introduce a nonconvex Dantzig selector for ultrahigh-dimensional linear models. We begin by demonstrating that the oracle estimator serves as a local optimum for the nonconvex Dantzig selector. In addition, we propose a one-step local linear approximation estimator, called the Dantzig-LLA estimator, for the nonconvex Dantzig selector, and establish its strong oracle property. The proposed regularization method avoids the restrictive conditions imposed by ℓ1 regularization methods to guarantee model selection consistency. Furthermore, we propose an efficient and parallelizable computing algorithm based on feature-splitting to address the computational challenges associated with the nonconvex Dantzig selector in high-dimensional settings. A comprehensive numerical study is conducted to evaluate the performance of the nonconvex Dantzig selector and the computing efficiency of the feature-splitting algorithm. The results demonstrate that the Dantzig selector with nonconvex penalty outperforms the ℓ1 penalty-based selector, and the feature-splitting algorithm performs well in high-dimensional settings where a linear programming solver may fail. Finally, we generalize the concept of the nonconvex Dantzig selector to deal with more general loss functions.
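The classical (convex) ℓ1 Dantzig selector, the baseline the nonconvex version improves upon, reduces to a linear program: minimize the ℓ1 norm of the coefficients subject to a sup-norm bound on the correlation of the residual with the design. A minimal sketch via the standard positive-part split (not the paper's nonconvex or feature-splitting algorithm):

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Classical l1 Dantzig selector:
        min ||b||_1  subject to  ||X'(y - X b)||_inf <= lam,
    solved as an LP with the split b = u - v, u >= 0, v >= 0."""
    n, p = X.shape
    G = X.T @ X
    xty = X.T @ y
    c = np.ones(2 * p)                             # objective: sum(u) + sum(v)
    # Constraints:  G(u - v) <= lam + X'y  and  -G(u - v) <= lam - X'y
    A_ub = np.block([[G, -G], [-G, G]])
    b_ub = np.concatenate([lam + xty, lam - xty])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    u, v = res.x[:p], res.x[p:]
    return u - v
```

For an orthonormal design the solution is the soft-thresholding of X'y at level lam, which gives a quick sanity check.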

Citations: 0
Robust singular value decomposition with application to video surveillance background modelling
IF 2.2 | CAS Tier 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-11 | DOI: 10.1007/s11222-024-10493-7
Subhrajyoty Roy, Abhik Ghosh, Ayanendranath Basu

The traditional method of computing singular value decomposition (SVD) of a data matrix is based on the least squares principle and is, therefore, very sensitive to the presence of outliers. Hence, the resulting inferences across different applications using the classical SVD are extremely degraded in the presence of data contamination. In particular, background modelling of video surveillance data in the presence of camera tampering cannot be reliably solved by the classical SVD. In this paper, we propose a novel robust singular value decomposition technique based on the popular minimum density power divergence estimator. We have established the theoretical properties of the proposed estimator such as convergence, equivariance and consistency under the high-dimensional regime where both the row and column dimensions of the data matrix approach infinity. We also propose a fast and scalable algorithm based on alternating weighted regression to obtain the estimate. Within the scope of our fairly extensive simulation studies, our method performs better than existing robust SVD algorithms. Finally, we present an application of the proposed method on the video surveillance background modelling problem.
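The alternating-weighted-regression idea can be sketched in the rank-one case: fit X ≈ a b' by alternating weighted least squares, where cells with large residuals are downweighted. The sketch below uses Huber-type weights as the downweighting rule for illustration; the paper derives its weights from the minimum density power divergence instead.

```python
import numpy as np

def robust_rank1(X, n_iter=100, delta=1.0):
    """Rank-one robust SVD by alternating weighted least squares.

    Cells with residual |r| > delta receive weight delta/|r| < 1
    (the Huber psi(r)/r weight), so outlying entries pull less on the fit.
    Returns (u, sigma, v) with unit-norm u and v.
    """
    u0, s0, vt0 = np.linalg.svd(X, full_matrices=False)
    a = np.sqrt(s0[0]) * u0[:, 0]                 # warm start from classical SVD
    b = np.sqrt(s0[0]) * vt0[0]
    for _ in range(n_iter):
        R = np.abs(X - np.outer(a, b))
        W = np.where(R <= delta, 1.0, delta / np.maximum(R, 1e-12))
        a = (W * X) @ b / (W @ b ** 2)            # weighted LS update of row scores
        b = (W * X).T @ a / (W.T @ a ** 2)        # weighted LS update of column loadings
    sigma = np.linalg.norm(a) * np.linalg.norm(b)
    return a / np.linalg.norm(a), sigma, b / np.linalg.norm(b)
```

On an exactly rank-one matrix all weights stay at one and the classical SVD is reproduced; with a contaminated cell, the downweighting keeps the recovered direction close to the uncontaminated one.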

Citations: 0
Optimal confidence interval for the difference between proportions
IF 2.2 | CAS Tier 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-02 | DOI: 10.1007/s11222-024-10485-7
Almog Peer, David Azriel

Estimating the probability of the binomial distribution is a basic problem, which appears in almost all introductory statistics courses and is performed frequently in various studies. In some cases, the parameter of interest is a difference between two probabilities, and the current work studies the construction of confidence intervals for this parameter when the sample size is small. Our goal is to find the shortest confidence intervals under the constraint of coverage probability being at least as large as a predetermined level. For the two-sample case, there is no known algorithm that achieves this goal, but different heuristics procedures have been suggested, and the present work aims at finding optimal confidence intervals. In the one-sample case, there is a known algorithm that finds optimal confidence intervals presented by Blyth and Still (J Am Stat Assoc 78(381):108–116, 1983). It is based on solving small and local optimization problems and then using an inversion step to find the global optimum solution. We show that this approach fails in the two-sample case and therefore, in order to find optimal confidence intervals, one needs to solve a global optimization problem, rather than small and local ones, which is computationally much harder. We present and discuss the suitable global optimization problem. Using the Gurobi package we find near-optimal solutions when the sample sizes are smaller than 15, and we compare these solutions to some existing methods, both approximate and exact. We find that the improvement in terms of lengths with respect to the best competitor varies between 1.5 and 5% for different parameters of the problem. Therefore, we recommend the use of the new confidence intervals when both sample sizes are smaller than 15. Tables of the confidence intervals are given in the Excel file in this link (https://technionmail-my.sharepoint.com/:f:/g/personal/ap_campus_technion_ac_il/El-213Kms51BhQxR8MmQJCYBDfIsvtrK9mQIey1sZnZWIQ?e=hxGunl).
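The design criterion, coverage probability at least the nominal level at every parameter value, can be checked exactly in the one-sample case by summing the binomial pmf over the outcomes whose interval covers p. A minimal sketch, auditing the standard Wald interval (our illustration; the paper's two-sample optimal intervals require solving a global optimization problem instead):

```python
import numpy as np
from math import comb

def exact_coverage(ci_fn, n, p):
    """Exact coverage probability at p of the interval procedure
    ci_fn(x, n) -> (lo, hi): sum the Binomial(n, p) pmf over all
    outcomes x whose interval contains p."""
    cov = 0.0
    for x in range(n + 1):
        lo, hi = ci_fn(x, n)
        if lo <= p <= hi:
            cov += comb(n, x) * p ** x * (1 - p) ** (n - x)
    return cov

def wald_ci(x, n, z=1.96):
    """Standard Wald interval for a single proportion."""
    phat = x / n
    half = z * np.sqrt(phat * (1 - phat) / n)
    return phat - half, phat + half
```

Running this for small n makes the familiar Wald undercoverage visible immediately, which is exactly the defect that coverage-constrained constructions are designed to rule out.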

Citations: 0
A comprehensive comparison of goodness-of-fit tests for logistic regression models
IF 2.2 | CAS Tier 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-08-30 | DOI: 10.1007/s11222-024-10487-5
Huiling Liu, Xinmin Li, Feifei Chen, Wolfgang Härdle, Hua Liang

We introduce a projection-based test for assessing logistic regression models using the empirical residual marked empirical process and suggest a model-based bootstrap procedure to calculate critical values. We comprehensively compare this test and Stute and Zhu's test with several commonly used goodness-of-fit (GoF) tests: the Hosmer–Lemeshow test, modified Hosmer–Lemeshow test, Osius–Rojek test, and Stukel test for logistic regression models in terms of type I error control and power performance in small (n=50), moderate (n=100), and large (n=500) sample sizes. We assess the power performance for two commonly encountered situations: nonlinear and interaction departures from the null hypothesis. All tests except the modified Hosmer–Lemeshow test and Osius–Rojek test have the correct size in all sample sizes. The power performance of the projection-based test consistently outperforms its competitors. We apply these tests to analyze an AIDS dataset and a cancer dataset. For the former, all tests except the projection-based test do not reject a simple linear function in the logit, which has been illustrated to be deficient in the literature. For the latter dataset, the Hosmer–Lemeshow test, modified Hosmer–Lemeshow test, and Osius–Rojek test fail to detect the quadratic form in the logit, which was detected by the Stukel test, Stute and Zhu's test, and the projection-based test.
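Of the tests compared, the Hosmer–Lemeshow statistic is the simplest to state: group observations by fitted-probability deciles and compare observed versus expected event counts per group. A minimal sketch (using the usual g - 2 degrees-of-freedom convention; grouping conventions vary across implementations):

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p, g=10):
    """Hosmer-Lemeshow goodness-of-fit statistic for a logistic model.

    y : (n,) binary outcomes; p : (n,) fitted probabilities.
    Observations are split into g groups by sorted fitted probability;
    the statistic sums (observed - expected)^2 / (n_k * pbar * (1 - pbar))
    over groups and is referred to a chi-squared with g - 2 df.
    """
    order = np.argsort(p, kind="stable")       # stable sort: deterministic ties
    groups = np.array_split(order, g)
    H = 0.0
    for idx in groups:
        obs = y[idx].sum()                     # observed events in group
        exp = p[idx].sum()                     # expected events in group
        n_k = len(idx)
        pbar = exp / n_k
        H += (obs - exp) ** 2 / (n_k * pbar * (1 - pbar))
    return H, chi2.sf(H, g - 2)
```

A perfectly calibrated grouping gives H = 0 (p-value 1), while a grossly miscalibrated model produces a huge H and a vanishing p-value.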

Cited: 0
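The Hosmer–Lemeshow test compared above is straightforward to compute directly. The sketch below is a minimal plain-NumPy version; the equal-sized grouping convention and the hard-coded chi-square critical value for g = 10 groups are my assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def hosmer_lemeshow(y, p, g=10):
    """Hosmer-Lemeshow statistic: sort observations by fitted probability,
    split them into g equal-sized groups, and compare observed with
    expected event counts in each group."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    order = np.argsort(p)
    y, p = y[order], p[order]
    stat = 0.0
    for idx in np.array_split(np.arange(len(y)), g):
        n_k = len(idx)
        obs = y[idx].sum()          # observed events in group k
        exp = p[idx].sum()          # expected events in group k
        pbar = exp / n_k            # mean fitted probability in group k
        stat += (obs - exp) ** 2 / (n_k * pbar * (1.0 - pbar))
    return stat

# Under the null, the statistic is roughly chi-square with g - 2 df;
# for g = 10 the upper 5% point of chi2(8) is about 15.51.
CHI2_8_CRIT_95 = 15.51

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
p_true = 1.0 / (1.0 + np.exp(-x))   # correctly specified logistic model
y_obs = rng.binomial(1, p_true)
stat = hosmer_lemeshow(y_obs, p_true)
reject = stat > CHI2_8_CRIT_95
```

As the abstract notes, decile-based statistics of this kind can miss nonlinear departures in the logit, which is exactly the regime where the projection-based test is reported to do better.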
New forest-based approaches for sufficient dimension reduction
IF 2.2 CAS Tier 2 (Mathematics) Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2024-08-30 DOI: 10.1007/s11222-024-10482-w
Shuang Dai, Ping Wu, Zhou Yu

Sufficient dimension reduction (SDR) primarily aims to reduce the dimensionality of high-dimensional predictor variables while retaining essential information about the responses. Traditional SDR methods typically employ kernel weighting functions, which makes them susceptible to the curse of dimensionality. To address this issue, in this paper we propose novel forest-based approaches for SDR that utilize a locally adaptive kernel generated by Mondrian forests. Our work treats the Mondrian forest as an adaptive weighted kernel technique for SDR problems. In the central mean subspace model, by integrating the methods of Xia et al. (J R Stat Soc Ser B (Stat Methodol) 64(3):363–410, 2002. https://doi.org/10.1111/1467-9868.03411) with Mondrian forest weights, we propose forest-based outer product of gradients estimation (mf-OPG) and forest-based minimum average variance estimation (mf-MAVE). Moreover, we substitute Mondrian forest weights for the kernels used in nonparametric density function estimation (Xia in Ann Stat 35(6):2654–2690, 2007. https://doi.org/10.1214/009053607000000352), targeting the central subspace. These techniques are referred to as mf-dOPG and mf-dMAVE, respectively. Under regularity conditions, we establish the asymptotic properties of our forest-based estimators, as well as the convergence of the affiliated algorithms. Through simulation studies and analysis of fully observable data, we demonstrate substantial improvements in computational efficiency and predictive accuracy of our proposals compared with their traditional counterparts.

Cited: 0
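The classical outer product of gradients (OPG) estimator that mf-OPG adapts can be sketched with an ordinary Gaussian kernel in place of the Mondrian-forest weights; the forest weighting is the paper's contribution and is not reproduced here, and all parameter choices below are illustrative:

```python
import numpy as np

def opg(X, y, d=1, h=1.0):
    """Outer product of gradients (OPG): fit a kernel-weighted local linear
    regression at each sample point, collect the estimated gradients, and
    return the top-d eigenvectors of their average outer product as a basis
    estimate of the central mean subspace."""
    n, p = X.shape
    M = np.zeros((p, p))
    for i in range(n):
        diff = X - X[i]                                    # (n, p) centered design
        w = np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * h ** 2))
        Z = np.hstack([np.ones((n, 1)), diff])             # intercept + slopes
        A = Z.T @ (w[:, None] * Z) + 1e-8 * np.eye(p + 1)  # tiny ridge for stability
        beta = np.linalg.solve(A, Z.T @ (w * y))
        grad = beta[1:]                                    # local gradient estimate
        M += np.outer(grad, grad)
    _, vecs = np.linalg.eigh(M / n)                        # eigenvalues ascending
    return vecs[:, -d:]

rng = np.random.default_rng(1)
n, p = 400, 5
X = rng.normal(size=(n, p))
b_true = np.zeros(p)
b_true[0] = 1.0                                            # true direction e_1
y = (X @ b_true) ** 2 + 0.1 * rng.normal(size=n)
b_hat = opg(X, y).ravel()
cos_sim = abs(b_hat @ b_true) / np.linalg.norm(b_hat)
```

Replacing the fixed-bandwidth Gaussian weights `w` with locally adaptive weights (as the Mondrian-forest approach does) is precisely what targets the curse of dimensionality mentioned in the abstract.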
SB-ETAS: using simulation based inference for scalable, likelihood-free inference for the ETAS model of earthquake occurrences
IF 2.2 CAS Tier 2 (Mathematics) Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2024-08-29 DOI: 10.1007/s11222-024-10486-6
Samuel Stockman, Daniel J. Lawson, Maximilian J. Werner

The rapid growth of earthquake catalogs, driven by machine learning-based phase picking and denser seismic networks, calls for the application of a broader range of models to determine whether the new data enhances earthquake forecasting capabilities. Additionally, this growth demands that existing forecasting models efficiently scale to handle the increased data volume. Approximate inference methods such as inlabru, which is based on the integrated nested Laplace approximation, offer improved computational efficiency and the ability to perform inference on more complex point-process models compared to traditional MCMC approaches. We present SB-ETAS: a simulation based inference procedure for the epidemic-type aftershock sequence (ETAS) model. This approximate Bayesian method uses sequential neural posterior estimation (SNPE) to learn posterior distributions from simulations, rather than typical MCMC sampling using the likelihood. On synthetic earthquake catalogs, SB-ETAS provides better coverage of ETAS posterior distributions compared with inlabru. Furthermore, we demonstrate that using a simulation based procedure for inference improves the scalability from O(n^2) to O(n log n). This makes it feasible to fit very large earthquake catalogs, such as one for Southern California dating back to 1981. SB-ETAS can find Bayesian estimates of ETAS parameters for this catalog in less than 10 h on a standard laptop, a task that would have taken over 2 weeks using MCMC. Beyond the standard ETAS model, this simulation based framework allows earthquake modellers to define and infer parameters for much more complex models by removing the need to define a likelihood function.

Cited: 0
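The branching (cluster) structure of ETAS is what makes large catalogs cheap to simulate: each background event and each aftershock independently spawns a Poisson number of offspring. The temporal-only sketch below illustrates that idea; parameter names and values, and the Knuth Poisson sampler, are my illustrative choices, and the paper's spatial components and SNPE inference are not reproduced:

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's Poisson sampler; adequate for the small means used here."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod >= limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_etas(mu, T, k0, alpha, c, p, beta, m0, seed=0):
    """Temporal ETAS catalog via the branching representation: background
    events arrive as a Poisson process; each event of magnitude m spawns
    Poisson(k0 * exp(alpha * (m - m0))) aftershocks with Omori-law delays."""
    rng = random.Random(seed)
    todo = []
    t = rng.expovariate(mu)
    while t < T:                      # homogeneous Poisson background on [0, T]
        todo.append((t, m0 + rng.expovariate(beta)))  # Gutenberg-Richter magnitudes
        t += rng.expovariate(mu)
    catalog = []
    while todo:
        t, m = todo.pop()
        catalog.append((t, m))
        for _ in range(sample_poisson(k0 * math.exp(alpha * (m - m0)), rng)):
            # Omori-law delay by inverse-CDF sampling of (dt + c)^(-p), p > 1
            u = rng.random()
            dt = c * ((1.0 - u) ** (-1.0 / (p - 1.0)) - 1.0)
            if t + dt < T:
                todo.append((t + dt, m0 + rng.expovariate(beta)))
    catalog.sort()                    # events as (time, magnitude), time-ordered
    return catalog

# Subcritical setting: branching ratio k0 * beta / (beta - alpha) ~ 0.88 < 1,
# so the simulated sequence has finite expected size.
catalog = simulate_etas(mu=1.0, T=100.0, k0=0.5, alpha=1.0,
                        c=0.01, p=1.2, beta=math.log(10), m0=3.0)
```

Each event is visited once and spawns its offspring directly, so simulation cost grows with the catalog size rather than quadratically, which is the property a simulation-based inference procedure like SB-ETAS relies on.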
Journal
Statistics and Computing