Pub Date: 2024-03-11, DOI: 10.1007/s11222-024-10402-y
Alessandro Celani, Paolo Pagnottoni, Galin Jones
A Bayesian method is proposed for variable selection in high-dimensional matrix autoregressive models, which reflects and exploits the original matrix structure of the data to (a) reduce dimensionality and (b) foster interpretability of multidimensional relationship structures. A compact form of the model is derived which facilitates the estimation procedure, and two computational methods for the estimation are proposed: a Markov chain Monte Carlo algorithm and a scalable Bayesian EM algorithm. Being based on the spike-and-slab framework for fast posterior mode identification, the latter enables Bayesian data analysis of matrix-valued time series at large scales. The theoretical properties, comparative performance, and computational efficiency of the proposed model are investigated through simulated examples and an application to a panel of country economic indicators.
Alessandro Celani, Paolo Pagnottoni, Galin Jones. "Bayesian variable selection for matrix autoregressive models." Statistics and Computing, 2024-03-11. https://doi.org/10.1007/s11222-024-10402-y
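The dimensionality reduction mentioned in the abstract above can be illustrated with a small sketch. It assumes the commonly used first-order matrix autoregression X_t = A X_{t-1} B^T + E_t; the abstract does not state the exact model or its compact form, so the matrices A, B, X and all helper names below are hypothetical. The code checks the standard identity vec(A X B^T) = (B kron A) vec(X) and compares parameter counts with an unrestricted vector autoregression on vec(X).

```python
# Sketch (assumed model form): X_t = A X_{t-1} B^T + E_t, with X_t an m x n matrix.
# Check numerically that vec(A X B^T) == (B kron A) vec(X), with column-major vec.

def matmul(P, Q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(P):
    return [list(col) for col in zip(*P)]

def kron(P, Q):
    """Kronecker product: block (i, j) of the result is P[i][j] * Q."""
    return [[p * q for p in prow for q in qrow] for prow in P for qrow in Q]

def vec(P):
    """Stack the columns of P into a single list."""
    return [x for col in zip(*P) for x in col]

A = [[0.5, 0.1], [0.0, 0.3]]                              # m x m, acts on rows
B = [[0.2, 0.0, 0.1], [0.0, 0.4, 0.0], [0.3, 0.0, 0.6]]   # n x n, acts on columns
X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]                    # m x n observation

lhs = vec(matmul(matmul(A, X), transpose(B)))             # matrix form
rhs = [sum(k * v for k, v in zip(row, vec(X))) for row in kron(B, A)]  # vector form

m, n = len(A), len(B)
n_mar_params = m * m + n * n      # coefficients in A and B
n_var_params = (m * n) ** 2       # coefficients in an unrestricted VAR(1) on vec(X)
```

With m = 2 and n = 3 the matrix form needs 13 coefficients against 36 for the unrestricted vector form, and the gap widens quickly as m and n grow.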
Pub Date: 2024-03-09, DOI: 10.1007/s11222-024-10411-x
Hanâ Lbath, Alexander Petersen, Sophie Achard
Data produced by resting-state functional Magnetic Resonance Imaging are widely used to infer brain functional connectivity networks. Such networks correlate neural signals to connect brain regions, which consist of groups of dependent voxels. Previous work has focused on aggregating data across voxels within predefined regions. However, the presence of within-region correlations has noticeable impacts on inter-regional correlation detection, and thus on edge identification. To mitigate these impacts, we propose to leverage techniques from the large-scale correlation screening literature, and derive simple and practical characterizations of the mean number of correlation discoveries that flexibly incorporate intra-regional dependence structures. A connectivity network inference framework is then presented. First, inter-regional correlation distributions are estimated. Then, correlation thresholds that can be tailored to one’s application are constructed for each edge. Finally, the proposed framework is implemented on synthetic and real-world datasets. This novel approach for handling arbitrary intra-regional correlation is shown to limit false positives while improving true positive rates.
Hanâ Lbath, Alexander Petersen, Sophie Achard. "Large-scale correlation screening under dependence for brain functional connectivity network inference." Statistics and Computing, 2024-03-09. https://doi.org/10.1007/s11222-024-10411-x
Pub Date: 2024-03-08, DOI: 10.1007/s11222-024-10408-6
Ruiting Hao, Xiaorong Yang
The quantile regression neural network (QRNN) model has received increasing attention in various fields as a way to provide conditional quantiles of responses. However, almost all the available literature on QRNN is devoted to the case of one-dimensional responses, which is a serious limitation when the quantiles of multivariate responses are of interest. To deal with this issue, we propose a novel multiple-output quantile regression neural network (MOQRNN) model to estimate the conditional quantiles of multivariate data. The MOQRNN model is constructed in three steps. Step 1 acquires the conditional distribution of multivariate responses by a nonparametric method. Step 2 obtains the optimal transport map that pushes the spherical uniform distribution forward to the conditional distribution through the input convex neural network (ICNN). Step 3 provides the conditional quantile contours and regions by the ICNN-based optimal transport map. In both simulation studies and a real data application, comparative analyses with the existing method demonstrate that the proposed MOQRNN model yields quantile contours that are not only smoother but also closer to their theoretical counterparts.
Ruiting Hao, Xiaorong Yang. "Multiple-output quantile regression neural network." Statistics and Computing, 2024-03-08. https://doi.org/10.1007/s11222-024-10408-6
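For context on the one-dimensional case that the MOQRNN abstract above sets out to generalise, the following sketch shows the check (pinball) loss that univariate quantile regression models, including QRNN-type networks, are typically trained to minimise. It is not the optimal-transport construction of the paper; the data and the function names are hypothetical. Minimising the loss over a constant prediction recovers an empirical quantile.

```python
def pinball_loss(y, q, tau):
    """Average check (pinball) loss of predicting the constant q at level tau."""
    total = 0.0
    for yi in y:
        u = yi - q
        # Under-prediction is weighted by tau, over-prediction by (1 - tau).
        total += tau * u if u >= 0 else (tau - 1.0) * u
    return total / len(y)

def empirical_quantile(y, tau):
    """Sample value that minimises the pinball loss at level tau."""
    return min(y, key=lambda q: pinball_loss(y, q, tau))

y = [1.0, 2.0, 3.0, 4.0, 10.0]
median = empirical_quantile(y, 0.5)   # the loss at tau = 0.5 is half the mean absolute error
upper = empirical_quantile(y, 0.9)    # a high level pulls the minimiser to the upper tail
```

A network replaces the constant q with a function of the covariates, which is where the restriction to scalar responses arises: the check loss relies on the ordering of the real line.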
Pub Date: 2024-03-05, DOI: 10.1007/s11222-024-10398-5
Abstract
Recent studies have emphasized the connection between machine learning feature importance measures and total order sensitivity indices (total effects, henceforth). Feature correlations and the need to avoid unrestricted permutations make the estimation of these indices challenging. Additionally, there is no established theory or approach for non-Cartesian domains. We propose four alternative strategies for computing total effects that account for both dependent and constrained features. Our first approach involves a generalized winding stairs design combined with the Knothe-Rosenblatt transformation. This approach, while applicable to a wide family of input dependencies, becomes impractical when inputs are physically constrained. Our second approach is a U-statistic that combines the Jansen estimator with a weighting factor. The U-statistic framework allows the derivation of a central limit theorem for this estimator. However, this design is computationally intensive. Our third approach uses derangements to significantly reduce the computational burden. We prove consistency and central limit theorems for these estimators as well. Our fourth approach is based on a nearest-neighbour intuition and further reduces the computational burden. We test these estimators through a series of increasingly complex computational experiments with features constrained on compact and connected domains (circle, simplex) and on non-compact and non-connected domains (Sierpinski gaskets); we then provide comparisons with machine learning approaches and conclude with an application to a realistic simulator.
"Total effects with constrained features." Statistics and Computing, 2024-03-05. https://doi.org/10.1007/s11222-024-10398-5
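As background for the Jansen estimator named in the abstract above, here is a minimal sketch of its classical form for independent inputs on the unit hypercube. This is the unconstrained baseline, not the weighted U-statistic, derangement, or nearest-neighbour designs of the paper; the test function, sample size, and all names are hypothetical choices for illustration.

```python
import random

def jansen_total_effects(f, dim, n, rng):
    """Classical Jansen estimator of total effects for independent U(0, 1) inputs."""
    A = [[rng.random() for _ in range(dim)] for _ in range(n)]
    B = [[rng.random() for _ in range(dim)] for _ in range(n)]
    fA = [f(x) for x in A]
    mean = sum(fA) / n
    var = sum((v - mean) ** 2 for v in fA) / (n - 1)   # output variance
    effects = []
    for i in range(dim):
        sq = 0.0
        for a, b, fa in zip(A, B, fA):
            x = list(a)
            x[i] = b[i]                  # resample coordinate i only
            sq += (fa - f(x)) ** 2
        effects.append(sq / (2 * n * var))
    return effects

rng = random.Random(0)
effects = jansen_total_effects(lambda x: x[0] + 2.0 * x[1], dim=2, n=20000, rng=rng)
```

For this additive test function the exact total effects are 0.2 and 0.8, so the two estimates should land close to those values. Replacing a single coordinate with an independent draw is exactly the step that breaks down when features are dependent or confined to a constrained domain, since the perturbed point may leave the domain.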
Pub Date: 2024-03-05, DOI: 10.1007/s11222-024-10397-6
Thomas Lux
In this article, an algorithm for maximum-likelihood estimation of regime-switching diffusions is proposed. The proposed approach uses a Fourier transform to numerically solve the system of Fokker–Planck or forward Kolmogorov equations for the temporal evolution of the state densities. Monte Carlo simulations confirm the theoretically expected consistency of this approach for moderate sample sizes and its practical feasibility for certain regime-switching diffusions used in economics and biology with moderate numbers of states and parameters. An application to animal movement data serves as an illustration of the proposed algorithm.
Thomas Lux. "Estimation of regime-switching diffusions via Fourier transforms." Statistics and Computing, 2024-03-05. https://doi.org/10.1007/s11222-024-10397-6
The Hilbert-Schmidt Independence Criterion (HSIC) has recently been introduced to the field of single-index models to estimate the directions. Compared with other well-established methods, the HSIC-based method requires relatively weak conditions. However, its performance has not yet been studied in the prevalent high-dimensional scenarios, where the number of covariates can be much larger than the sample size. In this article, based on HSIC, we propose to estimate the possibly sparse directions in high-dimensional single-index models through a parameter reformulation. Our approach estimates the subspace of the direction directly and performs variable selection simultaneously. Due to the non-convexity of the objective function and the complexity of the constraints, a majorize-minimize algorithm together with the linearized alternating direction method of multipliers is developed to solve the optimization problem. Since it does not involve the inverse of the covariance matrix, the algorithm can naturally handle large-p, small-n scenarios. Through extensive simulation studies and a real data analysis, we show that our proposal is efficient and effective in high-dimensional settings. The Matlab codes for this method are available online.
Xin Chen, Chang Deng, Shuaida He, Runxiong Wu, Jia Zhang. "High-dimensional sparse single–index regression via Hilbert–Schmidt independence criterion." Statistics and Computing, 2024-02-27. https://doi.org/10.1007/s11222-024-10399-4
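To make the dependence measure behind the abstract above concrete, the sketch below computes the standard biased empirical HSIC, tr(K H L H) / n^2, for two scalar samples with Gaussian kernels. The bandwidth, the sample size, the 1/n^2 normalisation (some references use 1/(n-1)^2), and all function names are assumptions for illustration; this is not the paper's sparse direction estimator or its optimization algorithm.

```python
import math
import random

def gaussian_gram(x, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel on a list of scalars."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2)) for b in x] for a in x]

def center(K):
    """Double-centre a symmetric Gram matrix, i.e. compute H K H."""
    n = len(K)
    row = [sum(r) / n for r in K]          # row means (equal to column means by symmetry)
    total = sum(row) / n
    return [[K[i][j] - row[i] - row[j] + total for j in range(n)] for i in range(n)]

def hsic(x, y, bandwidth=1.0):
    """Biased empirical HSIC: tr(K H L H) / n^2."""
    n = len(x)
    Kc = center(gaussian_gram(x, bandwidth))
    L = gaussian_gram(y, bandwidth)
    # tr(Kc L) equals the sum of elementwise products because L is symmetric.
    return sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n)) / n ** 2

rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(100)]
y_dep = [xi + 0.1 * rng.gauss(0.0, 1.0) for xi in x]   # strongly dependent on x
y_ind = [rng.gauss(0.0, 1.0) for _ in range(100)]      # independent of x

hsic_dep = hsic(x, y_dep)
hsic_ind = hsic(x, y_ind)
```

The statistic is near zero for the independent pair and clearly larger for the dependent one. In a single-index setting, x would be replaced by a projection of the covariates onto a candidate direction, and the direction would be chosen to make this quantity large.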
Pub Date: 2024-02-27, DOI: 10.1007/s11222-024-10392-x
Alex Ziyu Jiang, Abel Rodriguez
Multivariate Hawkes Processes (MHPs) are a class of point processes that can account for complex temporal dynamics among event sequences. In this work, we study the accuracy and computational efficiency of three classes of algorithms which, while widely used in the context of Bayesian inference, have rarely been applied in the context of MHPs: stochastic gradient expectation-maximization, stochastic gradient variational inference and stochastic gradient Langevin Monte Carlo. An important contribution of this paper is a novel approximation to the likelihood function that allows us to retain the computational advantages associated with conjugate settings while reducing approximation errors associated with the boundary effects. The comparisons are based on various simulated scenarios as well as an application to the study of risk dynamics in the Standard & Poor’s 500 intraday index prices among its 11 sectors.
Alex Ziyu Jiang, Abel Rodriguez. "Improvements on scalable stochastic Bayesian inference methods for multivariate Hawkes process." Statistics and Computing, 2024-02-27. https://doi.org/10.1007/s11222-024-10392-x
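The way a multivariate Hawkes process couples event sequences, as described in the abstract above, can be sketched through its conditional intensity. The exponential excitation kernel used here is an assumption (the abstract does not name a kernel), and the parameter values and names are hypothetical.

```python
import math

def intensity(t, events, mu, alpha, beta):
    """Conditional intensity of each component at time t.

    events[j] : event times observed for component j
    mu[k]     : baseline rate of component k
    alpha[k][j] : expected number of extra events in k triggered by one event in j
    beta      : decay rate of the (assumed) exponential kernel
    """
    lam = list(mu)
    for k in range(len(mu)):
        for j in range(len(mu)):
            for s in events[j]:
                if s < t:   # only past events excite the process
                    lam[k] += alpha[k][j] * beta * math.exp(-beta * (t - s))
    return lam

events = [[1.0], []]                  # one event in component 0, none in component 1
mu = [0.5, 0.2]
alpha = [[0.3, 0.0], [0.6, 0.1]]
beta = 2.0

before = intensity(0.5, events, mu, alpha, beta)   # no past events: baseline only
after = intensity(1.5, events, mu, alpha, beta)    # both components are excited
```

Evaluating the likelihood requires this sum over past events at every event time, which is the cost that the stochastic gradient methods compared in the paper aim to reduce by working on subsamples of the data.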
Pub Date: 2024-02-23, DOI: 10.1007/s11222-024-10400-0
Yuki Takazawa, Tomonari Sei
Phylogenetic trees are key data objects in biology, and the method of phylogenetic reconstruction has been highly developed. The space of phylogenetic trees is a nonpositively curved metric space. Statistical methods that exploit this property to analyze samples of trees on this space are now being developed. Meanwhile, in Euclidean space, the log-concave maximum likelihood method has emerged as a new nonparametric method for probability density estimation. In this paper, we derive a sufficient condition for the existence and uniqueness of the log-concave maximum likelihood estimator on tree space. We also propose an estimation algorithm for one and two dimensions. Since various factors affect the inferred trees, it is difficult to specify the distribution of a sample of trees. The class of log-concave densities is nonparametric, and yet the estimation can be conducted by the maximum likelihood method without selecting hyperparameters. We numerically compare the estimation performance with that of a previously developed kernel density estimator. In our examples where the true density is log-concave, we demonstrate that our estimator has a smaller integrated squared error when the sample size is large. We also conduct numerical experiments of clustering using the Expectation-Maximization algorithm and compare the results with k-means++ clustering using the Fréchet mean.
Yuki Takazawa, Tomonari Sei. "Maximum likelihood estimation of log-concave densities on tree space." Statistics and Computing, 2024-02-23. https://doi.org/10.1007/s11222-024-10400-0
Pub Date: 2024-02-22, DOI: 10.1007/s11222-024-10388-7
Yannis G. Yatracos
Bootstrap and Jackknife estimates, \(T_{n,B}^*\) and \(T_{n,J}\), respectively, of a population parameter \(\theta\) are both used in statistical computations; \(n\) is the sample size and \(B\) is the number of Bootstrap samples. For any \(n_0\) and \(B_0\), Bootstrap samples do not add new information about \(\theta\), being observations from the original sample, and when \(B_0<\infty\), \(T_{n_0,B_0}^*\) also includes resampling variability, an additional source of uncertainty not affecting \(T_{n_0,J}\). These facts are neglected in theoretical papers with results for the utopian \(T_{n,\infty}^*\) that do not hold for \(B<\infty\). The consequence is that \(T_{n_0,B_0}^*\) is expected to have larger mean squared error (MSE) than \(T_{n_0,J}\); that is, \(T_{n_0,B_0}^*\) is inadmissible. The amount of inadmissibility may be very large when population parameters, e.g. the variance, are unbounded and/or with big data. A palliative remedy is to increase \(B\), the larger the better, but the ordering of the MSEs remains unchanged for \(B<\infty\). This is confirmed theoretically when \(\theta\) is the mean of a population, and is observed in the estimated total MSE for linear regression coefficients. In the latter, the chance that the estimated total MSE with \(T_{n,B}^*\) improves on that with \(T_{n,J}\) decreases to 0 as \(B\) increases.
Yannis G. Yatracos. "Do applied statisticians prefer more randomness or less? Bootstrap or Jackknife?" Statistics and Computing, 2024-02-22. https://doi.org/10.1007/s11222-024-10388-7
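The resampling variability discussed in the abstract above can be seen in a small simulation for the population mean. The sketch assumes that the Jackknife estimate is the average of the leave-one-out means (which equals the sample mean) and that the Bootstrap estimate is the average of B resampled means; the sample size, B, the number of replications, and all names are hypothetical, and the paper's own definitions may differ in detail.

```python
import random

def jackknife_mean(sample):
    """Average of the leave-one-out means (equals the sample mean)."""
    n = len(sample)
    total = sum(sample)
    loo = [(total - x) / (n - 1) for x in sample]
    return sum(loo) / n

def bootstrap_mean(sample, B, rng):
    """Average of the means of B resamples drawn with replacement."""
    n = len(sample)
    means = [sum(rng.choices(sample, k=n)) / n for _ in range(B)]
    return sum(means) / B

def compare_mse(n=20, B=5, reps=4000, seed=0):
    """Monte Carlo MSE of both estimates when the true mean is 0."""
    rng = random.Random(seed)
    se_boot = se_jack = 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        se_boot += bootstrap_mean(sample, B, rng) ** 2
        se_jack += jackknife_mean(sample) ** 2
    return se_boot / reps, se_jack / reps

mse_boot, mse_jack = compare_mse()
```

In this setting the Jackknife MSE is close to the variance of the sample mean, 1/n, while the Bootstrap MSE carries an extra term of roughly 1/(nB) from resampling, so it is larger for any finite B and approaches the Jackknife value only as B grows.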
Pub Date: 2024-02-20, DOI: 10.1007/s11222-024-10395-8
Nicholas Kissel, Lucas Mentch
Most scientific publications follow the familiar recipe of (i) obtain data, (ii) fit a model, and (iii) comment on the scientific relevance of the effects of particular covariates in that model. This approach, however, ignores the fact that there may exist a multitude of similarly accurate models in which the implied effects of individual covariates may be vastly different. The problem of finding an entire collection of plausible models has also received relatively little attention in the statistics community, with nearly all of the proposed methodologies being narrowly tailored to a particular model class and/or requiring an exhaustive search over all possible models, making them largely infeasible in the current big data era. This work develops the idea of forward stability and proposes a novel, computationally efficient approach to finding collections of accurate models, which we refer to as model path selection (MPS). MPS builds up a plausible model collection via a forward selection approach and is entirely agnostic to the model class and loss function employed. The resulting model collection can be displayed in a simple and intuitive graphical fashion, easily allowing practitioners to visualize whether some covariates can be swapped for others with minimal loss.
Nicholas Kissel, Lucas Mentch. "Forward stability and model path selection." Statistics and Computing, 2024-02-20. https://doi.org/10.1007/s11222-024-10395-8