Jittering and clustering: strategies for the construction of robust designs
Pub Date : 2024-06-04 | DOI: 10.1007/s11222-024-10436-2
Douglas P. Wiens
We discuss, and give examples of, methods for randomly implementing some minimax robust designs from the literature. Compared with their deterministic counterparts, these designs have the advantage of bounded maximum loss in large and very rich neighbourhoods of the (almost certainly inexact) response model fitted by the experimenter. Their maximum loss rivals that of the theoretically best possible, but not implementable, minimax designs. The procedures are then extended to more general robust designs. For two-dimensional designs we sample from contractions of Voronoi tessellations, generated by selected basis points, which partition the design space. These ideas are then extended to k-dimensional designs for general k.
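The sketch below illustrates the sampling idea in the abstract: candidate points are assigned to the Voronoi cell of their nearest basis point, and a draw from each cell is contracted toward the generating basis point. It is a minimal illustration under assumed names (`jittered_design`, `contraction`), not Wiens' exact construction.

```python
# Illustrative sketch: draw a "jittered" design by sampling from contractions
# of the Voronoi cells of selected basis points. Cell membership is decided by
# nearest-neighbour lookup; this is not the paper's exact procedure.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

def jittered_design(basis, contraction=0.5, n_candidates=20000):
    """One design point per basis point, drawn from its contracted Voronoi cell.

    basis       : (m, k) array of basis points in the unit cube [0, 1]^k
    contraction : factor in (0, 1]; draws are pulled toward the cell's basis point
    """
    tree = cKDTree(basis)
    m, k = basis.shape
    design = np.empty_like(basis)
    for j in range(m):
        while True:
            # rejection sampling: uniform candidates, keep those landing in cell j
            cand = rng.uniform(0.0, 1.0, size=(n_candidates, k))
            _, owner = tree.query(cand)
            in_cell = cand[owner == j]
            if len(in_cell) > 0:
                x = in_cell[rng.integers(len(in_cell))]
                break
        # contract the sampled point toward the generating basis point
        design[j] = basis[j] + contraction * (x - basis[j])
    return design

basis = rng.uniform(size=(8, 2))          # selected basis points in [0, 1]^2
print(jittered_design(basis, contraction=0.7))
```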
{"title":"Jittering and clustering: strategies for the construction of robust designs","authors":"Douglas P. Wiens","doi":"10.1007/s11222-024-10436-2","DOIUrl":"https://doi.org/10.1007/s11222-024-10436-2","url":null,"abstract":"<p>We discuss, and give examples of, methods for randomly implementing some minimax robust designs from the literature. These have the advantage, over their deterministic counterparts, of having bounded maximum loss in large and very rich neighbourhoods of the, almost certainly inexact, response model fitted by the experimenter. Their maximum loss rivals that of the theoretically best possible, but not implementable, minimax designs. The procedures are then extended to more general robust designs. For two-dimensional designs we sample from contractions of Voronoi tessellations, generated by selected basis points, which partition the design space. These ideas are then extended to <i>k</i>-dimensional designs for general <i>k</i>.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"418 1","pages":""},"PeriodicalIF":2.2,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141259028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Testing the goodness-of-fit of the stable distributions with applications to German stock index data and Bitcoin cryptocurrency data
Pub Date : 2024-06-03 | DOI: 10.1007/s11222-024-10441-5
Ruhul Ali Khan, Ayan Pal, Debasis Kundu
Outlier-prone data sets are of immense interest in diverse areas including economics, finance, statistical physics, signal processing, telecommunications and so on. Stable laws (also known as $\alpha$-stable laws) are often found to be useful in modeling outlier-prone data containing important information and exhibiting heavy-tailed phenomena. In this article, an asymptotic distribution of an unbiased and consistent estimator of the stability index $\alpha$ is proposed based on the jackknife empirical likelihood (JEL) and adjusted JEL methods. Next, using the sum-preserving property of stable random variables and exploiting U-statistic theory, we develop a goodness-of-fit test procedure for $\alpha$-stable distributions where the stability index $\alpha$ is specified. Extensive simulation studies are performed in order to assess the finite sample performance of the proposed test. Finally, two appealing real life data examples related to the daily closing prices of the German Stock Index and the Bitcoin cryptocurrency are analysed in detail for illustration purposes.
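The sum-preserving property the test exploits can be checked empirically: if the data are $\alpha$-stable, block sums rescaled by $m^{-1/\alpha}$ follow the same law as the original observations. The snippet below demonstrates that property with a naive two-sample Kolmogorov-Smirnov comparison; it is not the authors' JEL/U-statistic procedure, only the idea behind it.

```python
# Naive check of the stability property: rescaled block sums of alpha-stable
# draws should be indistinguishable from the original sample. This is NOT the
# paper's JEL/U-statistic goodness-of-fit test, just the underlying identity.
import numpy as np
from scipy.stats import levy_stable, ks_2samp

rng = np.random.default_rng(1)
alpha0 = 1.5                                  # hypothesised stability index
x = levy_stable.rvs(alpha0, beta=0.0, size=20000, random_state=rng)

m = 4
sums = x[: (len(x) // m) * m].reshape(-1, m).sum(axis=1)
rescaled = sums / m ** (1.0 / alpha0)         # stability: same law as X_1

stat, pval = ks_2samp(x, rescaled)
print(f"KS statistic = {stat:.4f}, p-value = {pval:.3f}")  # large p: consistent
```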
{"title":"Testing the goodness-of-fit of the stable distributions with applications to German stock index data and Bitcoin cryptocurrency data","authors":"Ruhul Ali Khan, Ayan Pal, Debasis Kundu","doi":"10.1007/s11222-024-10441-5","DOIUrl":"https://doi.org/10.1007/s11222-024-10441-5","url":null,"abstract":"<p>Outlier-prone data sets are of immense interest in diverse areas including economics, finance, statistical physics, signal processing, telecommunications and so on. Stable laws (also known as <span>(alpha )</span>- stable laws) are often found to be useful in modeling outlier-prone data containing important information and exhibiting heavy tailed phenomenon. In this article, an asymptotic distribution of a unbiased and consistent estimator of the stability index <span>(alpha )</span> is proposed based on jackknife empirical likelihood (JEL) and adjusted JEL method. Next, using the sum-preserving property of stable random variables and exploiting <i>U</i>-statistic theory, we have developed a goodness-of-fit test procedure for <span>(alpha )</span>-stable distributions where the stability index <span>(alpha )</span> is specified. Extensive simulation studies are performed in order to assess the finite sample performance of the proposed test. Finally, two appealing real life data examples related to the daily closing price of German Stock Index and Bitcoin cryptocurrency are analysed in detail for illustration purposes.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"75 1","pages":""},"PeriodicalIF":2.2,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141259103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Insufficient Gibbs sampling
Pub Date : 2024-05-31 | DOI: 10.1007/s11222-024-10423-7
Antoine Luciano, Christian P. Robert, Robin J. Ryder
In some applied scenarios, the availability of complete data is restricted, often due to privacy concerns; only aggregated, robust and inefficient statistics derived from the data are made accessible. These robust statistics are not sufficient, but they demonstrate reduced sensitivity to outliers and offer enhanced data protection due to their higher breakdown point. We consider a parametric framework and propose a method to sample from the posterior distribution of parameters conditioned on various robust and inefficient statistics: specifically, the pairs (median, MAD) or (median, IQR), or a collection of quantiles. Our approach leverages a Gibbs sampler and simulates latent augmented data, which facilitates simulation from the posterior distribution of parameters belonging to specific families of distributions. A by-product of these samples from the joint posterior distribution of parameters and data given the observed statistics is that we can estimate Bayes factors based on observed statistics via bridge sampling. We validate and outline the limitations of the proposed methods through toy examples and an application to real-world income data.
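As a rough illustration of the data-augmentation idea, the toy sketch below targets a normal model given only an observed (median, MAD) pair: latent data are resampled and then shifted/rescaled to match the observed statistics exactly, after which $(\mu, \sigma^2)$ are drawn from the usual conjugate full conditionals. This is a crude approximation for intuition only, not the exact Gibbs kernel of Luciano, Robert and Ryder.

```python
# Crude sketch of augmented-data sampling for a normal model given only
# (median, MAD). The shift/rescale step is a simplification; the paper's
# kernel conditions on the statistics exactly via tailored latent updates.
import numpy as np

rng = np.random.default_rng(2)
n, med_obs, mad_obs = 200, 1.3, 0.8        # observed summaries (made up)
mu, sigma = 0.0, 1.0                       # initial parameter values

samples = []
for it in range(5000):
    # 1. simulate latent data, then force it to match (median, MAD)
    z = rng.normal(mu, sigma, size=n)
    med = np.median(z)
    z = med_obs + mad_obs * (z - med) / np.median(np.abs(z - med))
    # 2. conjugate-style draws given the augmented data (flat-ish priors)
    sigma2 = 1.0 / rng.gamma((n - 1) / 2.0, 2.0 / np.sum((z - z.mean()) ** 2))
    mu = rng.normal(z.mean(), np.sqrt(sigma2 / n))
    sigma = np.sqrt(sigma2)
    samples.append((mu, sigma))

print(np.mean(samples, axis=0))            # posterior-ish means of (mu, sigma)
```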
{"title":"Insufficient Gibbs sampling","authors":"Antoine Luciano, Christian P. Robert, Robin J. Ryder","doi":"10.1007/s11222-024-10423-7","DOIUrl":"https://doi.org/10.1007/s11222-024-10423-7","url":null,"abstract":"<p>In some applied scenarios, the availability of complete data is restricted, often due to privacy concerns; only aggregated, robust and inefficient statistics derived from the data are made accessible. These robust statistics are not sufficient, but they demonstrate reduced sensitivity to outliers and offer enhanced data protection due to their higher breakdown point. We consider a parametric framework and propose a method to sample from the posterior distribution of parameters conditioned on various robust and inefficient statistics: specifically, the pairs (median, MAD) or (median, IQR), or a collection of quantiles. Our approach leverages a Gibbs sampler and simulates latent augmented data, which facilitates simulation from the posterior distribution of parameters belonging to specific families of distributions. A by-product of these samples from the joint posterior distribution of parameters and data given the observed statistics is that we can estimate Bayes factors based on observed statistics via bridge sampling. We validate and outline the limitations of the proposed methods through toy examples and an application to real-world income data.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"94 1","pages":""},"PeriodicalIF":2.2,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141190162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimization of the generalized covariance estimator in noncausal processes
Pub Date : 2024-05-31 | DOI: 10.1007/s11222-024-10437-1
Gianluca Cubadda, Francesco Giancaterini, Alain Hecq, Joann Jasiak
This paper investigates the performance of routinely used optimization algorithms in application to the Generalized Covariance estimator (GCov) for univariate and multivariate mixed causal and noncausal models. The GCov is a semi-parametric estimator with an objective function based on nonlinear autocovariances to identify causal and noncausal orders. When the number and type of nonlinear autocovariances included in the objective function are insufficient/inadequate, or the error density is too close to the Gaussian, identification issues can arise. These issues result in local minima in the objective function, which correspond to parameter values associated with incorrect causal and noncausal orders. Then, depending on the starting point and the optimization algorithm employed, the algorithm can converge to a local minimum. The paper proposes the Simulated Annealing (SA) optimization algorithm as an alternative to conventional numerical optimization methods. The results demonstrate that SA performs well in its application to mixed causal and noncausal models, successfully eliminating the effects of local minima. The proposed approach is illustrated by an empirical study of a bivariate series of commodity prices.
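The paper's point is easy to reproduce on a toy multimodal objective (standing in for the GCov criterion): a gradient-based local optimizer started at an unlucky point stops at a local minimum, while simulated annealing reaches the global one. The sketch uses scipy's `dual_annealing`; the authors' own SA implementation may differ.

```python
# Toy illustration: local gradient descent gets trapped in a local minimum of
# a multimodal objective, while simulated annealing escapes it.
import numpy as np
from scipy.optimize import minimize, dual_annealing

def objective(theta):
    # Rastrigin-type surface: many local minima, global minimum at theta = 0
    return np.sum(theta ** 2) + 10.0 * np.sum(1.0 - np.cos(2.0 * np.pi * theta))

bounds = [(-5.0, 5.0)] * 2
x0 = np.array([3.7, -2.9])                       # unlucky starting point

local = minimize(objective, x0, method="BFGS")   # gradient-based: gets trapped
annealed = dual_annealing(objective, bounds, seed=3)

print("BFGS from x0:  ", local.x, objective(local.x))
print("dual_annealing:", annealed.x, objective(annealed.x))
```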
{"title":"Optimization of the generalized covariance estimator in noncausal processes","authors":"Gianluca Cubadda, Francesco Giancaterini, Alain Hecq, Joann Jasiak","doi":"10.1007/s11222-024-10437-1","DOIUrl":"https://doi.org/10.1007/s11222-024-10437-1","url":null,"abstract":"<p>This paper investigates the performance of routinely used optimization algorithms in application to the Generalized Covariance estimator (<i>GCov</i>) for univariate and multivariate mixed causal and noncausal models. The <i>GCov</i> is a semi-parametric estimator with an objective function based on nonlinear autocovariances to identify causal and noncausal orders. When the number and type of nonlinear autocovariances included in the objective function are insufficient/inadequate, or the error density is too close to the Gaussian, identification issues can arise. These issues result in local minima in the objective function, which correspond to parameter values associated with incorrect causal and noncausal orders. Then, depending on the starting point and the optimization algorithm employed, the algorithm can converge to a local minimum. The paper proposes the Simulated Annealing (SA) optimization algorithm as an alternative to conventional numerical optimization methods. The results demonstrate that SA performs well in its application to mixed causal and noncausal models, successfully eliminating the effects of local minima. The proposed approach is illustrated by an empirical study of a bivariate series of commodity prices.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"2010 1","pages":""},"PeriodicalIF":2.2,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141190197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A modified EM-type algorithm to estimate semi-parametric mixtures of non-parametric regressions
Pub Date : 2024-05-29 | DOI: 10.1007/s11222-024-10435-3
Sphiwe B. Skhosana, Salomon M. Millard, Frans H. J. Kanfer
Semi-parametric Gaussian mixtures of non-parametric regressions (SPGMNRs) are a flexible extension of Gaussian mixtures of linear regressions (GMLRs). The model assumes that the component regression functions (CRFs) are non-parametric functions of the covariate(s), whereas the component mixing proportions and variances are constants. Unfortunately, the model cannot be reliably estimated using traditional methods. A local-likelihood approach for estimating the CRFs requires that we maximize a set of local-likelihood functions. Using the Expectation-Maximization (EM) algorithm to separately maximize each local-likelihood function may lead to label-switching, because the posterior probabilities calculated at the local E-step are not guaranteed to be aligned. The consequence of this label-switching is wiggly and non-smooth estimates of the CRFs. In this paper, we propose a unified two-stage approach to address label-switching and obtain sensible estimates. In the first stage, we propose a model-based approach to the label-switching problem. We first note that each local-likelihood function is the likelihood function of a Gaussian mixture model (GMM). Next, we reformulate the SPGMNRs model as a mixture of these GMMs. Lastly, using a modified version of the Expectation Conditional Maximization (ECM) algorithm, we estimate the mixture of GMMs. In addition, using the mixing weights of the local GMMs, we can automatically choose the local points at which local-likelihood estimation takes place. In the second stage, we propose one-step backfitting estimates of the parametric and non-parametric terms. The effectiveness of the proposed approach is demonstrated through analyses of simulated and real data.
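The building block the paper reuses, each local likelihood being an ordinary GMM likelihood, is just EM for a Gaussian mixture. A minimal two-component univariate version is sketched below for orientation; the full two-stage SPGMNR procedure (local likelihoods, ECM over the mixture of GMMs, backfitting) is not reproduced.

```python
# Minimal EM for a two-component univariate Gaussian mixture: the GMM
# likelihood that each local-likelihood function in the paper reduces to.
import numpy as np

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])

pi, mu, sd = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for it in range(200):
    # E-step: responsibilities for component 1 (normal constants cancel)
    d0 = (1 - pi) * np.exp(-0.5 * ((x - mu[0]) / sd[0]) ** 2) / sd[0]
    d1 = pi * np.exp(-0.5 * ((x - mu[1]) / sd[1]) ** 2) / sd[1]
    r = d1 / (d0 + d1)
    # M-step: weighted updates of mixing proportion, means, and variances
    pi = r.mean()
    mu = np.array([np.average(x, weights=1 - r), np.average(x, weights=r)])
    sd = np.sqrt([np.average((x - mu[0]) ** 2, weights=1 - r),
                  np.average((x - mu[1]) ** 2, weights=r)])

print(pi, mu, sd)
```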
{"title":"A modified EM-type algorithm to estimate semi-parametric mixtures of non-parametric regressions","authors":"Sphiwe B. Skhosana, Salomon M. Millard, Frans H. J. Kanfer","doi":"10.1007/s11222-024-10435-3","DOIUrl":"https://doi.org/10.1007/s11222-024-10435-3","url":null,"abstract":"<p>Semi-parametric Gaussian mixtures of non-parametric regressions (SPGMNRs) are a flexible extension of Gaussian mixtures of linear regressions (GMLRs). The model assumes that the component regression functions (CRFs) are non-parametric functions of the covariate(s) whereas the component mixing proportions and variances are constants. Unfortunately, the model cannot be reliably estimated using traditional methods. A local-likelihood approach for estimating the CRFs requires that we maximize a set of local-likelihood functions. Using the Expectation-Maximization (EM) algorithm to separately maximize each local-likelihood function may lead to label-switching. This is because the posterior probabilities calculated at the local E-step are not guaranteed to be aligned. The consequence of this label-switching is wiggly and non-smooth estimates of the CRFs. In this paper, we propose a unified approach to address label-switching and obtain sensible estimates. The proposed approach has two stages. In the first stage, we propose a model-based approach to address the label-switching problem. We first note that each local-likelihood function is a likelihood function of a Gaussian mixture model (GMM). Next, we reformulate the SPGMNRs model as a mixture of these GMMs. Lastly, using a modified version of the Expectation Conditional Maximization (ECM) algorithm, we estimate the mixture of GMMs. In addition, using the mixing weights of the local GMMs, we can automatically choose the local points where local-likelihood estimation takes place. In the second stage, we propose one-step backfitting estimates of the parametric and non-parametric terms. The effectiveness of the proposed approach is demonstrated on simulated data and real data analysis.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"62 1","pages":""},"PeriodicalIF":2.2,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141166408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generalized fused Lasso for grouped data in generalized linear models
Pub Date : 2024-05-25 | DOI: 10.1007/s11222-024-10433-5
Mineaki Ohishi
Generalized fused Lasso (GFL) is a powerful method based on adjacent relationships or the network structure of data. It is used in a number of research areas, including clustering, discrete smoothing, and spatio-temporal analysis. When applying GFL, the specific optimization method used is an important issue. In generalized linear models, efficient algorithms based on the coordinate descent method have been developed for trend filtering under the binomial and Poisson distributions. However, to apply GFL to other distributions, such as the negative binomial distribution, which is used to deal with overdispersion in the Poisson distribution, or the gamma and inverse Gaussian distributions, which are used for positive continuous data, an algorithm for each individual distribution must be developed. To unify GFL for distributions in the exponential family, this paper proposes a coordinate descent algorithm for generalized linear models. To illustrate the method, a real data example of spatio-temporal analysis is provided.
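For orientation, the sketch below shows what a coordinate descent update looks like in the simplest setting: the Gaussian fused lasso on a chain, where each coordinate objective is piecewise quadratic and its minimizer can be found by checking the kinks at the neighbours plus each piece's stationary point. This is the Gaussian special case only, not the paper's exponential-family algorithm.

```python
# Coordinate descent for the Gaussian chain fused lasso:
#   minimize 0.5 * sum_i (y_i - t_i)^2 + lam * sum_i |t_i - t_{i+1}|.
# Each coordinate objective is convex piecewise quadratic, so the exact
# coordinate minimizer is among the kinks (neighbour values) and the
# stationary points of the quadratic pieces.
# Caveat: plain coordinate descent can stall on this non-separable penalty;
# practical algorithms add fusion/merge moves (Friedman et al. 2007).
import numpy as np

def fused_lasso_chain(y, lam, n_sweeps=200):
    t = y.astype(float)
    n = len(t)
    for _ in range(n_sweeps):
        for i in range(n):
            kinks = [t[j] for j in (i - 1, i + 1) if 0 <= j < n]
            cands = set(kinks)
            # stationary point of each piece: v = y_i - lam * s, where s is
            # the subgradient sign sum on that piece
            for s in ([-1, 1] if len(kinks) == 1 else [-2, 0, 2]):
                cands.add(y[i] - lam * s)
            def f(v):
                return 0.5 * (y[i] - v) ** 2 + lam * sum(abs(v - k) for k in kinks)
            t[i] = min(cands, key=f)
    return t

rng = np.random.default_rng(5)
y = np.concatenate([np.full(20, 0.0), np.full(20, 2.0)]) + rng.normal(0, 0.3, 40)
print(fused_lasso_chain(y, lam=1.0).round(2))   # recovers the two blocks
```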
{"title":"Generalized fused Lasso for grouped data in generalized linear models","authors":"Mineaki Ohishi","doi":"10.1007/s11222-024-10433-5","DOIUrl":"https://doi.org/10.1007/s11222-024-10433-5","url":null,"abstract":"<p>Generalized fused Lasso (GFL) is a powerful method based on adjacent relationships or the network structure of data. It is used in a number of research areas, including clustering, discrete smoothing, and spatio-temporal analysis. When applying GFL, the specific optimization method used is an important issue. In generalized linear models, efficient algorithms based on the coordinate descent method have been developed for trend filtering under the binomial and Poisson distributions. However, to apply GFL to other distributions, such as the negative binomial distribution, which is used to deal with overdispersion in the Poisson distribution, or the gamma and inverse Gaussian distributions, which are used for positive continuous data, an algorithm for each individual distribution must be developed. To unify GFL for distributions in the exponential family, this paper proposes a coordinate descent algorithm for generalized linear models. To illustrate the method, a real data example of spatio-temporal analysis is provided.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"17 1","pages":""},"PeriodicalIF":2.2,"publicationDate":"2024-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141153778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Type I Tobit Bayesian Additive Regression Trees for censored outcome regression
Pub Date : 2024-05-24 | DOI: 10.1007/s11222-024-10434-4
Eoghan O’Neill
Censoring occurs when an outcome is unobserved beyond some threshold value. Methods that do not account for censoring produce biased predictions of the unobserved outcome. This paper introduces Type I Tobit Bayesian Additive Regression Tree (TOBART-1) models for censored outcomes. Simulation results and real data applications demonstrate that TOBART-1 produces accurate predictions of censored outcomes. TOBART-1 provides posterior intervals for the conditional expectation and other quantities of interest. The error term distribution can have a large impact on the expectation of the censored outcome. Therefore, the error is flexibly modeled as a Dirichlet process mixture of normal distributions. An R package is available at https://github.com/EoghanONeill/TobitBART.
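The step at the heart of Type I Tobit samplers is data augmentation: each censored observation's latent outcome is drawn from a normal distribution truncated to the censored region, and the model is then refit on the augmented data. The toy below shows one such augmentation step with a fixed normal error, purely to show the mechanism; TOBART-1 embeds this inside BART with a Dirichlet-process-mixture error instead.

```python
# One Tobit-I data-augmentation step: replace left-censored outcomes with
# draws from a normal truncated above at the censoring threshold. TOBART-1
# uses this device within BART and a DP-mixture error; this is a toy version.
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(6)
c = 0.0                                   # left-censoring threshold
y_star = rng.normal(0.5, 1.0, size=500)   # latent outcomes
y = np.maximum(y_star, c)                 # observed: values below c recorded as c
censored = y_star < c

mu, sigma = y.mean(), y.std()             # crude current fit (illustrative)
# latent draws for censored cases: N(mu, sigma^2) truncated to (-inf, c]
b = (c - mu) / sigma
y_aug = y.copy()
y_aug[censored] = truncnorm.rvs(-np.inf, b, loc=mu, scale=sigma,
                                size=censored.sum(), random_state=rng)
print(y.mean(), y_aug.mean())             # augmentation restores the left tail
```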
{"title":"Type I Tobit Bayesian Additive Regression Trees for censored outcome regression","authors":"Eoghan O’Neill","doi":"10.1007/s11222-024-10434-4","DOIUrl":"https://doi.org/10.1007/s11222-024-10434-4","url":null,"abstract":"<p>Censoring occurs when an outcome is unobserved beyond some threshold value. Methods that do not account for censoring produce biased predictions of the unobserved outcome. This paper introduces Type I Tobit Bayesian Additive Regression Tree (TOBART-1) models for censored outcomes. Simulation results and real data applications demonstrate that TOBART-1 produces accurate predictions of censored outcomes. TOBART-1 provides posterior intervals for the conditional expectation and other quantities of interest. The error term distribution can have a large impact on the expectation of the censored outcome. Therefore, the error is flexibly modeled as a Dirichlet process mixture of normal distributions. An R package is available at https://github.com/EoghanONeill/TobitBART.\u0000</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"46 1","pages":""},"PeriodicalIF":2.2,"publicationDate":"2024-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141150197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fused lasso nearly-isotonic signal approximation in general dimensions
Pub Date : 2024-05-22 | DOI: 10.1007/s11222-024-10432-6
Vladimir Pastukhov
In this paper, we introduce and study fused lasso nearly-isotonic signal approximation, which is a combination of fused lasso and generalized nearly-isotonic regression. We show how these three estimators relate to each other and derive the solution to the general problem. Our estimator is computationally feasible and provides a trade-off between monotonicity, block sparsity, and goodness-of-fit. Next, we prove that fusion and near-isotonisation in the one-dimensional case can be applied interchangeably, and that this step-wise procedure gives the solution to the original optimization problem. This property of the estimator is very important, because it provides a direct way to construct a path solution when one of the penalization parameters is fixed. We also derive an unbiased estimator of the degrees of freedom of the estimator.
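In the one-dimensional case the combined criterion the abstract describes can be written down and solved directly as a convex program, which is useful as a reference when checking the step-wise fusion/near-isotonisation property. A sketch with cvxpy (assumed penalty weights `lam_f`, `lam_i`):

```python
# Direct cvxpy formulation of the 1-D fused lasso nearly-isotonic criterion:
#   min_b 0.5 * ||y - b||^2 + lam_f * sum |b_{i+1} - b_i|
#                          + lam_i * sum max(b_i - b_{i+1}, 0)
# (the second penalty charges only for violations of monotonicity).
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(7)
y = np.sort(rng.normal(size=50)) + rng.normal(0, 0.2, 50)  # noisy increasing signal

b = cp.Variable(50)
lam_f, lam_i = 0.1, 1.0
obj = (0.5 * cp.sum_squares(y - b)
       + lam_f * cp.norm1(cp.diff(b))          # fusion penalty
       + lam_i * cp.sum(cp.pos(-cp.diff(b))))  # nearly-isotonic penalty
cp.Problem(cp.Minimize(obj)).solve()
print(np.round(b.value, 2))
```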
{"title":"Fused lasso nearly-isotonic signal approximation in general dimensions","authors":"Vladimir Pastukhov","doi":"10.1007/s11222-024-10432-6","DOIUrl":"https://doi.org/10.1007/s11222-024-10432-6","url":null,"abstract":"<p>In this paper, we introduce and study fused lasso nearly-isotonic signal approximation, which is a combination of fused lasso and generalized nearly-isotonic regression. We show how these three estimators relate to each other and derive solution to a general problem. Our estimator is computationally feasible and provides a trade-off between monotonicity, block sparsity, and goodness-of-fit. Next, we prove that fusion and near-isotonisation in a one-dimensional case can be applied interchangably, and this step-wise procedure gives the solution to the original optimization problem. This property of the estimator is very important, because it provides a direct way to construct a path solution when one of the penalization parameters is fixed. Also, we derive an unbiased estimator of degrees of freedom of the estimator.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"48 1","pages":""},"PeriodicalIF":2.2,"publicationDate":"2024-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141150213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bayesian cross-validation by parallel Markov chain Monte Carlo
Pub Date : 2024-05-21 | DOI: 10.1007/s11222-024-10404-w
Alex Cooper, Aki Vehtari, Catherine Forbes, Dan Simpson, Lauren Kennedy
Cross-validation (CV) is a general method for predictive assessment and model selection, applicable to a wide range of Bayesian models. Naive or 'brute force' CV approaches are often too computationally costly for interactive modeling workflows, especially when inference relies on Markov chain Monte Carlo (MCMC). We propose overcoming this limitation using massively parallel MCMC. Using accelerator hardware such as graphics processing units (GPUs), our approach can be about as fast (in wall clock time) as a single full-data model fit. Parallel CV is flexible because it can easily exploit a wide range of data partitioning schemes, such as those designed for non-exchangeable data. It can also accommodate a range of scoring rules. We propose MCMC diagnostics, including a summary of MCMC mixing based on the popular potential scale reduction factor ($\widehat{R}$) and MCMC effective sample size ($\widehat{\mathrm{ESS}}$) measures. We also describe a method for determining whether an $\widehat{R}$ diagnostic indicates approximate stationarity of the chains, which may be of more general interest for applications beyond parallel CV. Finally, we show that parallel CV and its diagnostics can be implemented with online algorithms, allowing parallel CV to scale up to very large blocking designs on memory-constrained computing accelerators.
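The $\widehat{R}$ diagnostic mentioned above has a standard between/within-chain variance form, computable in a few lines; the paper's online variant for massively parallel CV is more elaborate, but the basic quantity is this:

```python
# Standard potential scale reduction factor R-hat from parallel chains
# (between/within-variance form; not the paper's online implementation).
import numpy as np

def rhat(chains):
    """chains: (n_chains, n_draws) array of one scalar quantity."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)            # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_plus = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(8)
good = rng.normal(size=(4, 1000))              # four well-mixed chains
bad = good + np.array([[0.0], [0.0], [0.0], [3.0]])  # one chain off-target
print(rhat(good), rhat(bad))                   # ~1.00 vs. clearly > 1.01
```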
{"title":"Bayesian cross-validation by parallel Markov chain Monte Carlo","authors":"Alex Cooper, Aki Vehtari, Catherine Forbes, Dan Simpson, Lauren Kennedy","doi":"10.1007/s11222-024-10404-w","DOIUrl":"https://doi.org/10.1007/s11222-024-10404-w","url":null,"abstract":"<p>Brute force cross-validation (CV) is a method for predictive assessment and model selection that is general and applicable to a wide range of Bayesian models. Naive or ‘brute force’ CV approaches are often too computationally costly for interactive modeling workflows, especially when inference relies on Markov chain Monte Carlo (MCMC). We propose overcoming this limitation using massively parallel MCMC. Using accelerator hardware such as graphics processor units, our approach can be about as fast (in wall clock time) as a single full-data model fit. Parallel CV is flexible because it can easily exploit a wide range data partitioning schemes, such as those designed for non-exchangeable data. It can also accommodate a range of scoring rules. We propose MCMC diagnostics, including a summary of MCMC mixing based on the popular potential scale reduction factor (<span>(widehat{textrm{R}})</span>) and MCMC effective sample size (<span>(widehat{textrm{ESS}})</span>) measures. We also describe a method for determining whether an <span>(widehat{textrm{R}})</span> diagnostic indicates approximate stationarity of the chains, that may be of more general interest for applications beyond parallel CV. Finally, we show that parallel CV and its diagnostics can be implemented with online algorithms, allowing parallel CV to scale up to very large blocking designs on memory-constrained computing accelerators.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"60 1","pages":""},"PeriodicalIF":2.2,"publicationDate":"2024-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141150196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spike and slab Bayesian sparse principal component analysis
Pub Date : 2024-05-13 | DOI: 10.1007/s11222-024-10430-8
Yu-Chien Bo Ning, Ning Ning
Sparse principal component analysis (SPCA) is a popular tool for dimensionality reduction in high-dimensional data. However, there is still a lack of theoretically justified Bayesian SPCA methods that scale well computationally. One of the major challenges in Bayesian SPCA is selecting an appropriate prior for the loadings matrix, considering that principal components are mutually orthogonal. We propose a novel parameter-expanded coordinate ascent variational inference (PX-CAVI) algorithm. This algorithm utilizes a spike and slab prior, which incorporates parameter expansion to cope with the orthogonality constraint. Besides comparing with two popular SPCA approaches, we introduce the PX-EM algorithm as an EM analogue of the PX-CAVI algorithm. Through extensive numerical simulations, we demonstrate that the PX-CAVI algorithm outperforms these SPCA approaches. We also study the posterior contraction rate of the variational posterior, providing a novel contribution to the existing literature. The PX-CAVI algorithm is then applied to a lung cancer gene expression dataset. The R package VBsparsePCA, with an implementation of the algorithm, is available on the Comprehensive R Archive Network (CRAN).
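To make concrete what SPCA estimates, here is a bare-bones thresholded power iteration for a single sparse loading vector. It is emphatically not the paper's PX-CAVI: Bayesian approaches like the one above replace the hard threshold with a spike-and-slab prior and handle the orthogonality constraint via parameter expansion.

```python
# Not PX-CAVI: a simple thresholded power iteration recovering one sparse
# loading, shown only to illustrate the SPCA estimation target.
import numpy as np

rng = np.random.default_rng(9)
true_v = np.zeros(20)
true_v[:4] = 0.5                                         # sparse true loading
X = rng.normal(size=(200, 20)) + rng.normal(size=(200, 1)) * 3 * true_v
S = X.T @ X / len(X)                                     # sample covariance

v = rng.normal(size=20)
for _ in range(100):
    v = S @ v                                            # power step
    v[np.abs(v) < 0.3 * np.abs(v).max()] = 0.0           # hard threshold
    v /= np.linalg.norm(v)                               # renormalize

print(np.round(v, 2))                                    # support ~ first 4 coords
```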
{"title":"Spike and slab Bayesian sparse principal component analysis","authors":"Yu-Chien Bo Ning, Ning Ning","doi":"10.1007/s11222-024-10430-8","DOIUrl":"https://doi.org/10.1007/s11222-024-10430-8","url":null,"abstract":"<p>Sparse principal component analysis (SPCA) is a popular tool for dimensionality reduction in high-dimensional data. However, there is still a lack of theoretically justified Bayesian SPCA methods that can scale well computationally. One of the major challenges in Bayesian SPCA is selecting an appropriate prior for the loadings matrix, considering that principal components are mutually orthogonal. We propose a novel parameter-expanded coordinate ascent variational inference (PX-CAVI) algorithm. This algorithm utilizes a spike and slab prior, which incorporates parameter expansion to cope with the orthogonality constraint. Besides comparing to two popular SPCA approaches, we introduce the PX-EM algorithm as an EM analogue to the PX-CAVI algorithm for comparison. Through extensive numerical simulations, we demonstrate that the PX-CAVI algorithm outperforms these SPCA approaches, showcasing its superiority in terms of performance. We study the posterior contraction rate of the variational posterior, providing a novel contribution to the existing literature. The PX-CAVI algorithm is then applied to study a lung cancer gene expression dataset. The <span>(textsf{R})</span> package <span>(textsf{VBsparsePCA})</span> with an implementation of the algorithm is available on the Comprehensive R Archive Network (CRAN).</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"47 1","pages":""},"PeriodicalIF":2.2,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140941321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}