
Computational Statistics: latest articles

Sparse Bayesian multidimensional scaling(s).
IF 1.4 | Zone 4 (Mathematics) | Q3 Statistics & Probability | Pub Date: 2026-01-01 | Epub Date: 2025-12-24 | DOI: 10.1007/s00180-025-01696-1
Ami Sheth, Aaron Smith, Andrew J Holbrook

Bayesian multidimensional scaling (BMDS) is a probabilistic dimension reduction tool that allows one to model and visualize data consisting of dissimilarities between pairs of objects. Although BMDS has proven useful within, e.g., Bayesian phylogenetic inference, its likelihood and gradient calculations require burdensome O(N^2) floating-point operations, where N is the number of data points. Thus, BMDS becomes impractical as N grows large. We propose and compare two sparse versions of BMDS (sBMDS) that apply log-likelihood and gradient computations to subsets of the observed dissimilarity matrix data. Landmark sBMDS (L-sBMDS) extracts columns, while banded sBMDS (B-sBMDS) extracts diagonals of the data. These sparse variants let one specify a time complexity between N^2 and N. Under simplified settings, we prove posterior consistency for subsampled distance matrices. Through simulations, we examine the accuracy and computational efficiency across all models using both the Metropolis-Hastings and Hamiltonian Monte Carlo algorithms. We observe approximately 3-fold, 10-fold and 40-fold speedups with negligible loss of accuracy when applying the sBMDS likelihoods and gradients to 500, 1,000 and 5,000 data points with 50 bands (landmarks); these speedups only increase with the size of data considered. Finally, we apply the sBMDS variants to: (1) the phylogeographic modeling of multiple influenza subtypes to better understand how these strains spread through global air transportation networks and (2) the clustering of arXiv manuscripts based on low-dimensional representations of article abstracts. In the first application, sBMDS contributes to holistic uncertainty quantification within a larger Bayesian hierarchical model. In the second, sBMDS approximates uncertainty quantification for a downstream modeling task.
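The column and diagonal subsets described in this abstract can be illustrated with a short Python sketch. It is not the authors' implementation: the function names are hypothetical, the landmarks are assumed to be the first few objects, and a plain squared misfit stands in for the actual BMDS log-likelihood, which the abstract does not spell out.

```python
import numpy as np

def band_pairs(n, n_bands):
    """Pairs (i, j), i < j, on the first n_bands off-diagonals of an n x n matrix."""
    offsets = range(1, n_bands + 1)
    i = np.concatenate([np.arange(n - b) for b in offsets])
    j = np.concatenate([np.arange(b, n) for b in offsets])
    return i, j

def landmark_pairs(n, n_landmarks):
    """Pairs (i, j), i < j, in which object i is one of the first n_landmarks objects."""
    i = np.concatenate([np.full(n - 1 - l, l) for l in range(n_landmarks)])
    j = np.concatenate([np.arange(l + 1, n) for l in range(n_landmarks)])
    return i, j

def sparse_misfit(D, X, pairs):
    """Squared misfit between observed and latent distances over the kept pairs only.

    Stand-in for a likelihood term (an assumption made for illustration).
    """
    i, j = pairs
    latent = np.linalg.norm(X[i] - X[j], axis=1)
    return float(np.sum((D[i, j] - latent) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))                          # hypothetical latent locations
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # noiseless dissimilarities
n_full = 10 * 9 // 2                                  # all unordered pairs
n_band = band_pairs(10, 3)[0].size
n_land = landmark_pairs(10, 3)[0].size
print(n_full, n_band, n_land)
```

With N = 10 objects and 3 bands or 3 landmarks, both subsets keep 24 of the 45 pairs. In general each keeps BN - B(B+1)/2 pairs, so the work grows roughly linearly in N for a fixed number of bands or landmarks.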

Computational Statistics 41(1), 12. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12738595/pdf/
Citations: 0
A stochastic approach to k-nearest neighbors search using a fixed radius method.
IF 1.4 | Zone 4 (Mathematics) | Q3 Statistics & Probability | Pub Date: 2026-01-01 | Epub Date: 2026-01-13 | DOI: 10.1007/s00180-025-01674-7
Brahian Cano Urrego, Alexander Alsup, Jeffrey A Thompson, Devin C Koestler

This study aims to optimize the k-nearest neighbors search (kNN search) by reducing the computational burden of the well-known Brute-force method while providing the same solution. While there exist rule-based approaches for reducing the computational burden of the kNN search, methods that use the stochastic patterns inherent to the data are lacking. Our method leverages data structures and probabilistic assumptions to enhance the scalability of the search. By focusing on the Training set where our neighbors reside, we define a sample space that limits the k-nearest neighbors search to a smaller space. For each observation in the Query set (e.g., the set of observations for which a classification is desired), a fixed radius search is employed, with the radius stochastically linked to the desired number of neighbors. This approach allows us to find the k-nearest neighbors using only a fraction of the entire Training set in contrast to the Brute-force method, which requires distances to be calculated between each observation in the Training set and each observation in the Query set. Through simulations and a theoretical computational complexity analysis, we demonstrate that our method outperforms the Brute-force approach, particularly when the Training and Query set sample sizes are large. In addition, a benchmarked comparison of our approach and the Brute-force method on an Alzheimer's disease data set further demonstrated this, showing a 27.57-fold improvement in total elapsed time. Overall, our stochastic approach significantly reduces the computational load of kNN search while maintaining accuracy, making it a viable alternative to traditional methods for large datasets.
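The idea of tying a fixed search radius to the desired number of neighbors can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' algorithm: the radius rule assumes training points spread uniformly over the unit cube, the candidate filter simply sorts on the first coordinate, and the names `initial_radius` and `knn_fixed_radius` are hypothetical. The function expects k to be no larger than the training set size.

```python
import math
import numpy as np

def initial_radius(n_train, dim, k, inflate=2.0):
    """Radius whose ball is expected to hold inflate * k points if the
    training data were uniform on the unit cube (illustrative assumption)."""
    unit_ball = math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)
    return (inflate * k / (n_train * unit_ball)) ** (1.0 / dim)

def knn_fixed_radius(train, queries, k, radius):
    """Exact k nearest neighbours of each query via a growing fixed-radius search."""
    order = np.argsort(train[:, 0])       # sort once by the first coordinate
    first = train[order, 0]
    out = []
    for q in queries:
        r = radius
        while True:
            # A point farther than r along one axis cannot lie inside the ball.
            lo = np.searchsorted(first, q[0] - r, side="left")
            hi = np.searchsorted(first, q[0] + r, side="right")
            cand = order[lo:hi]
            dist = np.linalg.norm(train[cand] - q, axis=1)
            inside = dist <= r
            if inside.sum() >= k:
                # At least k points within r, so the k nearest overall are among them.
                out.append(cand[inside][np.argsort(dist[inside])[:k]])
                break
            r *= 2.0                      # too few points: enlarge and retry
    return np.array(out)

rng = np.random.default_rng(1)
train = rng.random((500, 2))
queries = rng.random((20, 2))
k = 5
found = knn_fixed_radius(train, queries, k, initial_radius(500, 2, k))
```

Because any point outside the ball is farther away than every point inside it, the result matches the brute-force answer whenever the ball holds at least k training points; distances are only computed for the candidates that survive the coordinate filter.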

Computational Statistics 41(1), 27. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12799653/pdf/
Citations: 0
A latent class pattern mixture model for nonignorable nonresponses in multivariate categorical data.
IF 1.4 | Zone 4 (Mathematics) | Q3 Statistics & Probability | Pub Date: 2025-11-01 | Epub Date: 2025-05-01 | DOI: 10.1007/s00180-025-01627-0
Jungwun Lee, Margaret Lloyd Sieger, Jon D Phillips

Survey data using categorical item variables are widely used in applied research such as psychology, education, and behavioral studies. Unfortunately, survey data are highly susceptible to nonignorable missing values that may threaten the validity of statistical inference if naively ignored or inappropriately treated. This paper proposes a novel latent pattern mixture model for nonignorable missing values in multivariate categorical outcomes. The proposed model posits the existence of two categorical latent variables; one latent variable represents a nonresponse pattern, and the other represents a response pattern conditional on the nonresponse pattern. We propose two parameter estimation strategies: maximum-likelihood (ML) estimation using the expectation-maximization (EM) algorithm and Bayesian estimation using the Markov chain Monte Carlo (MCMC) algorithm. Simulation studies revealed that ML estimation is preferred to Bayesian estimation with noninformative priors in terms of standardized biases when the sample size is large, whereas Bayesian estimation can be preferred when the sample size is small. Finally, our real data example analyzed a data set on parental substance use disorder and revealed six latent classes of participants that are distinguished by their response and missingness patterns.

Computational Statistics 40(8), 4367-4397. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12867129/pdf/
Citations: 0
Approximate Bayesian inference in a model for self-generated gradient collective cell movement.
IF 1 | Zone 4 (Mathematics) | Q3 Statistics & Probability | Pub Date: 2025-01-01 | Epub Date: 2025-03-08 | DOI: 10.1007/s00180-025-01606-5
Jon Devlin, Agnieszka Borowska, Dirk Husmeier, John Mackenzie

In this article we explore parameter inference in a novel hybrid discrete-continuum model describing the movement of a population of cells in response to a self-generated chemotactic gradient. The model employs a drift-diffusion stochastic process, rendering likelihood-based inference methods impractical. Consequently, we consider approximate Bayesian computation (ABC) methods, which have gained popularity for models with intractable or computationally expensive likelihoods. ABC involves simulating from the generative model and retaining the parameters whose simulated observations are "close enough" to the true data; the retained parameters approximate the posterior distribution. Given the plethora of existing ABC methods, selecting the most suitable one for a specific problem can be challenging. To address this, we employ a simple drift-diffusion stochastic differential equation (SDE) as a benchmark problem. This allows us to assess the accuracy of popular ABC algorithms under known configurations. We also evaluate the bias between ABC-posteriors and the exact posterior for the basic SDE model, where the posterior distribution is tractable. The top-performing ABC algorithms are subsequently applied to the proposed cell movement model to infer its key parameters. This study not only contributes to understanding cell movement but also sheds light on the comparative efficiency of different ABC algorithms in a well-defined context.
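The simulate-and-accept step that ABC relies on can be shown with plain rejection ABC on a constant-coefficient drift-diffusion SDE, dX = mu dt + sigma dW. The sketch below is only an illustration: the flat prior on mu, the known sigma, the summary statistic, and the tolerance are all assumptions chosen for brevity, and the abstract does not say which ABC variants the authors compare.

```python
import numpy as np

def simulate_increments(mu, sigma, dt, n_steps, rng):
    """Exact increments of dX = mu dt + sigma dW on a regular time grid."""
    return mu * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_steps)

def abc_rejection(observed, sigma, dt, n_draws, eps, rng):
    """Keep prior draws of mu whose simulated summary lands within eps of the observed one."""
    s_obs = observed.mean() / dt                  # summary statistic: estimated drift
    accepted = []
    for _ in range(n_draws):
        mu = rng.uniform(-5.0, 5.0)               # hypothetical flat prior on the drift
        sim = simulate_increments(mu, sigma, dt, observed.size, rng)
        if abs(sim.mean() / dt - s_obs) < eps:    # "close enough" to the data
            accepted.append(mu)
    return np.array(accepted)

rng = np.random.default_rng(0)
observed = simulate_increments(1.5, 0.5, 0.01, 1000, rng)   # synthetic data, true mu = 1.5
posterior = abc_rejection(observed, 0.5, 0.01, 5000, 0.1, rng)
```

Shrinking `eps` brings the accepted draws closer to the exact posterior at the cost of a lower acceptance rate, which is the trade-off that more elaborate ABC schemes try to improve on.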

Computational Statistics 40(7), 3399-3452. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12255578/pdf/
Citations: 0
A powerful penalized multinomial logistic regression approach.
IF 1.4 | Zone 4 (Mathematics) | Q3 Statistics & Probability | Pub Date: 2025-01-01 | Epub Date: 2025-05-25 | DOI: 10.1007/s00180-025-01635-0
Cornelia Fuetterer, Malte Nalenz, Thomas Augustin, Ruth M Pfeiffer

Penalized regression methods that shrink model coefficients are popular approaches to improve prediction and to select variables in high-dimensional settings. We present a penalized (or regularized) regression approach for multinomial logistic models for categorical outcomes with a novel adaptive L1-type penalty term that incorporates weights based on intra- and inter-outcome category distances of each predictor. A predictor that has large between- and small within-outcome category distances is penalized less and has a higher likelihood of being selected for the final model. We propose and study three measures for weight calculation: an analysis of variance (ANOVA)-based measure and two indices used in clustering approaches. Our novel approach, which we term the discriminative power lasso (DP-lasso), thus combines elements of marginal screening with regularized regression methods. We studied the performance of DP-lasso and other published methods in simulations with varying numbers of outcome categories, numbers of predictors, strengths of associations and predictor correlation structures. For correlated predictors, the DP-lasso approach with ANOVA-based weights (DPan) resulted in much sparser models than other regularization approaches, especially in high-dimensional settings. When the number p of (correlated) predictors was much larger than the available sample size N, DPan had the highest true positive rate while maintaining low false positive rates for all simulation settings. Similarly, when p < N, DPan had high true positive rates and the lowest false positive rates of all methods studied. Thus we recommend DPan for analysing categorical outcomes in relation to high-dimensional predictors. We further illustrate all approaches in ultra high-dimensional settings, using several single-cell RNA-sequencing datasets.

Supplementary information: The online version contains supplementary material available at 10.1007/s00180-025-01635-0.
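To make the ANOVA-based weighting concrete, the Python sketch below computes a one-way ANOVA F statistic for each predictor and turns it into a penalty weight. The F statistic itself is standard; taking the weight as 1/F is an assumption made here for illustration, since the abstract does not give the exact DPan weight formula, and the function names are hypothetical.

```python
import numpy as np

def anova_f(X, y):
    """One-way ANOVA F statistic of every column of X across the classes in y."""
    classes = np.unique(y)
    n, k = X.shape[0], classes.size
    grand = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += Xc.shape[0] * (Xc.mean(axis=0) - grand) ** 2   # between-class sum of squares
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)       # within-class sum of squares
    return (between / (k - 1)) / (within / (n - k))

def penalty_weights(X, y):
    """Hypothetical adaptive weights: strong class separation gives a small penalty."""
    return 1.0 / anova_f(X, y)

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 50)
X = rng.normal(size=(150, 4))
X[:, 0] += y                     # only the first predictor separates the classes
weights = penalty_weights(X, y)
```

In an adaptive L1-type penalty of the usual form, each weight multiplies the absolute coefficients of its predictor, so the well-separated first predictor receives the smallest weight here and is shrunk the least.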

Computational Statistics 40(8), 4565-4587. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12552268/pdf/
Citations: 0
Misspecification-robust likelihood-free inference in high dimensions.
IF 1.4 | Zone 4 (Mathematics) | Q3 Statistics & Probability | Pub Date: 2025-01-01 | Epub Date: 2025-05-03 | DOI: 10.1007/s00180-025-01607-4
Owen Thomas, Raquel Sá-Leão, Hermínia de Lencastre, Samuel Kaski, Jukka Corander, Henri Pesonen

Likelihood-free inference for simulator-based statistical models has developed rapidly from its infancy to a useful tool for practitioners. However, models with more than a handful of parameters still generally remain a challenge for approximate Bayesian computation (ABC) based inference. To advance the possibilities for performing likelihood-free inference in higher dimensional parameter spaces, we introduce an extension of the popular Bayesian optimisation based approach to approximating discrepancy functions in a probabilistic manner, which lends itself to an efficient exploration of the parameter space. Our approach achieves computational scalability for higher dimensional parameter spaces by using separate acquisition functions, discrepancies, and associated summary statistics for distinct subsets of the parameters. The efficient additive acquisition structure is combined with exponentiated loss-likelihood to provide a misspecification-robust characterisation of posterior distributions for subsets of model parameters. The method successfully performs computationally efficient inference in a moderately sized parameter space and compares favourably to existing modularised ABC methods. We further illustrate the potential of this approach by fitting a bacterial transmission dynamics model to a real data set, which provides biologically coherent results on strain competition in a 30-dimensional parameter space.

Computational Statistics 40(8), 4399-4439. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12552272/pdf/
Citations: 0
Bayes estimation of ratio of scale-like parameters for inverse Gaussian distributions and applications to classification
IF 1.3 | Zone 4 (Mathematics) | Q3 Statistics & Probability | Pub Date: 2024-09-19 | DOI: 10.1007/s00180-024-01554-6
Ankur Chakraborty, Nabakumar Jana

We consider two inverse Gaussian populations with a common mean but different scale-like parameters, where all parameters are unknown. We construct noninformative priors for the ratio of the scale-like parameters to derive matching priors of different orders. Reference priors are proposed for different groups of parameters. The Bayes estimators of the common mean and ratio of the scale-like parameters are also derived. We propose confidence intervals of the conditional error rate in classifying an observation into inverse Gaussian distributions. A generalized variable-based confidence interval and the highest posterior density credible intervals for the error rate are computed. We estimate parameters of the mixture of these inverse Gaussian distributions and obtain estimates of the expected probability of correct classification. An intensive simulation study has been carried out to compare the estimators and expected probability of correct classification. Real data-based examples are given to show the practicality and effectiveness of the estimators.

Citations: 0
Multivariate approaches to investigate the home and away behavior of football teams playing football matches
IF 1.3 | Zone 4 (Mathematics) | Q3 STATISTICS & PROBABILITY | Pub Date: 2024-09-17 | DOI: 10.1007/s00180-024-01553-7
Antonello D’Ambra, Pietro Amenta, Antonio Lucadamo

Compared to other European competitions, participation in the UEFA Champions League is a real “bargain” for football clubs because of the hefty bonuses awarded for performance during the group qualification phase. Performing successfully in football depends on several multidimensional factors, and analyzing the main ones remains challenging. In performance studies, little attention has been paid to how teams behave when playing at home and away. Our study combines statistical techniques to develop a procedure for examining teams’ performance. Several considerations make the 2022–2023 Serie A league season particularly interesting to analyze with our approach. Except for Napoli, all the teams showed different home and away behaviors with respect to the results obtained at the season’s end. Ball possession and corners positively influenced the points scored in both home and away games, though with different impacts. The precision indicator was not an essential variable. The procedure highlighted the negative roles played by offside, as well as yellow and red cards.

Citations: 0
Kendall correlations and radar charts to include goals for and goals against in soccer rankings
IF 1.3 | Zone 4 (Mathematics) | Q3 STATISTICS & PROBABILITY | Pub Date: 2024-09-17 | DOI: 10.1007/s00180-024-01542-w
Roy Cerqueti, Raffaele Mattera, Valerio Ficcadenti

This paper deals with the challenging theme of how sporting teams and athletes are ranked in sports competitions. Starting from the paradigmatic case of soccer, we advance a new method for ranking teams in official national championships through computational statistics methods based on Kendall correlations and radar charts. In detail, we consider the goals scored for and against each team in individual matches as a further source of score assignment beyond the usual win-tie-lose trichotomy. Our approach overcomes some biases in the scoring rules currently employed. The methodological proposal is tested on the relevant case of the Italian “Serie A” championships played during 1930–2023.
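
As a rough illustration of the Kendall correlation this abstract refers to (not the authors' ranking method, which the abstract does not spell out), the sketch below computes Kendall's tau-a between two ways of ordering the same teams. The function name `kendall_tau` and the season totals are hypothetical, made up for the example, and ties are simply counted as neither concordant nor discordant.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two equally long score lists (no tie correction)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # A pair is concordant when both scores order teams i and j the same way.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical season totals for five teams (invented numbers, not real data).
points = [90, 72, 70, 64, 59]
goals_for = [77, 60, 71, 56, 58]

tau = kendall_tau(points, goals_for)
print(f"Kendall tau between points and goals scored: {tau:.2f}")
```

A value near 1 would mean that goals scored order the teams almost exactly as points do, so adding them to the ranking changes little; lower values indicate that goals carry information the win-tie-lose points do not.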

Citations: 0
Bayesian adaptive lasso quantile regression with non-ignorable missing responses
IF 1.3 | Zone 4 (Mathematics) | Q3 STATISTICS & PROBABILITY | Pub Date: 2024-09-16 | DOI: 10.1007/s00180-024-01546-6
Ranran Chen, Mai Dao, Keying Ye, Min Wang

In this paper, we develop a fully Bayesian adaptive lasso quantile regression model to analyze data with non-ignorable missing responses, which frequently occur in various fields of study. Specifically, we employ a logistic regression model to handle missing data arising from a non-ignorable mechanism. By using the asymmetric Laplace working likelihood for the data and specifying Laplace priors for the regression coefficients, our proposed method extends the Bayesian lasso framework by imposing coefficient-specific penalization parameters, enhancing our estimation and variable selection capability. Furthermore, we embrace the normal-exponential mixture representation of the asymmetric Laplace distribution and the Student-t approximation of the logistic regression model to develop a simple and efficient Gibbs sampling algorithm for generating posterior samples and making statistical inferences. The finite-sample performance of the proposed algorithm is investigated through various simulation studies and a real-data example.
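
The asymmetric Laplace working likelihood mentioned in this abstract is tied to quantile regression through the check loss: maximizing that likelihood in the location parameter is the same as minimizing the check loss. The sketch below illustrates only this connection on a toy sample. It is not the authors' Gibbs sampler, and the names `check_loss`, `ald_loglik` and the data are hypothetical.

```python
import math


def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def ald_loglik(ys, mu, sigma, tau):
    """Log-likelihood of an asymmetric Laplace sample with location mu, scale sigma."""
    const = math.log(tau * (1.0 - tau) / sigma)
    return sum(const - check_loss((y - mu) / sigma, tau) for y in ys)


def total_loss(q, ys, tau):
    """Sum of check losses when q is used as the tau-th quantile of ys."""
    return sum(check_loss(y - q, tau) for y in ys)


# Toy sample with one large outlier (invented numbers).
ys = [1.0, 2.0, 3.0, 4.0, 100.0]
tau = 0.5

# Search over the observed values; for tau = 0.5 the minimizer is the median.
best = min(ys, key=lambda q: total_loss(q, ys, tau))
# The same candidate also maximizes the asymmetric Laplace log-likelihood.
best_ald = max(ys, key=lambda mu: ald_loglik(ys, mu, 1.0, tau))
print(best, best_ald)
```

Because the loss grows only linearly in the residual, the outlier at 100 does not pull the estimate away from the median, which is one reason quantile regression is often described as robust.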

Citations: 0