Statistics and Computing: Latest Articles

Unifying Summary Statistic Selection for Approximate Bayesian Computation.
IF 1.6; Zone 2 (Mathematics); Q2, Computer Science, Theory & Methods; Pub Date: 2026-01-01; Epub Date: 2026-01-27; DOI: 10.1007/s11222-025-10808-2
Till Hoffmann, Jukka-Pekka Onnela

Extracting low-dimensional summary statistics from large datasets is essential for efficient (likelihood-free) inference. We characterize three different classes of summaries and demonstrate their importance for correctly analyzing dimensionality reduction algorithms. We demonstrate that minimizing the expected posterior entropy (EPE) under the prior predictive distribution of the model provides a unifying principle that subsumes many existing methods; they are shown to be equivalent to, or special or limiting cases of, minimizing the EPE. We offer a unifying framework for obtaining informative summaries and propose a practical method using conditional density estimation to learn high-fidelity summaries automatically. We evaluate this approach on diverse problems, including a challenging benchmark model with a multi-modal posterior, a population genetics model, and a dynamic network model of growing trees. The results show that EPE-minimizing summaries can lead to posterior inference that is competitive with, and in some cases superior to, dedicated likelihood-based approaches, providing a powerful and general tool for practitioners.
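
To make the role of a summary statistic concrete, the following is a minimal sketch of plain ABC rejection sampling on a toy problem. It is not the EPE-minimizing method of the paper: the normal model, the standard normal prior, the sample-mean summary, the tolerance, and the function names `summary` and `abc_rejection` are all illustrative assumptions.

```python
import random
import statistics


def summary(data):
    """Low-dimensional summary (assumed here): the sample mean."""
    return statistics.fmean(data)


def abc_rejection(observed, n_draws=20000, tolerance=0.05, seed=0):
    """Toy ABC rejection sampler for the mean of a unit-variance normal.

    Assumed prior: theta ~ Normal(0, 1). A prior draw is kept when the
    summary of its simulated dataset lies within `tolerance` of the
    summary of the observed dataset.
    """
    rng = random.Random(seed)
    n = len(observed)
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.gauss(0.0, 1.0)  # draw a parameter from the prior
        simulated = [rng.gauss(theta, 1.0) for _ in range(n)]  # simulate data
        if abs(summary(simulated) - s_obs) <= tolerance:
            accepted.append(theta)  # approximate posterior sample
    return accepted
```

Calling `abc_rejection(data)` returns a list of accepted parameter draws. In this toy case the sample mean is sufficient, so the comparison loses no information; for models without an obvious sufficient statistic, the choice of summary is the problem the abstract addresses.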

Statistics and Computing 36(2), 70. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12847231/pdf/
Citations: 0
A Support vector machine-based mixture cure model for mixed case interval censored data.
IF 1.6; Zone 2 (Mathematics); Q2, Computer Science, Theory & Methods; Pub Date: 2026-01-01; Epub Date: 2026-01-16; DOI: 10.1007/s11222-025-10796-3
Suvra Pal, Wisdom Aselisewine

We propose a semi-parametric two-component model for the analysis of mixed case interval censored (MCIC) data with a cured subgroup. Such data occur when the time to an event of interest is only known to belong to an interval obtained from a sequence of, say, k random examination time points, where k is an integer. Furthermore, there is a proportion of subjects who would never be susceptible to the event. The first component of the proposed model describes the probability of cure, and it replaces the traditional generalized linear model with a more flexible support vector machine (SVM)-based approach capable of capturing complex covariate effects. The second component of the proposed model describes the survival distribution of the uncured and is modeled using a Cox proportional hazards structure to preserve the easy interpretation of covariate effects. To the best of our knowledge, this is the first work that employs a machine learning algorithm to analyze MCIC data in the presence of a cured subgroup. To estimate the model parameters, we develop an expectation maximization algorithm. A detailed simulation study demonstrates the superiority of the proposed SVM-based model. Finally, we analyze NASA's Hypobaric Decompression Sickness Data using the proposed approach.

Supplementary information: The online version contains supplementary material available at 10.1007/s11222-025-10796-3.

Statistics and Computing 36(2), 63. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12811344/pdf/
Citations: 0
Model-based clustering of time-dependent observations with common structural changes.
IF 1.6; Zone 2 (Mathematics); Q2, Computer Science, Theory & Methods; Pub Date: 2026-01-01; Epub Date: 2025-10-28; DOI: 10.1007/s11222-025-10756-x
Riccardo Corradin, Luca Danese, Wasiur R KhudaBukhsh, Andrea Ongaro

We propose a novel model-based clustering approach for samples of time series. We assume as a unique commonality that two observations belong to the same group if structural changes in their behaviors happen at the same time. We resort to a latent representation of structural changes in each time series, based on random orders, to induce ties among different observations. Such an approach results in a general modeling strategy and can be combined with many time-dependent models already known in the literature. Our studies have been motivated by an epidemiological problem. Specifically, we want to provide clusters of different countries of the European Union where two countries belong to the same cluster if the spreading processes of the COVID-19 virus show structural changes at the same time.

Supplementary information: The online version contains supplementary material available at 10.1007/s11222-025-10756-x.

Statistics and Computing 36(1), 7. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12568813/pdf/
Citations: 0
A Neural Network Integrated Accelerated Failure Time-Based Mixture Cure Model.
IF 1.6; Zone 2 (Mathematics); Q2, Computer Science, Theory & Methods; Pub Date: 2025-10-01; Epub Date: 2025-06-22; DOI: 10.1007/s11222-025-10674-y
Wisdom Aselisewine, Suvra Pal

The mixture cure rate model (MCM) is commonly used for analyzing survival data with a cured subgroup. While the prevailing approach to modeling the probability of cure involves a generalized linear model using a known parametric link function, such as the logit link function, it has limitations in capturing the complex effects of covariates on cure probability. This paper introduces a novel MCM employing a neural network-based classifier for cure probability and an accelerated failure time structure for the survival distribution of uncured patients. An expectation maximization algorithm is developed for parameter estimation. Simulation results demonstrate the superior performance of the proposed model in capturing non-linear classification boundaries compared to logit-based and spline-based MCMs, as well as other machine learning algorithms. This enhances the accuracy and precision of cure probability estimates, improving predictive accuracy. The proposed model and estimation method are applied to survival data on leukemia patients, showcasing their effectiveness.
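
The two-component structure described above can be sketched with the conventional logit-link baseline that the paper sets out to improve on. This is an illustrative sketch, not the authors' neural network model: the exponential latency distribution and the names `uncured_probability` and `population_survival` are assumptions chosen for brevity.

```python
import math


def uncured_probability(z, coefficients, intercept=0.0):
    """Logit-link probability of being uncured (susceptible) given covariates z.

    The cure probability is one minus this value.
    """
    linear = intercept + sum(b * v for b, v in zip(coefficients, z))
    return 1.0 / (1.0 + math.exp(-linear))


def population_survival(t, z, coefficients, rate, intercept=0.0):
    """Population survival S_pop(t | z) = (1 - pi(z)) + pi(z) * S_u(t).

    S_u is an exponential survival function with hazard `rate`, an assumed
    placeholder for the latency model of the uncured subjects.
    """
    pi = uncured_probability(z, coefficients, intercept)
    survival_uncured = math.exp(-rate * t)
    return (1.0 - pi) + pi * survival_uncured
```

As `t` grows, `population_survival` levels off at the cure probability rather than decaying to zero, which is the plateau that distinguishes cure models from standard survival models.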

Statistics and Computing 35(5). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12369597/pdf/
Citations: 0
Bootstrap estimation of the proportion of outliers in robust regression.
IF 1.6; Zone 2 (Mathematics); Q2, Computer Science, Theory & Methods; Pub Date: 2025-02-01; Epub Date: 2024-11-16; DOI: 10.1007/s11222-024-10526-1
Qiang Heng, Kenneth Lange

This paper presents a nonparametric bootstrap method for estimating the proportions of inliers and outliers in robust regression models. Our approach is based on the concept of stability, providing robustness against distributional assumptions and eliminating the need for pre-specified confidence levels. Through numerical experiments, we demonstrate that this method yields more accurate and stable estimates than existing alternatives. Additionally, the generated instability paths offer a valuable graphical tool for understanding the inlier and outlier distributions within the data. The method naturally extends to generalized linear models, where we find that variance-stabilizing transformations produce residuals that are well-suited for outlier detection. Applications to two real-world datasets further illustrate the practical utility of our approach in identifying outliers.

Statistics and Computing 35(1). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12077844/pdf/
Citations: 0
Simulation based composite likelihood.
IF 1.6; Zone 2 (Mathematics); Q2, Computer Science, Theory & Methods; Pub Date: 2025-01-01; Epub Date: 2025-02-25; DOI: 10.1007/s11222-025-10584-z
Lorenzo Rimella, Chris Jewell, Paul Fearnhead

Inference for high-dimensional hidden Markov models is challenging due to the exponential-in-dimension computational cost of calculating the likelihood. To address this issue, we introduce an innovative composite likelihood approach called "Simulation Based Composite Likelihood" (SimBa-CL). With SimBa-CL, we approximate the likelihood by the product of its marginals, which we estimate using Monte Carlo sampling. In a similar vein to approximate Bayesian computation (ABC), SimBa-CL requires multiple simulations from the model, but, in contrast to ABC, it provides a likelihood approximation that guides the optimization of the parameters. Leveraging automatic differentiation libraries, it is simple to calculate gradients and Hessians to not only speed up optimization but also to build approximate confidence sets. We present extensive empirical results which validate our theory and demonstrate its advantage over SMC, and apply SimBa-CL to real-world Aphtovirus data.

Supplementary information: The online version contains supplementary material available at 10.1007/s11222-025-10584-z.

Statistics and Computing 35(3), 58. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11861035/pdf/
Citations: 0
Sequential Bayesian Registration for Functional Data.
IF 1.6; Zone 2 (Mathematics); Q2, Computer Science, Theory & Methods; Pub Date: 2025-01-01; Epub Date: 2025-05-27; DOI: 10.1007/s11222-025-10640-8
Yoonji Kim, Oksana A Chkrebtii, Sebastian A Kurtek

In many modern applications, discretely-observed data may be naturally understood as a set of functions. Functional data often exhibit two confounded sources of variability: amplitude (y-axis) and phase (x-axis). The extraction of amplitude and phase, a process known as registration, is essential in exploring the underlying structure of functional data in a variety of areas, from environmental monitoring to medical imaging. Critically, such data are often gathered sequentially with new functional observations arriving over time. Despite this, existing registration procedures do not sequentially update inference based on the new data, requiring model refitting. To address these challenges, we introduce a Bayesian framework for sequential registration of functional data, which updates statistical inference as new sets of functions are assimilated. This Bayesian model-based sequential learning approach utilizes sequential Monte Carlo sampling to recursively update the alignment of observed functions while accounting for associated uncertainty. Distributed computing significantly reduces computational cost relative to refitting the model using an iterative method such as Markov chain Monte Carlo on the full data. Simulation studies and comparisons reveal that the proposed approach performs well even when the target posterior distribution has a challenging structure. We apply the proposed method to three real datasets: (1) functions of annual drought intensity near Kaweah River in California, (2) annual sea surface salinity functions near Null Island, and (3) a sequence of repeated patterns in electrocardiogram signals.
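
The distinction between amplitude and phase variability can be illustrated with a small sketch in which an observed function is a template composed with a time warp. The bump-shaped template, the power warp, and the helper names are hypothetical choices for illustration; the paper's Bayesian registration model is not reproduced here.

```python
import math


def template(t):
    """Amplitude component: a single bump on [0, 1] peaking at t = 0.5."""
    return math.exp(-50.0 * (t - 0.5) ** 2)


def warp(t, power):
    """Phase component: an increasing map of [0, 1] onto itself (power > 0)."""
    return t ** power


def observed(t, power):
    """Observed function: the template evaluated at warped time."""
    return template(warp(t, power))


def peak(power, grid_size=1001):
    """Return (location, height) of the maximum of the observed function on a grid."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    location = max(grid, key=lambda t: observed(t, power))
    return location, observed(location, power)
```

Comparing `peak(1.0)` with `peak(2.0)` shows the peak moving along the x-axis while its height stays essentially the same. Registration aims to recover the warp so that functions can be compared on their amplitude alone.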

Statistics and Computing 35(4), 108. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12116714/pdf/
Citations: 0
Outcome-guided spike-and-slab Lasso Biclustering: A Novel Approach for Enhancing Biclustering Techniques for Gene Expression Analysis.
IF 1.6; Zone 2 (Mathematics); Q2, Computer Science, Theory & Methods; Pub Date: 2025-01-01; Epub Date: 2025-08-28; DOI: 10.1007/s11222-025-10709-4
Luis A Vargas-Mieles, Paul D W Kirk, Chris Wallace

Biclustering has gained interest in gene expression data analysis due to its ability to identify groups of samples that exhibit similar behaviour in specific subsets of genes (or vice versa), in contrast to traditional clustering methods that classify samples based on all genes. Despite advances, biclustering remains a challenging problem, even with cutting-edge methodologies. This paper introduces an extension of the recently proposed Spike-and-Slab Lasso Biclustering (SSLB) algorithm, termed Outcome-Guided SSLB (OG-SSLB), aimed at enhancing the identification of biclusters in gene expression analysis. Our proposed approach integrates disease outcomes into the biclustering framework through Bayesian profile regression. By leveraging additional clinical information, OG-SSLB improves the interpretability and relevance of the resulting biclusters. Comprehensive simulations and numerical experiments demonstrate that OG-SSLB achieves superior performance, with improved accuracy in estimating the number of clusters and higher consensus scores compared to the original SSLB method. Furthermore, OG-SSLB effectively identifies meaningful patterns and associations between gene expression profiles and disease states. These promising results demonstrate the effectiveness of OG-SSLB in advancing biclustering techniques, providing a powerful tool for uncovering biologically relevant insights. The OGSSLB software can be found as an R/C++ package at https://github.com/luisvargasmieles/OGSSLB.

Statistics and Computing 35(6), 179. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12394340/pdf/
Citations: 0
Extended fiducial inference for individual treatment effects via deep neural networks.
IF 1.6; Zone 2 (Mathematics); Q2, Computer Science, Theory & Methods; Pub Date: 2025-01-01; Epub Date: 2025-05-17; DOI: 10.1007/s11222-025-10624-8
Sehwan Kim, Faming Liang

Individual treatment effect estimation has gained significant attention in recent data science literature. This work introduces the Double Neural Network (Double-NN) method to address this problem within the framework of extended fiducial inference (EFI). In the proposed method, deep neural networks are used to model the treatment and control effect functions, while an additional neural network is employed to estimate their parameters. The universal approximation capability of deep neural networks ensures the broad applicability of this method. Numerical results highlight the superior performance of the proposed Double-NN method compared to the conformal quantile regression (CQR) method in individual treatment effect estimation. From the perspective of statistical inference, this work advances the theory and methodology for statistical inference of large models. Specifically, it is theoretically proven that the proposed method permits the model size to increase with the sample size $n$ at a rate of $O(n^{\zeta})$ for some $0 \le \zeta < 1$, while still maintaining proper quantification of uncertainty in the model parameters. This result marks a significant improvement compared to the range $0 \le \zeta < \frac{1}{2}$ required by the classical central limit theorem. Furthermore, this work provides a rigorous framework for quantifying the uncertainty of deep neural networks under the neural scaling law, representing a substantial contribution to the statistical understanding of large-scale neural network models.

Supplementary information: The online version contains supplementary material available at 10.1007/s11222-025-10624-8.

Statistics and Computing 35(4), 97 (2025). Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12085359/pdf/
Citations: 0
Bayesian shared parameter joint models for heterogeneous populations.
IF 1.6 Zone 2 Mathematics Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2025-01-01 Epub Date: 2025-06-12 DOI: 10.1007/s11222-025-10647-1
Sida Chen, Danilo Alvares, Marco Palma, Jessica K Barrett

Joint models (JMs) for longitudinal and time-to-event data are an important class of biostatistical models in health and medical research. When the study population consists of heterogeneous subgroups, standard JMs may be inadequate, leading to misleading results or loss of information. Joint latent class models (JLCMs) and their variants have been proposed to incorporate latent class structures into JMs. JLCMs are useful for identifying latent subgroups, uncovering deeper insights into relationships between the outcomes, and improving prediction performance. We consider the problem of Bayesian inference for the generic form of JLCMs, which poses significant computational challenges due to the complex nature of the posterior distribution. We propose a new Bayesian inference framework to tackle these challenges. Our approach leverages state-of-the-art Markov chain Monte Carlo techniques and parallel computing for parameter estimation and model selection regarding the number of latent classes. Through a simulation study, we demonstrate the feasibility and superiority of our proposed method over the existing approach. Additionally, we provide practical guidance on model and prior specification, which has received little attention, to facilitate the implementation of such complex models. We illustrate our method using data from the PAQUID prospective cohort study, where the outcomes of interest include a longitudinal measurement of cognitive performance and time to dementia diagnosis. Our analysis provides deeper insights into the latent class characteristics underlying the study population.

Supplementary information: The online version contains supplementary material available at 10.1007/s11222-025-10647-1.

Statistics and Computing 35(5), 125 (2025). Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12162714/pdf/
Citations: 0