
Statistics and Computing: Latest Articles

Efficient Likelihood-Based Temporal Changepoint Detection in Spatio-Temporal Processes.
IF 1.6 | Zone 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-01-01 | Epub Date: 2025-10-17 | DOI: 10.1007/s11222-025-10745-0
Gaurav Agarwal, Idris A Eckley, Paul Fearnhead

The rapid advancements of scalable methodologies have opened new avenues for analyzing complex spatio-temporal data, which is crucial in understanding dynamic environmental phenomena. This paper introduces a likelihood-based methodology for detecting abrupt changes in time in spatio-temporal processes, a field where traditional time series methods fall short. Unlike recent approaches, we do not make the unrealistic assumption that data is independent across changepoints. Instead, we use a recently proposed family of covariance models that allows nonstationarity in time, and we propose a Markov approximation to reduce the computational burden of calculating likelihoods under this model. We apply our method to two years of daily wind speed data from various synoptic weather stations in Ireland, identifying a significant changepoint on July 24, 2021, which aligns with a major shift in weather patterns. This application not only demonstrates the method's utility in handling spatio-temporal datasets but also showcases its potential in broader environmental and climatic studies, offering a scalable solution for analyzing changing patterns in spatial data over time.
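
As a much-simplified illustration of what a likelihood-based changepoint scan looks like, the sketch below searches for a single change in mean in a univariate Gaussian series. It is not the authors' method: it assumes independent observations with a common variance purely for brevity (the paper specifically avoids the independence assumption and works with spatio-temporal covariance models), and the function name `best_changepoint` and the `min_seg` parameter are hypothetical.

```python
import numpy as np

def best_changepoint(x, min_seg=5):
    """Return (tau, gain) for the best single split x[:tau] | x[tau:].

    gain is the Gaussian log-likelihood ratio of "one mean change at tau"
    against "no change", with a common variance profiled out.
    Assumes independent observations (a simplification for illustration).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    rss0 = np.sum((x - x.mean()) ** 2)  # residual sum of squares, no change
    best_tau, best_gain = None, -np.inf
    for tau in range(min_seg, n - min_seg + 1):
        left, right = x[:tau], x[tau:]
        rss1 = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        # Profiled Gaussian log-likelihood is -n/2 * log(rss / n) + const.
        gain = 0.5 * n * (np.log(rss0) - np.log(rss1))
        if gain > best_gain:
            best_tau, best_gain = tau, gain
    return best_tau, best_gain

# Usage: a simulated series whose mean shifts after 100 observations.
rng = np.random.default_rng(0)
series = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])
tau_hat, gain_hat = best_changepoint(series)
print(tau_hat, gain_hat)
```

In practice the gain would be compared against a penalty or a calibrated threshold before declaring a change; that step is omitted here.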

Citations: 0
Using prior-data conflict to tune Bayesian regularized regression models.
IF 1.6 | Zone 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-01-01 | Epub Date: 2025-02-20 | DOI: 10.1007/s11222-025-10582-1
Timofei Biziaev, Karen Kopciuk, Thierry Chekouo

In high-dimensional regression models, variable selection becomes challenging from both a computational and a theoretical perspective. Bayesian regularized regression with shrinkage priors, such as the Laplace or spike-and-slab prior, is an effective method for variable selection in p > n scenarios, provided the shrinkage priors are configured adequately. We propose an empirical Bayes configuration using checks for prior-data conflict: tests that assess whether the prior and the data disagree in the parameter information they provide. We apply our proposed method to the Bayesian LASSO and spike-and-slab shrinkage priors in the linear regression model and assess the variable selection performance of our prior configurations through a high-dimensional simulation study. Additionally, we apply our method to proteomic data collected from patients admitted to the Albany Medical Center in Albany, NY, in April 2020 with COVID-like respiratory issues. Simulation results suggest our proposed configurations may outperform competing models when the true regression effects are small.

Supplementary information: The online version contains supplementary material available at 10.1007/s11222-025-10582-1.

Citations: 0
A new p-value based multiple testing procedure for generalized linear models.
IF 1.6 | Zone 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-01-01 | Epub Date: 2025-03-16 | DOI: 10.1007/s11222-025-10600-2
Joseph Rilling, Cheng Yong Tang

This study introduces a novel p-value-based multiple testing approach tailored for generalized linear models. Despite the crucial role of generalized linear models in statistics, existing methodologies face obstacles arising from the heterogeneous variance of response variables and complex dependencies among estimated parameters. Our aim is to address the challenge of controlling the false discovery rate (FDR) amidst arbitrarily dependent test statistics. Through the development of efficient computational algorithms, we present a versatile statistical framework for multiple testing. The proposed framework accommodates a range of tools developed for constructing a new model matrix in regression-type analysis, including random row permutations and Model-X knockoffs. We devise efficient computing techniques to solve the encountered non-trivial quadratic matrix equations, enabling the construction of paired p-values suitable for the two-step multiple testing procedure proposed by Sarkar and Tang (Biometrika 109(4): 1149-1155, 2022). Theoretical analysis affirms the properties of our approach, demonstrating its capability to control the FDR at a given level. Empirical evaluations further substantiate its promising performance across diverse simulation settings.

Supplementary information: The online version contains supplementary material available at 10.1007/s11222-025-10600-2.

Citations: 0
Estimation and model selection for finite mixtures of Tukey's g-&-h distributions.
IF 1.6 | Zone 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-01-01 | Epub Date: 2025-03-15 | DOI: 10.1007/s11222-025-10596-9
Tingting Zhan, Misung Yi, Amy R Peck, Hallgeir Rui, Inna Chervoneva

A finite mixture of distributions is a popular statistical model, which is especially meaningful when the population of interest may include distinct subpopulations. This work is motivated by analysis of protein expression levels quantified using immunofluorescence immunohistochemistry assays of human tissues. The distributions of cellular protein expression levels in a tissue often exhibit multimodality, skewness and heavy tails, but there is substantial variability between distributions in different tissues from different subjects, while some of these mixture distributions include components consistent with the assumption of a normal distribution. To accommodate such diversity, we propose a mixture of 4-parameter Tukey's g-&-h distributions for fitting finite mixtures with both Gaussian and non-Gaussian components. Tukey's g-&-h distribution is a flexible model that allows a variable degree of skewness and kurtosis in mixture components, including the normal distribution as a particular case. Since the likelihood of the Tukey's g-&-h mixtures does not have a closed analytical form, we propose a quantile least Mahalanobis distance (QLMD) estimator for parameters of such mixtures. QLMD is an indirect estimator minimizing the Mahalanobis distance between the sample and model-based quantiles, and its asymptotic properties follow from the general theory of indirect estimation. We have developed a stepwise algorithm to select a parsimonious Tukey's g-&-h mixture model and implemented all proposed methods in the R package QuantileGH available on CRAN. A simulation study was conducted to evaluate the performance of the Tukey's g-&-h mixtures and compare it to the performance of mixtures of skew-normal or skew-t distributions. The Tukey's g-&-h mixtures were applied to model cellular expressions of Cyclin D1 protein in breast cancer tissues, and the resulting parameter estimates were evaluated as predictors of progression-free survival.
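
To make the 4-parameter family concrete, the sketch below evaluates the quantile function of a single Tukey g-&-h component using the standard textbook form of the transformation, X = A + B * ((exp(g*Z) - 1) / g) * exp(h * Z^2 / 2) with Z standard normal. It is not code from the QuantileGH package: the function name `tukey_gh_quantile` and its argument names are assumptions made for this illustration, and h >= 0 is assumed so that the transformation is monotone and quantiles map directly.

```python
import math
from statistics import NormalDist

def tukey_gh_quantile(p, A=0.0, B=1.0, g=0.0, h=0.0):
    """Quantile of a Tukey g-&-h variable at probability p (assumes h >= 0).

    A: location, B: scale, g: skewness, h: tail heaviness.
    With g = 0 and h = 0 this reduces to the normal quantile A + B * z.
    """
    z = NormalDist().inv_cdf(p)  # standard normal quantile
    # Skewing factor; the g -> 0 limit of (exp(g*z) - 1) / g is z.
    skewed = z if g == 0 else math.expm1(g * z) / g
    return A + B * skewed * math.exp(h * z * z / 2.0)

# Usage: the same probabilities under a normal and a right-skewed component.
for p in (0.025, 0.5, 0.975):
    print(p, tukey_gh_quantile(p), tukey_gh_quantile(p, g=0.5, h=0.1))
```

Because the model quantiles are available in closed form even though the density is not, a quantile-matching estimator of the kind the abstract describes is a natural fit for this family.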

Citations: 0
Bayesian additive tree ensembles for composite quantile regressions.
IF 1.6 | Zone 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-01-01 | Epub Date: 2025-08-26 | DOI: 10.1007/s11222-025-10711-w
Yaeji Lim, Ruijin Lu, Madeleine St Ville, Zhen Chen

In this paper, we introduce a novel approach that integrates Bayesian additive regression trees (BART) with the composite quantile regression (CQR) framework, creating a robust method for modeling complex relationships between predictors and outcomes under various error distributions. Unlike traditional quantile regression, which focuses on specific quantile levels, our proposed method, composite quantile BART, offers greater flexibility in capturing the entire conditional distribution of the response variable. By leveraging the strengths of BART and CQR, the proposed method provides enhanced predictive performance, especially in the presence of heavy-tailed errors and non-linear covariate effects. Numerical studies confirm that the proposed composite quantile BART method generally outperforms classical BART, quantile BART, and composite quantile linear regression models in terms of RMSE, especially under heavy-tailed or contaminated error distributions. Notably, under contaminated normal errors, it reduces RMSE by approximately 17% compared to composite quantile regression, and by 27% compared to classical BART.
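
For readers unfamiliar with composite quantile regression, the sketch below computes the composite check loss that the CQR framework is built on: one shared fitted function, one intercept per quantile level, and the quantile check losses summed over levels. It only illustrates the loss; it does not implement BART or the authors' Bayesian sampler, and the function names `check_loss` and `composite_quantile_loss` are assumptions made for this example.

```python
import numpy as np

def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0}), elementwise."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0))

def composite_quantile_loss(y, fitted, intercepts, taus):
    """Sum of check losses over several quantile levels.

    fitted holds the shared regression function evaluated at each observation;
    intercepts[k] is the level-specific offset for quantile level taus[k].
    """
    y = np.asarray(y, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    total = 0.0
    for b, tau in zip(intercepts, taus):
        total += check_loss(y - fitted - b, tau).sum()
    return float(total)

# Usage: three quantile levels sharing one fitted function.
y = np.array([1.0, 2.5, 0.3, 4.0])
fitted = np.array([1.2, 2.0, 0.5, 3.0])
print(composite_quantile_loss(y, fitted, intercepts=[-0.5, 0.0, 0.5], taus=[0.25, 0.5, 0.75]))
```

Pooling several quantile levels in one objective is what gives the composite approach its robustness to heavy-tailed errors relative to a squared-error loss.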

Citations: 0
funBIalign: a hierachical algorithm for functional motif discovery based on mean squared residue scores.
IF 1.6 | Zone 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-01-01 | Epub Date: 2024-12-10 | DOI: 10.1007/s11222-024-10537-y
Jacopo Di Iorio, Marzia A Cremona, Francesca Chiaromonte

Motif discovery is gaining increasing attention in the domain of functional data analysis. Functional motifs are typical "shapes" or "patterns" that recur multiple times in different portions of a single curve and/or in misaligned portions of multiple curves. In this paper, we define functional motifs using an additive model and we propose funBIalign for their discovery and evaluation. Inspired by clustering and biclustering techniques, funBIalign is a multi-step procedure which uses agglomerative hierarchical clustering with complete linkage and a functional distance based on mean squared residue scores to discover functional motifs, both in a single curve (e.g., time series) and in a set of curves. We assess its performance and compare it to other recent methods through extensive simulations. Moreover, we use funBIalign for discovering motifs in two real-data case studies: one on food price inflation and one on temperature changes.
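
The mean squared residue score mentioned above comes from the biclustering literature, and a discretized version is easy to state. The sketch below computes it for a matrix whose rows are curve portions evaluated on a common grid; treating the curves as grid evaluations is an assumption made here for illustration, and the function name `mean_squared_residue` is hypothetical rather than taken from the funBIalign software. Under an additive model (common shape plus a portion-specific vertical shift) the score is zero.

```python
import numpy as np

def mean_squared_residue(a):
    """Mean squared residue of a 2-D array (rows: curve portions, cols: grid points).

    Each entry's residue is a_ij - rowmean_i - colmean_j + overallmean;
    the score is the average of the squared residues.
    """
    a = np.asarray(a, dtype=float)
    row_means = a.mean(axis=1, keepdims=True)
    col_means = a.mean(axis=0, keepdims=True)
    residue = a - row_means - col_means + a.mean()
    return float(np.mean(residue ** 2))

# Usage: three vertically shifted copies of one shape, then a noisy version.
grid = np.linspace(0.0, 1.0, 50)
shape = np.sin(2.0 * np.pi * grid)
shifted = np.vstack([shape + s for s in (0.0, 1.5, -2.0)])
rng = np.random.default_rng(1)
print(mean_squared_residue(shifted))
print(mean_squared_residue(shifted + rng.normal(0.0, 0.3, shifted.shape)))
```

A low score therefore signals that a group of portions shares a common shape up to vertical shifts, which is what makes it usable as a dissimilarity inside hierarchical clustering.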

Supplementary information: The online version contains supplementary material available at 10.1007/s11222-024-10537-y.

Citations: 0
Hidden Markov models for multivariate panel data
IF 2.2 | Zone 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-18 | DOI: 10.1007/s11222-024-10462-0
Mackenzie R. Neal, Alexa A. Sochaniwsky, Paul D. McNicholas

While advances continue to be made in model-based clustering, challenges persist in modeling various data types such as panel data. Multivariate panel data present difficulties for clustering algorithms because they are often plagued by missing data and dropouts, presenting issues for estimation algorithms. This research presents a family of hidden Markov models that compensate for the issues that arise in panel data. A modified expectation–maximization algorithm capable of handling missing-not-at-random data and dropout is presented and used to perform model estimation.

Citations: 0
Accelerated failure time models with error-prone response and nonlinear covariates
IF 2.2 | Zone 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-18 | DOI: 10.1007/s11222-024-10491-9
Li-Pang Chen

As a specific application of survival analysis, a main interest in medical studies is analyzing the survival time of patients with a specific cancer. Typically, gene expressions are treated as covariates to characterize the survival time. In the framework of survival analysis, the parametric accelerated failure time model is a common approach. However, the effects of gene expressions are possibly nonlinear, and both the survival time and the censoring status are subject to measurement error. In this paper, we aim to tackle those complex features simultaneously. We first correct for measurement error in survival time and censoring status, and use the corrected quantities to develop a corrected Buckley–James estimator. After that, we use the boosting algorithm with the cubic spline estimation method to iteratively recover the nonlinear relationship between covariates and survival time. Theoretically, we justify the validity of the measurement error correction and the estimation procedure. Numerical studies show that the proposed method improves the performance of estimation and is able to capture informative covariates. The methodology is used to analyze the breast cancer data provided by the Netherlands Cancer Institute for research.

Citations: 0
Sequential model identification with reversible jump ensemble data assimilation method
IF 2.2 Zone 2 (Mathematics) Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-09-18 DOI: 10.1007/s11222-024-10499-1
Yue Huan, Hai Xiang Lin

In data assimilation (DA) schemes, the forms representing the processes in the evolution models are predetermined, except for some parameters to be estimated. In some applications, such as the contaminant solute transport model and the gas reservoir model, the modes in the equations within the evolution model cannot be predetermined from the outset and may change over time. We propose a sequential DA framework named Reversible Jump Ensemble Filter (RJEnF) to identify the governing modes of the evolution model over time. The main idea is to introduce the Reversible Jump Markov Chain Monte Carlo (RJMCMC) method into DA schemes to handle situations where the modes of the evolution model are unknown and the dimension of the parameters changes. Our framework allows us to identify the modes in the evolution model and their changes, as well as to estimate the parameters and states of the dynamic system. Numerical experiments show that our framework can effectively identify the underlying evolution models and increase the predictive accuracy of DA methods.

Citations: 0
Shrinkage for extreme partial least-squares
IF 2.2 Zone 2 (Mathematics) Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-09-17 DOI: 10.1007/s11222-024-10490-w
Julyan Arbel, Stéphane Girard, Hadrien Lorenzo

This work focuses on dimension-reduction techniques for modelling conditional extreme values. Specifically, we investigate the idea that extreme values of a response variable can be explained by nonlinear functions derived from linear projections of an input random vector. In this context, the estimation of projection directions is examined, as approached by the extreme partial least squares (EPLS) method, an adaptation of the original partial least squares (PLS) method tailored to the extreme-value framework. Further, a novel interpretation of EPLS directions as maximum likelihood estimators is introduced, utilizing the von Mises–Fisher distribution applied to hyperballs. The dimension reduction process is enhanced through the Bayesian paradigm, enabling the incorporation of prior information into the projection direction estimation. The maximum a posteriori estimator is derived in two specific cases, elucidating it as a regularization or shrinkage of the EPLS estimator. We also establish its asymptotic behavior as the sample size approaches infinity. A simulation study is conducted to assess the practical utility of the proposed method; it demonstrates the method's effectiveness even in moderate data problems within high-dimensional settings. Furthermore, we provide an illustrative example of the method’s applicability using French farm income data, highlighting its efficacy in real-world scenarios.

Citations: 0