
Latest publications in Computational Statistics & Data Analysis

Bootstrap-based goodness-of-fit test for parametric families of conditional distributions
IF 1.6 | CAS Tier 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-09-27 | DOI: 10.1016/j.csda.2025.108289
Gitte Kremling, Gerhard Dikta
A consistent goodness-of-fit test for distributional regression is introduced. The test statistic is based on a process that traces the difference between a nonparametric and a semi-parametric estimate of the marginal distribution function of Y. As its asymptotic null distribution is not distribution-free, a parametric bootstrap method is used to determine critical values. Empirical results suggest that, in certain scenarios, the test outperforms existing specification tests by achieving a higher power and thereby offering greater sensitivity to deviations from the assumed parametric distribution family. Notably, the proposed test does not involve any hyperparameters and can easily be applied to individual datasets using the gofreg-package in R.
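The abstract points to the gofreg package in R for ready-made use. As an illustration of the parametric-bootstrap idea only, not the authors' exact test statistic and not the gofreg API, the Python sketch below assumes a normal linear model as the hypothesized conditional family, uses the sup-distance between the empirical marginal CDF of Y and the model-implied marginal CDF averaged over the observed covariates as a stand-in statistic, and calibrates it by refitting on samples drawn from the fitted null; all function names and the choice of null family are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def fit_null(x, y):
    # Stand-in null family: Y | X = x ~ N(a + b*x, sigma^2).
    b, a = np.polyfit(x, y, 1)
    sigma = np.std(y - (a + b * x), ddof=2)
    return a, b, sigma

def cdf_distance(x, y, params):
    # Sup-distance between the empirical CDF of Y and the semi-parametric
    # estimate that averages the fitted conditional CDFs over the observed x.
    a, b, sigma = params
    grid = np.sort(y)
    emp = np.arange(1, len(y) + 1) / len(y)
    model = stats.norm.cdf((grid[:, None] - (a + b * x)[None, :]) / sigma).mean(axis=1)
    return np.max(np.abs(emp - model))

def bootstrap_gof(x, y, n_boot=500, seed=0):
    rng = np.random.default_rng(seed)
    a, b, sigma = fit_null(x, y)
    t_obs = cdf_distance(x, y, (a, b, sigma))
    t_boot = np.empty(n_boot)
    for i in range(n_boot):
        y_star = a + b * x + sigma * rng.standard_normal(len(x))  # draw Y from the fitted null
        t_boot[i] = cdf_distance(x, y_star, fit_null(x, y_star))  # refit on each bootstrap sample
    return t_obs, (1 + np.sum(t_boot >= t_obs)) / (1 + n_boot)    # statistic and bootstrap p-value

# Data generated from the null family should yield a large p-value.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
y = 1.0 + 2.0 * x + 0.5 * rng.standard_normal(200)
print(bootstrap_gof(x, y))
```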
Citations: 0
Resampling NANCOVA: Nonparametric analysis of covariance in small samples
IF 1.6 | CAS Tier 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-09-26 | DOI: 10.1016/j.csda.2025.108290
Konstantin Emil Thiel, Paavo Sattler, Arne C. Bathke, Georg Zimmermann
Analysis of covariance is a crucial method for improving the precision of statistical tests for factor effects in randomized experiments. However, existing solutions suffer from one or more of the following limitations: (i) they are not suitable for ordinal data (as endpoints or explanatory variables); (ii) they require semiparametric model assumptions; (iii) they are inapplicable to small data scenarios due to often poor type-I error control; or (iv) they provide only approximate testing procedures, and (asymptotically) exact tests are missing. A resampling approach to the NANCOVA framework is investigated. NANCOVA is a fully nonparametric model based on relative effects that allows for an arbitrary number of covariates and groups, where both the outcome variable (endpoint) and the covariates can be metric or ordinal. Novel NANCOVA tests and a nonparametric competitor test without covariate adjustment were evaluated in extensive simulations. Unlike approximate tests in the NANCOVA framework, the proposed resampling version showed good performance in small-sample scenarios and maintained the nominal type-I error well. Resampling NANCOVA also provided consistently high power: up to 26% higher than the test without covariate adjustment in a small-sample scenario with 4 groups and two covariates. Moreover, it is shown that resampling NANCOVA provides an asymptotically exact testing procedure, making it the first one with good finite-sample performance in the present NANCOVA framework. In summary, resampling NANCOVA can be considered a viable tool for analysis of covariance that overcomes issues (i)-(iv).
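For readers unfamiliar with relative effects, the quantity at the core of the NANCOVA framework, the short Python sketch below estimates them via mid-ranks of the pooled sample, a standard device that works for metric or ordinal endpoints. This is only the building block, not the covariate-adjusted NANCOVA tests or the resampling procedure, and the function name is a placeholder.

```python
import numpy as np
from scipy.stats import rankdata

def relative_effects(groups):
    """Estimate relative effects p_i = P(Z < X_i) + 0.5*P(Z = X_i), with Z drawn
    from the pooled distribution, via mid-ranks of the pooled sample."""
    pooled = np.concatenate(groups)
    ranks = rankdata(pooled, method="average")       # mid-ranks handle ties (ordinal data)
    n_total = len(pooled)
    effects, start = [], 0
    for g in groups:
        r = ranks[start:start + len(g)]
        effects.append((r.mean() - 0.5) / n_total)
        start += len(g)
    return np.array(effects)

# Example with three groups of ordinal scores; values near 0.5 indicate no shift.
rng = np.random.default_rng(0)
g1 = rng.integers(1, 6, 30)                          # ordinal endpoint, 5 categories
g2 = rng.integers(2, 7, 25)                          # stochastically larger group
g3 = rng.integers(1, 6, 20)
print(relative_effects([g1, g2, g3]))
```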
Citations: 0
Change-point detection in regression models via the max-EM algorithm
IF 1.6 | CAS Tier 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-09-24 | DOI: 10.1016/j.csda.2025.108278
Modibo Diabaté, Grégory Nuel, Olivier Bouaziz
The problem of breakpoint detection is considered within a regression modeling framework. A novel method, the max-EM algorithm, is introduced, combining a constrained Hidden Markov Model with the Classification-EM algorithm. This algorithm has linear complexity and provides accurate detection of breakpoints and estimation of parameters. A theoretical result is derived, showing that the likelihood of the data, as a function of the regression parameters and the breakpoint locations, increases at each step of the algorithm. Two initialization methods for the breakpoint locations are also presented to address local maxima issues. Finally, a statistical test for the one-breakpoint situation is developed. Simulation experiments based on linear, logistic, Poisson and Accelerated Failure Time regression models show that the final method, which includes the initialization procedure and the max-EM algorithm, performs strongly both in terms of parameter estimation and breakpoint detection. The statistical test is also evaluated and exhibits a correct rejection rate under the null hypothesis and strong power under various alternatives. Two real datasets are analyzed, the UCI bike sharing data and the health disease data, illustrating the method's ability to detect heterogeneity in the distribution of the data.
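The max-EM algorithm itself relies on a constrained Hidden Markov Model and a Classification-EM update and is not reproduced here. As a point of reference only, the sketch below shows the brute-force least-squares alternative for the single-breakpoint linear-regression case that such algorithms are designed to improve upon (quadratic rather than linear cost, one breakpoint only); names and the simulated example are illustrative.

```python
import numpy as np

def one_breakpoint_ls(x, y, min_seg=10):
    """Brute-force single-breakpoint search: sort by the covariate, fit separate
    least-squares lines on each side of every admissible split, and keep the
    split with the smallest total residual sum of squares."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_rss, best_bp = np.inf, None
    for k in range(min_seg, len(x) - min_seg):
        rss = 0.0
        for xs, ys in ((x[:k], y[:k]), (x[k:], y[k:])):
            beta = np.polyfit(xs, ys, 1)
            rss += np.sum((ys - np.polyval(beta, xs)) ** 2)
        if rss < best_rss:
            best_rss, best_bp = rss, x[k]
    return best_bp, best_rss

# Example: the regression slope changes at x = 0.6.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 300)
y = np.where(x < 0.6, 1 + 2 * x, 2.2 - (x - 0.6)) + 0.2 * rng.standard_normal(300)
print(one_breakpoint_ls(x, y))
```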
Citations: 0
Fast and efficient causal inference in large-scale data via subsampling and projection calibration
IF 1.6 | CAS Tier 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-09-22 | DOI: 10.1016/j.csda.2025.108281
Miaomiao Su
Estimating the average treatment effect in large-scale datasets faces significant computational and storage challenges. Subsampling has emerged as a critical strategy to mitigate these issues. This paper proposes a novel subsampling method that builds on the G-estimation method offering the double robustness property. The proposed method uses a small subset of data to estimate computationally complex nuisance parameters, while leveraging the full dataset for the computationally simple final estimation. To ensure that the resulting estimator remains first-order insensitive to variations in nuisance parameters, a projection approach is introduced to optimize the estimation of the outcome regression function and treatment regression function such that the Neyman orthogonality conditions are satisfied. It is shown that the resulting estimator is asymptotically normal and achieves the same convergence rate as the full data-based estimator when either the treatment or the outcome models is correctly specified. Additionally, when both models are correctly specified, the proposed estimator achieves the same asymptotic variance as the full data-based estimator. The finite sample performance of the proposed method is demonstrated through simulation studies and an application to birth data, comprising over 30 million observations collected over the past eight years. Numerical results indicate that the proposed estimator is nearly as computationally efficient as the uniform subsampling estimator, while achieving similar estimation efficiency to the full data-based G-estimator.
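The sketch below is not the paper's G-estimation with projection calibration; it is a simpler doubly robust (AIPW) stand-in that mirrors only the division of labor described above: nuisance models (propensity and outcome regressions) are fit on a small random subsample, while the cheap final averaging uses the full data. The model classes, subsample fraction, and simulated data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def dr_ate_subsample(X, A, Y, sub_frac=0.05, seed=0):
    """Doubly robust (AIPW) ATE estimate with nuisance models fit on a subsample
    and the computationally simple final averaging done on the full data."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    idx = rng.choice(n, size=max(int(sub_frac * n), 200), replace=False)

    ps = LogisticRegression(max_iter=1000).fit(X[idx], A[idx])              # propensity model
    m1 = LinearRegression().fit(X[idx][A[idx] == 1], Y[idx][A[idx] == 1])   # outcome model, treated
    m0 = LinearRegression().fit(X[idx][A[idx] == 0], Y[idx][A[idx] == 0])   # outcome model, control

    e = np.clip(ps.predict_proba(X)[:, 1], 1e-3, 1 - 1e-3)
    mu1, mu0 = m1.predict(X), m0.predict(X)
    psi = mu1 - mu0 + A * (Y - mu1) / e - (1 - A) * (Y - mu0) / (1 - e)
    return psi.mean(), psi.std(ddof=1) / np.sqrt(n)     # estimate and its standard error

# Example with a known treatment effect of 2.
rng = np.random.default_rng(1)
n = 100_000
X = rng.standard_normal((n, 3))
A = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = 2 * A + X @ np.array([1.0, -0.5, 0.25]) + rng.standard_normal(n)
print(dr_ate_subsample(X, A, Y))
```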
Citations: 0
Gamma approximation of stratified truncated exact test (GASTE-test) & application
IF 1.6 | CAS Tier 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-09-21 | DOI: 10.1016/j.csda.2025.108277
Alexandre Wendling, Clovis Galiez
The analysis of binary outcomes and features, such as the effect of vaccination on health, often relies on 2 × 2 contingency tables. However, confounding factors such as age or gender call for stratified analysis, by creating sub-tables, which is common in bioscience, epidemiological, and social research, as well as in meta-analyses. Traditional methods for testing associations across strata, such as the Cochran-Mantel-Haenszel (CMH) test, struggle with small sample sizes and heterogeneity of effects between strata. Exact tests can address these issues, but are computationally expensive. To address these challenges, the Gamma Approximation of Stratified Truncated Exact (GASTE) test is proposed. It approximates the exact statistic of the combination of p-values with discrete support, leveraging the gamma distribution to approximate the distribution of the test statistic under stratification, providing fast and accurate p-value calculations even when effects vary between strata. The GASTE method maintains high statistical power and low type I error rates, outperforming traditional methods by offering more sensitive and reliable detection. It is computationally efficient and broadens the applicability of exact tests in research fields with stratified binary data. The GASTE method is demonstrated through two applications: an ecological study of Alpine plant associations and a 1973 case study on admissions at the University of California, Berkeley. The GASTE method offers substantial improvements over traditional approaches. It is available as an open-source package at https://github.com/AlexandreWen/gaste, and a Python package is available on PyPI at https://pypi.org/project/gaste-test/.
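The released gaste packages should be consulted for the actual implementation; without assuming anything about their API, the Python sketch below illustrates the core gamma-approximation idea under simplifying assumptions: one-sided Fisher (hypergeometric) exact p-values per stratum, whose discrete null supports give the exact mean and variance of the Fisher combination statistic T = -2 * sum(log p_k), and a gamma distribution matched to those two moments supplies the approximate combined p-value. Unlike the full GASTE test, this sketch has no truncation step, and the example tables are made up.

```python
import numpy as np
from scipy import stats

def stratum_support(table):
    """Exact null distribution of -2*log(p) for a one-sided Fisher test of a
    2x2 table, enumerated over the hypergeometric support."""
    (a, b), (c, d) = table
    M, n, N = a + b + c + d, a + b, a + c            # population size, successes, draws
    rv = stats.hypergeom(M, n, N)
    ks = np.arange(max(0, N - (M - n)), min(n, N) + 1)
    pvals = rv.sf(ks - 1)                            # one-sided p-value P(X >= k)
    return -2 * np.log(pvals), rv.pmf(ks), -2 * np.log(rv.sf(a - 1))

def gaste_like_pvalue(tables):
    """Gamma moment-matching approximation to the null distribution of the
    Fisher combination of stratum-wise exact (discrete) p-values."""
    mean = var = t_obs = 0.0
    for tab in tables:
        vals, probs, obs = stratum_support(np.asarray(tab))
        mu = np.sum(vals * probs)
        mean += mu
        var += np.sum((vals - mu) ** 2 * probs)
        t_obs += obs
    shape, scale = mean ** 2 / var, var / mean       # match the first two moments
    return stats.gamma.sf(t_obs, a=shape, scale=scale)

# Example: three strata with a consistent positive association.
tables = [[[8, 2], [3, 7]], [[6, 4], [2, 8]], [[9, 1], [5, 5]]]
print(gaste_like_pvalue(tables))
```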
Citations: 0
Generalized composite multi-sample tests for high-dimensional data
IF 1.6 | CAS Tier 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-09-21 | DOI: 10.1016/j.csda.2025.108279
Xiaoli Kong, Alejandro Villasante-Tezanos, David W. Fardo, Solomon W. Harrar
High-dimensional data is ubiquitous in studies involving omics, human movement, and imaging. A multivariate comparison method is proposed for such types of data when either the dimension or the replication size substantially exceeds the other. A testing procedure is introduced that centers and scales a composite measure of distance statistic among the samples to appropriately account for high dimensions and/or large sample sizes. The properties of the test statistic are examined both theoretically and empirically. The proposed procedure demonstrates superior performance in simulation studies and an application to confirm the involvement of previously identified genes in the stages of invasive breast cancer.
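The proposed test centers and scales a composite distance statistic to obtain an asymptotic reference distribution; the hedged sketch below replaces that analytic calibration with a permutation one and uses the sum of squared Euclidean distances between group mean vectors as a generic composite statistic, only to make the multi-sample, high-dimensional setting (dimension far above group sizes) concrete. It is not the authors' procedure.

```python
import numpy as np

def multisample_perm_test(samples, n_perm=999, seed=0):
    """Multi-sample comparison of high-dimensional mean vectors using the sum of
    squared pairwise distances between group means, calibrated by permutation."""
    rng = np.random.default_rng(seed)
    sizes = [len(s) for s in samples]
    pooled = np.vstack(samples)

    def statistic(data):
        means = [g.mean(axis=0) for g in np.split(data, np.cumsum(sizes)[:-1])]
        return sum(np.sum((m1 - m2) ** 2)
                   for i, m1 in enumerate(means) for m2 in means[i + 1:])

    t_obs = statistic(pooled)
    count = sum(statistic(rng.permutation(pooled)) >= t_obs for _ in range(n_perm))
    return t_obs, (count + 1) / (n_perm + 1)

# Example: dimension 500 far exceeds the group sizes (20, 25, 15).
rng = np.random.default_rng(1)
p = 500
groups = [rng.standard_normal((20, p)),
          rng.standard_normal((25, p)) + 0.15,     # shifted group
          rng.standard_normal((15, p))]
print(multisample_perm_test(groups))
```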
Citations: 0
Recursive nonparametric predictive for a discrete regression model
IF 1.6 | CAS Tier 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-09-16 | DOI: 10.1016/j.csda.2025.108275
Lorenzo Cappello, Stephen G. Walker
A recursive algorithm is proposed to estimate a set of distribution functions indexed by a regressor variable. The procedure is fully nonparametric and has a Bayesian motivation and interpretation. Indeed, the recursive algorithm follows a certain Bayesian update, defined by the predictive distribution of a Dirichlet process mixture of linear regression models. Consistency of the algorithm is demonstrated under mild assumptions, and numerical accuracy in finite samples is shown via simulations and real data examples. The algorithm is very fast to implement, it is parallelizable, sequential, and requires limited computing power.
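The actual algorithm is defined through the predictive distribution of a Dirichlet process mixture of linear regressions, which the following sketch does not implement. It is only a one-pass, recursion-style illustration of estimating a family of conditional distribution functions: each incoming pair (x_i, y_i) nudges a grid of conditional CDF estimates toward the observed indicator, with a decaying weight and a Gaussian kernel localizing the update in x. Bandwidth, weight sequence, and grids are arbitrary choices made here for illustration.

```python
import numpy as np

def recursive_conditional_cdf(x, y, x_grid, y_grid, bandwidth=0.2):
    """One-pass recursive estimate of F(y | x) on a grid: each observation
    updates the current estimate with a decaying weight, localized in x."""
    F = np.full((len(x_grid), len(y_grid)), 0.5)              # flat initial guess
    for i, (xi, yi) in enumerate(zip(x, y), start=1):
        w = (i + 1.0) ** -0.67                                 # decaying learning weight
        k = np.exp(-0.5 * ((x_grid - xi) / bandwidth) ** 2)    # localization in x
        step = (y_grid >= yi).astype(float)                    # indicator of "y_i <= grid point"
        F += (w * k)[:, None] * (step[None, :] - F)
    return np.clip(F, 0.0, 1.0)

# Example: Y | X=x ~ N(2x, 0.5^2); check the estimated median at x = 0.5 (truth: 1.0).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 5000)
y = 2 * x + 0.5 * rng.standard_normal(5000)
x_grid, y_grid = np.linspace(0, 1, 21), np.linspace(-1, 3, 201)
F = recursive_conditional_cdf(x, y, x_grid, y_grid)
print(y_grid[np.argmin(np.abs(F[10] - 0.5))])                  # x_grid[10] = 0.5
```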
Citations: 0
An algorithm for estimating threshold boundary regression models
IF 1.6 | CAS Tier 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-09-15 | DOI: 10.1016/j.csda.2025.108274
Chih-Hao Chang, Takeshi Emura, Shih-Feng Huang
This paper presents an innovative iterative two-stage algorithm designed for estimating threshold boundary regression (TBR) models. By transforming the non-differentiable least-squares (LS) problem inherent in fitting TBR models into an optimization framework, our algorithm combines the optimization of a weighted classification error function for the threshold model with obtaining LS estimators for regression models. To improve the efficiency and flexibility of TBR model estimation, we integrate the weighted support vector machine (WSVM) as a surrogate method for solving the weighted classification problem. The TBR-WSVM algorithm offers several key advantages over recently developed methods: it eliminates pre-specification requirements for threshold parameters, accommodates flexible estimation of nonlinear threshold boundaries, and streamlines the estimation process. We conducted several simulation studies to illustrate the finite-sample performance of TBR-WSVM. Finally, we demonstrate the practical applicability of the TBR model through a real data analysis.
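A minimal Python sketch of the two-stage iteration described above follows, with several simplifications and assumptions: two regimes only, a residual-sign initialization, ordinary least squares within regimes, and scikit-learn's SVC with per-observation sample weights standing in for the authors' weighted SVM. The weighting scheme (the gap in squared error between the two regime fits) is a plausible misclassification cost chosen for illustration, not the paper's choice.

```python
import numpy as np
from sklearn.svm import SVC

def fit_tbr_like(X, y, n_iter=20):
    """Alternate between regime-wise least squares and a weighted SVM that
    re-estimates a (possibly nonlinear) threshold boundary in covariate space."""
    Xd = np.column_stack([np.ones(len(y)), X])                 # design matrix with intercept
    resid = y - Xd @ np.linalg.lstsq(Xd, y, rcond=None)[0]
    z = (resid > 0).astype(int)                                # crude initialization: residual signs
    clf = None
    for _ in range(n_iter):
        if min(np.sum(z == 0), np.sum(z == 1)) < 10:           # guard against a collapsed regime
            break
        betas = [np.linalg.lstsq(Xd[z == r], y[z == r], rcond=None)[0] for r in (0, 1)]
        sq_err = np.column_stack([(y - Xd @ b) ** 2 for b in betas])
        pref = sq_err.argmin(axis=1)                           # regime each point fits best
        if len(np.unique(pref)) < 2:
            break
        weight = np.abs(sq_err[:, 0] - sq_err[:, 1]) + 1e-8    # cost of assigning the other regime
        clf = SVC(kernel="rbf", C=10.0).fit(X, pref, sample_weight=weight)
        z_new = clf.predict(X)
        if np.array_equal(z_new, z):
            break
        z = z_new
    return clf, z

# Example: regimes separated by the nonlinear boundary x1^2 + x2 > 1.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(600, 2))
regime = (X[:, 0] ** 2 + X[:, 1] > 1).astype(int)
y = np.where(regime == 1, 1 + 2 * X[:, 0], -1 - X[:, 1]) + 0.3 * rng.standard_normal(600)
clf, z_hat = fit_tbr_like(X, y)
acc = np.mean(z_hat == regime)
print(max(acc, 1 - acc))                                       # regime labels are defined only up to swapping
```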
Citations: 0
Rate accelerated inference for integrals of multivariate random functions
IF 1.6 | CAS Tier 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-09-11 | DOI: 10.1016/j.csda.2025.108273
Valentin Patilea, Sunny G․ W․ Wang
The computation of integrals is a fundamental task in the analysis of functional data, where the data are typically considered as random elements in a space of squared integrable functions. Effective unbiased estimation and inference procedures are proposed for integrals of uni- and multivariate random functions. Applications to key problems in functional data analysis involving random design points are examined and illustrated. In the absence of noise, the proposed estimates converge faster than the sample mean and standard numerical integration algorithms. The estimator also supports effective inference by generally providing better coverage with shorter confidence and prediction intervals in both noisy and noiseless settings.
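As context rather than the paper's estimator, the toy sketch below sets up the task for a single noiseless curve observed at random design points and computes the two simple baselines the abstract refers to: the sample mean of the observed values and trapezoidal numerical integration over the sorted design points. The curve family is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def observe_curve(n_points):
    """One random curve X(t) = A*sin(2*pi*t) + B*t sampled without noise at
    random design points in [0, 1]; its integral over [0, 1] equals B/2."""
    a, b = rng.standard_normal(2)
    t = np.sort(rng.uniform(0, 1, n_points))
    return t, a * np.sin(2 * np.pi * t) + b * t, b / 2

t, x, truth = observe_curve(200)
mean_est = x.mean()                                       # baseline 1: sample mean of observed values
trap_est = np.sum(0.5 * (x[1:] + x[:-1]) * np.diff(t))    # baseline 2: trapezoid rule on design points
print(truth, mean_est, trap_est)
```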
Citations: 0
Robust selection of the number of change-points via FDR control
IF 1.6 | CAS Tier 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-09-10 | DOI: 10.1016/j.csda.2025.108272
Hui Chen, Chengde Qian, Qin Zhou
Robust quantification of uncertainty regarding the number of change-points presents a significant challenge in data analysis, particularly when employing false discovery rate (FDR) control techniques. Emphasizing the detection of genuine signals while controlling false positives is crucial, especially for identifying shifts in location parameters within flexible distributions. Traditional parametric methods often exhibit sensitivity to outliers and heavy-tailed data. Addressing this limitation, a robust method accommodating diverse data structures is proposed. The approach constructs component-wise sign-based statistics. Leveraging the global symmetry inherent in these statistics enables the derivation of data-driven thresholds suitable for multiple testing scenarios. Method development occurs within the framework of U-statistics, which naturally encompasses existing cumulative sum-based procedures. Theoretical guarantees establish FDR control for the component-wise sign-based method under mild assumptions. Demonstrations of effectiveness utilize simulations with synthetic data and analyses of real data.
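As a rough, heavily simplified illustration of the two ingredients named above (an outlier-robust sign-based statistic and FDR control across candidates), and not the paper's component-wise U-statistic procedure, the sketch below scans a coarse grid of candidate split points in a univariate sequence, computes an approximate sign-test p-value at each split (treating the median of the earlier segment as fixed), and applies the Benjamini-Hochberg step-up rule across candidates; dependence between nearby candidates is ignored here.

```python
import numpy as np
from scipy import stats

def candidate_changepoint_fdr(x, n_candidates=20, alpha=0.1):
    """Sign-based statistics at candidate splits with BH control across candidates:
    count how many later observations exceed the median of the earlier segment
    (approximately Binomial(m, 1/2) when there is no change)."""
    n = len(x)
    splits = np.linspace(n // 10, n - n // 10, n_candidates).astype(int)
    pvals = []
    for k in splits:
        m = n - k
        exceed = int(np.sum(x[k:] > np.median(x[:k])))
        pvals.append(stats.binomtest(exceed, m, 0.5).pvalue)   # two-sided sign test
    pvals = np.array(pvals)
    order = np.argsort(pvals)
    thresh = alpha * np.arange(1, len(pvals) + 1) / len(pvals)
    passed = pvals[order] <= thresh
    n_rej = passed.nonzero()[0].max() + 1 if passed.any() else 0   # BH step-up rule
    return splits[order[:n_rej]], pvals

# Example: one location shift at t = 150 plus a few gross outliers.
rng = np.random.default_rng(0)
x = np.concatenate([rng.standard_normal(150), rng.standard_normal(150) + 1.5])
x[[20, 200]] += 15                                                 # outliers
detected, _ = candidate_changepoint_fdr(x)
print(np.sort(detected))
```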
Citations: 0