Pub Date: 2025-09-27 | DOI: 10.1016/j.csda.2025.108289
Gitte Kremling, Gerhard Dikta
A consistent goodness-of-fit test for distributional regression is introduced. The test statistic is based on a process that traces the difference between a nonparametric and a semi-parametric estimate of the marginal distribution function of Y. As its asymptotic null distribution is not distribution-free, a parametric bootstrap method is used to determine critical values. Empirical results suggest that, in certain scenarios, the test outperforms existing specification tests by achieving higher power and thereby offering greater sensitivity to deviations from the assumed parametric distribution family. Notably, the proposed test does not involve any hyperparameters and can easily be applied to individual datasets using the gofreg package in R.
{"title":"Bootstrap-based goodness-of-fit test for parametric families of conditional distributions","authors":"Gitte Kremling, Gerhard Dikta","doi":"10.1016/j.csda.2025.108289","DOIUrl":"10.1016/j.csda.2025.108289","url":null,"abstract":"<div><div>A consistent goodness-of-fit test for distributional regression is introduced. The test statistic is based on a process that traces the difference between a nonparametric and a semi-parametric estimate of the marginal distribution function of <span><math><mi>Y</mi></math></span>. As its asymptotic null distribution is not distribution-free, a parametric bootstrap method is used to determine critical values. Empirical results suggest that, in certain scenarios, the test outperforms existing specification tests by achieving a higher power and thereby offering greater sensitivity to deviations from the assumed parametric distribution family. Notably, the proposed test does not involve any hyperparameters and can easily be applied to individual datasets using the gofreg-package in R.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"215 ","pages":"Article 108289"},"PeriodicalIF":1.6,"publicationDate":"2025-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145270816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-26 | DOI: 10.1016/j.csda.2025.108290
Konstantin Emil Thiel, Paavo Sattler, Arne C. Bathke, Georg Zimmermann
Analysis of covariance is a crucial method for improving the precision of statistical tests for factor effects in randomized experiments. However, existing solutions suffer from one or more of the following limitations: (i) they are not suitable for ordinal data (as endpoints or explanatory variables); (ii) they require semiparametric model assumptions; (iii) they are inapplicable to small data scenarios due to often poor type-I error control; or (iv) they provide only approximate testing procedures, and (asymptotically) exact tests are missing. A resampling approach to the NANCOVA framework is investigated. NANCOVA is a fully nonparametric model based on relative effects that allows for an arbitrary number of covariates and groups, where both the outcome variable (endpoint) and the covariates can be metric or ordinal. Novel NANCOVA tests and a nonparametric competitor test without covariate adjustment were evaluated in extensive simulations. Unlike approximate tests in the NANCOVA framework, the proposed resampling version showed good performance in small sample scenarios and maintained the nominal type-I error well. Resampling NANCOVA also provided consistently high power: up to 26% higher than the test without covariate adjustment in a small sample scenario with four groups and two covariates. Moreover, it is shown that resampling NANCOVA provides an asymptotically exact testing procedure, which makes it the first one with good finite sample performance in the present NANCOVA framework. In summary, resampling NANCOVA can be considered a viable tool for analysis of covariance overcoming issues (i)-(iv).
{"title":"Resampling NANCOVA: Nonparametric analysis of covariance in small samples","authors":"Konstantin Emil Thiel , Paavo Sattler , Arne C. Bathke , Georg Zimmermann","doi":"10.1016/j.csda.2025.108290","DOIUrl":"10.1016/j.csda.2025.108290","url":null,"abstract":"<div><div>Analysis of covariance is a crucial method for improving precision of statistical tests for factor effects in randomized experiments. However, existing solutions suffer from one or more of the following limitations: (i) they are not suitable for ordinal data (as endpoints or explanatory variables); (ii) they require semiparametric model assumptions; (iii) they are inapplicable to small data scenarios due to often poor type-I error control; or (iv) they provide only approximate testing procedures and (asymptotically) exact test are missing. A resampling approach to the NANCOVA framework is investigated. NANCOVA is a fully nonparametric model based on <em>relative effects</em> that allows for an arbitrary number of covariates and groups, where both outcome variable (endpoint) and covariates can be metric or ordinal. Novel NANCOVA tests and a nonparametric competitor test without covariate adjustment were evaluated in extensive simulations. Unlike approximate tests in the NANCOVA framework, the proposed resampling version showed good performance in small sample scenarios and maintained the nominal type-I error well. Resampling NANCOVA also provided consistently high power: up to 26 % higher than the test without covariate adjustment in a small sample scenario with 4 groups and two covariates. Moreover, it is shown that resampling NANCOVA provides an asymptotically exact testing procedure, which makes it the first one with good finite sample performance in the present NANCOVA framework. In summary, resampling NANCOVA can be considered a viable tool for analysis of covariance overcoming issues (i) - (iv).</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"215 ","pages":"Article 108290"},"PeriodicalIF":1.6,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145270814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-24 | DOI: 10.1016/j.csda.2025.108278
Modibo Diabaté, Grégory Nuel, Olivier Bouaziz
The problem of breakpoint detection is considered within a regression modeling framework. A novel method, the max-EM algorithm, is introduced, combining a constrained Hidden Markov Model with the Classification-EM algorithm. This algorithm has linear complexity and provides accurate detection of breakpoints and estimation of parameters. A theoretical result is derived, showing that the likelihood of the data, as a function of the regression parameters and the breakpoint locations, increases at each step of the algorithm. Two initialization methods for the breakpoint locations are also presented to address local maxima issues. Finally, a statistical test for the single-breakpoint situation is developed. Simulation experiments based on linear, logistic, Poisson and Accelerated Failure Time regression models show that the final method, which combines the initialization procedure with the max-EM algorithm, performs strongly both in terms of parameter estimation and breakpoint detection. The statistical test is also evaluated and exhibits a correct rejection rate under the null hypothesis and strong power under various alternatives. Two real datasets are analyzed, the UCI bike sharing data and the health disease data, illustrating the method's ability to detect heterogeneity in the distribution of the data.
{"title":"Change-point detection in regression models via the max-EM algorithm","authors":"Modibo Diabaté , Grégory Nuel , Olivier Bouaziz","doi":"10.1016/j.csda.2025.108278","DOIUrl":"10.1016/j.csda.2025.108278","url":null,"abstract":"<div><div>The problem of breakpoint detection is considered within a regression modeling framework. A novel method, the max-EM algorithm, is introduced, combining a constrained Hidden Markov Model with the Classification-EM algorithm. This algorithm has linear complexity and provides accurate detection of breakpoints and estimation of parameters. A theoretical result is derived, showing that the likelihood of the data, as a function of the regression parameters and the breakpoints location, increases at each step of the algorithm. Two initialization methods for the breakpoints location are also presented to address local maxima issues. Finally, a statistical test in the one breakpoint situation is developed. Simulation experiments based on linear, logistic, Poisson and Accelerated Failure Time regression models show that the final method that includes the initialization procedure and the max-EM algorithm has a strong performance both in terms of parameters estimation and breakpoints detection. The statistical test is also evaluated and exhibits a correct rejection rate under the null hypothesis and a strong power under various alternatives. Two real dataset are analyzed, the UCI bike sharing and the health disease data, where the interest of the method to detect heterogeneity in the distribution of the data is illustrated.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"215 ","pages":"Article 108278"},"PeriodicalIF":1.6,"publicationDate":"2025-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145270817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-22 | DOI: 10.1016/j.csda.2025.108281
Miaomiao Su
Estimating the average treatment effect in large-scale datasets faces significant computational and storage challenges. Subsampling has emerged as a critical strategy to mitigate these issues. This paper proposes a novel subsampling method that builds on the G-estimation method, which offers the double robustness property. The proposed method uses a small subset of the data to estimate computationally complex nuisance parameters, while leveraging the full dataset for the computationally simple final estimation. To ensure that the resulting estimator remains first-order insensitive to variations in the nuisance parameters, a projection approach is introduced to optimize the estimation of the outcome regression function and the treatment regression function such that the Neyman orthogonality conditions are satisfied. It is shown that the resulting estimator is asymptotically normal and achieves the same convergence rate as the full data-based estimator when either the treatment or the outcome model is correctly specified. Additionally, when both models are correctly specified, the proposed estimator achieves the same asymptotic variance as the full data-based estimator. The finite sample performance of the proposed method is demonstrated through simulation studies and an application to birth data comprising over 30 million observations collected over the past eight years. Numerical results indicate that the proposed estimator is nearly as computationally efficient as the uniform subsampling estimator, while achieving estimation efficiency similar to that of the full data-based G-estimator.
{"title":"Fast and efficient causal inference in large-scale data via subsampling and projection calibration","authors":"Miaomiao Su","doi":"10.1016/j.csda.2025.108281","DOIUrl":"10.1016/j.csda.2025.108281","url":null,"abstract":"<div><div>Estimating the average treatment effect in large-scale datasets faces significant computational and storage challenges. Subsampling has emerged as a critical strategy to mitigate these issues. This paper proposes a novel subsampling method that builds on the G-estimation method offering the double robustness property. The proposed method uses a small subset of data to estimate computationally complex nuisance parameters, while leveraging the full dataset for the computationally simple final estimation. To ensure that the resulting estimator remains first-order insensitive to variations in nuisance parameters, a projection approach is introduced to optimize the estimation of the outcome regression function and treatment regression function such that the Neyman orthogonality conditions are satisfied. It is shown that the resulting estimator is asymptotically normal and achieves the same convergence rate as the full data-based estimator when either the treatment or the outcome models is correctly specified. Additionally, when both models are correctly specified, the proposed estimator achieves the same asymptotic variance as the full data-based estimator. The finite sample performance of the proposed method is demonstrated through simulation studies and an application to birth data, comprising over 30 million observations collected over the past eight years. Numerical results indicate that the proposed estimator is nearly as computationally efficient as the uniform subsampling estimator, while achieving similar estimation efficiency to the full data-based G-estimator.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"214 ","pages":"Article 108281"},"PeriodicalIF":1.6,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145158570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-21 | DOI: 10.1016/j.csda.2025.108277
Alexandre Wendling, Clovis Galiez
The analysis of binary outcomes and features, such as the effect of vaccination on health, often relies on 2 × 2 contingency tables. However, confounding factors such as age or gender call for stratified analysis, carried out by creating sub-tables, which is common in bioscience, epidemiological, and social research, as well as in meta-analyses. Traditional methods for testing associations across strata, such as the Cochran-Mantel-Haenszel (CMH) test, struggle with small sample sizes and heterogeneity of effects between strata. Exact tests can address these issues but are computationally expensive. To address these challenges, the Gamma Approximation of Stratified Truncated Exact (GASTE) test is proposed. It leverages the gamma distribution to approximate the distribution, under stratification, of the exact combination statistic of p-values with discrete support, providing fast and accurate p-value calculations even when effects vary between strata. The GASTE method maintains high statistical power and low type I error rates, outperforming traditional methods by offering more sensitive and reliable detection. It is computationally efficient and broadens the applicability of exact tests in research fields with stratified binary data. The GASTE method is demonstrated through two applications, an ecological study of Alpine plant associations and a 1973 case study on admissions at the University of California, Berkeley, where it offers substantial improvements over traditional approaches. The GASTE method is available as an open-source package at https://github.com/AlexandreWen/gaste, and a Python package is available on PyPI at https://pypi.org/project/gaste-test/.
{"title":"Gamma approximation of stratified truncated exact test (GASTE-test) & application","authors":"Alexandre Wendling, Clovis Galiez","doi":"10.1016/j.csda.2025.108277","DOIUrl":"10.1016/j.csda.2025.108277","url":null,"abstract":"<div><div>The analysis of binary outcomes and features, such as the effect of vaccination on health, often rely on 2 <span><math><mo>×</mo></math></span> 2 contingency tables. However, confounding factors such as age or gender call for stratified analysis, by creating sub-tables, which is common in bioscience, epidemiological, and social research, as well as in meta-analyses. Traditional methods for testing associations across strata, such as the Cochran-Mantel-Haenszel (CMH) test, struggle with small sample sizes and heterogeneity of effects between strata. Exact tests can address these issues, but are computationally expensive. To address these challenges, the Gamma Approximation of Stratified Truncated Exact (GASTE) test is proposed. It approximates the exact statistic of the combination of p-values with discrete support, leveraging the gamma distribution to approximate the distribution of the test statistic under stratification, providing fast and accurate p-value calculations, even when effects vary between strata. The GASTE method maintains high statistical power and low type I error rates, outperforming traditional methods by offering more sensitive and reliable detection. It is computationally efficient and broadens the applicability of exact tests in research fields with stratified binary data. The GASTE method is demonstrated through two applications: an ecological study of Alpine plant associations and a 1973 case study on admissions at the University of California, Berkeley. The GASTE method offers substantial improvements over traditional approaches. The GASTE method is available as an open-source package at <span><span>https://github.com/AlexandreWen/gaste</span><svg><path></path></svg></span>. A Python package is available on PyPI at <span><span>https://pypi.org/project/gaste-test/</span><svg><path></path></svg></span></div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"214 ","pages":"Article 108277"},"PeriodicalIF":1.6,"publicationDate":"2025-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145221243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-21 | DOI: 10.1016/j.csda.2025.108279
Xiaoli Kong, Alejandro Villasante-Tezanos, David W. Fardo, Solomon W. Harrar
High-dimensional data is ubiquitous in studies involving omics, human movement, and imaging. A multivariate comparison method is proposed for such data when either the dimension or the replication size substantially exceeds the other. A testing procedure is introduced that centers and scales a composite distance-based statistic among the samples to appropriately account for high dimensions and/or large sample sizes. The properties of the test statistic are examined both theoretically and empirically. The proposed procedure demonstrates superior performance in simulation studies and in an application confirming the involvement of previously identified genes in the stages of invasive breast cancer.
{"title":"Generalized composite multi-sample tests for high-dimensional data","authors":"Xiaoli Kong , Alejandro Villasante-Tezanos , David W. Fardo , Solomon W. Harrar","doi":"10.1016/j.csda.2025.108279","DOIUrl":"10.1016/j.csda.2025.108279","url":null,"abstract":"<div><div>High-dimensional data is ubiquitous in studies involving omics, human movement, and imaging. A multivariate comparison method is proposed for such types of data when either the dimension or the replication size substantially exceeds the other. A testing procedure is introduced that centers and scales a composite measure of distance statistic among the samples to appropriately account for high dimensions and/or large sample sizes. The properties of the test statistic are examined both theoretically and empirically. The proposed procedure demonstrates superior performance in simulation studies and an application to confirm the involvement of previously identified genes in the stages of invasive breast cancer.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"214 ","pages":"Article 108279"},"PeriodicalIF":1.6,"publicationDate":"2025-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145158571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-16 | DOI: 10.1016/j.csda.2025.108275
Lorenzo Cappello, Stephen G. Walker
A recursive algorithm is proposed to estimate a set of distribution functions indexed by a regressor variable. The procedure is fully nonparametric and has a Bayesian motivation and interpretation. Indeed, the recursive algorithm follows a certain Bayesian update, defined by the predictive distribution of a Dirichlet process mixture of linear regression models. Consistency of the algorithm is demonstrated under mild assumptions, and numerical accuracy in finite samples is shown via simulations and real data examples. The algorithm is very fast to implement; it is parallelizable, sequential, and requires limited computing power.
{"title":"Recursive nonparametric predictive for a discrete regression model","authors":"Lorenzo Cappello , Stephen G. Walker","doi":"10.1016/j.csda.2025.108275","DOIUrl":"10.1016/j.csda.2025.108275","url":null,"abstract":"<div><div>A recursive algorithm is proposed to estimate a set of distribution functions indexed by a regressor variable. The procedure is fully nonparametric and has a Bayesian motivation and interpretation. Indeed, the recursive algorithm follows a certain Bayesian update, defined by the predictive distribution of a Dirichlet process mixture of linear regression models. Consistency of the algorithm is demonstrated under mild assumptions, and numerical accuracy in finite samples is shown via simulations and real data examples. The algorithm is very fast to implement, it is parallelizable, sequential, and requires limited computing power.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"215 ","pages":"Article 108275"},"PeriodicalIF":1.6,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145227724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-15 | DOI: 10.1016/j.csda.2025.108274
Chih-Hao Chang, Takeshi Emura, Shih-Feng Huang
This paper presents an innovative iterative two-stage algorithm designed for estimating threshold boundary regression (TBR) models. By transforming the non-differentiable least-squares (LS) problem inherent in fitting TBR models into an optimization framework, our algorithm alternates between optimizing a weighted classification error function for the threshold model and obtaining LS estimators for the regression models. To improve the efficiency and flexibility of TBR model estimation, we integrate the weighted support vector machine (WSVM) as a surrogate method for solving the weighted classification problem. The TBR-WSVM algorithm offers several key advantages over recently developed methods: it eliminates pre-specification requirements for the threshold parameters, accommodates flexible estimation of nonlinear threshold boundaries, and streamlines the estimation process. We conducted several simulation studies to illustrate the finite-sample performance of TBR-WSVM. Finally, we demonstrate the practical applicability of the TBR model through a real data analysis.
{"title":"An algorithm for estimating threshold boundary regression models","authors":"Chih-Hao Chang , Takeshi Emura , Shih-Feng Huang","doi":"10.1016/j.csda.2025.108274","DOIUrl":"10.1016/j.csda.2025.108274","url":null,"abstract":"<div><div>This paper presents an innovative iterative two-stage algorithm designed for estimating threshold boundary regression (TBR) models. By transforming the non-differentiable least-squares (LS) problem inherent in fitting TBR models into an optimization framework, our algorithm combines the optimization of a weighted classification error function for the threshold model with obtaining LS estimators for regression models. To improve the efficiency and flexibility of TBR model estimation, we integrate the weighted support vector machine (WSVM) as a surrogate method for solving the weighted classification problem. The TBR-WSVM algorithm offers several key advantages over recently developed methods: it eliminates pre-specification requirements for threshold parameters, accommodates flexible estimation of nonlinear threshold boundaries, and streamlines the estimation process. We conducted several simulation studies to illustrate the finite-sample performance of TBR-WSVM. Finally, we demonstrate the practical applicability of the TBR model through a real data analysis.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"214 ","pages":"Article 108274"},"PeriodicalIF":1.6,"publicationDate":"2025-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145099734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-11 | DOI: 10.1016/j.csda.2025.108273
Valentin Patilea, Sunny G. W. Wang
The computation of integrals is a fundamental task in the analysis of functional data, where the data are typically considered as random elements in a space of square-integrable functions. Effective unbiased estimation and inference procedures are proposed for integrals of uni- and multivariate random functions. Applications to key problems in functional data analysis involving random design points are examined and illustrated. In the absence of noise, the proposed estimates converge faster than the sample mean and standard numerical integration algorithms. The estimator also supports effective inference by generally providing better coverage with shorter confidence and prediction intervals in both noisy and noiseless settings.
{"title":"Rate accelerated inference for integrals of multivariate random functions","authors":"Valentin Patilea, Sunny G․ W․ Wang","doi":"10.1016/j.csda.2025.108273","DOIUrl":"10.1016/j.csda.2025.108273","url":null,"abstract":"<div><div>The computation of integrals is a fundamental task in the analysis of functional data, where the data are typically considered as random elements in a space of squared integrable functions. Effective unbiased estimation and inference procedures are proposed for integrals of uni- and multivariate random functions. Applications to key problems in functional data analysis involving random design points are examined and illustrated. In the absence of noise, the proposed estimates converge faster than the sample mean and standard numerical integration algorithms. The estimator also supports effective inference by generally providing better coverage with shorter confidence and prediction intervals in both noisy and noiseless settings.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"214 ","pages":"Article 108273"},"PeriodicalIF":1.6,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145099732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-10 | DOI: 10.1016/j.csda.2025.108272
Hui Chen, Chengde Qian, Qin Zhou
Robust quantification of uncertainty regarding the number of change-points presents a significant challenge in data analysis, particularly when employing false discovery rate (FDR) control techniques. Emphasizing the detection of genuine signals while controlling false positives is crucial, especially for identifying shifts in location parameters within flexible distributions. Traditional parametric methods often exhibit sensitivity to outliers and heavy-tailed data. To address this limitation, a robust method accommodating diverse data structures is proposed. The approach constructs component-wise sign-based statistics, and the global symmetry inherent in these statistics is leveraged to derive data-driven thresholds suitable for multiple testing scenarios. The method is developed within the framework of U-statistics, which naturally encompasses existing cumulative sum-based procedures. Theoretical guarantees establish FDR control for the component-wise sign-based method under mild assumptions. Its effectiveness is demonstrated through simulations with synthetic data and analyses of real data.
{"title":"Robust selection of the number of change-points via FDR control","authors":"Hui Chen , Chengde Qian , Qin Zhou","doi":"10.1016/j.csda.2025.108272","DOIUrl":"10.1016/j.csda.2025.108272","url":null,"abstract":"<div><div>Robust quantification of uncertainty regarding the number of change-points presents a significant challenge in data analysis, particularly when employing false discovery rate (FDR) control techniques. Emphasizing the detection of genuine signals while controlling false positives is crucial, especially for identifying shifts in location parameters within flexible distributions. Traditional parametric methods often exhibit sensitivity to outliers and heavy-tailed data. Addressing this limitation, a robust method accommodating diverse data structures is proposed. The approach constructs component-wise sign-based statistics. Leveraging the global symmetry inherent in these statistics enables the derivation of data-driven thresholds suitable for multiple testing scenarios. Method development occurs within the framework of U-statistics, which naturally encompasses existing cumulative sum-based procedures. Theoretical guarantees establish FDR control for the component-wise sign-based method under mild assumptions. Demonstrations of effectiveness utilize simulations with synthetic data and analyses of real data.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"214 ","pages":"Article 108272"},"PeriodicalIF":1.6,"publicationDate":"2025-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145099733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}