Pub Date: 2025-10-16 | DOI: 10.1016/j.csda.2025.108292
Zejing Zheng, Shengbing Zheng, Junlong Zhao
Discrete responses are frequently encountered in applications, particularly in classification problems. However, the high cost of collecting responses or labels often leads to a scarcity of samples, which significantly diminishes the accuracy of statistical inferences, particularly in high-dimensional settings. To address this limitation, transfer learning can be utilized for high-dimensional data with discrete responses by incorporating relevant source data into the target study of interest. Within the framework of generalized linear models, the case where responses are bounded is first considered, and an importance-weighted transfer learning method, referred to as IWTL-DR, is proposed. This method selects data at the individual level, thereby utilizing the source data more efficiently. Subsequently, this approach is extended to scenarios involving unbounded responses. Theoretical properties of the IWTL-DR method are established and compared with existing techniques. Extensive simulations and analyses of real data show the advantages of our approach.
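The individual-level importance-weighting idea can be illustrated with a self-contained toy example: source observations are reweighted by a target-versus-source covariate density ratio before fitting a weighted logistic regression. This is not the paper's IWTL-DR procedure; in particular, the density ratio here is computed from known Gaussian covariate distributions, whereas in practice it would have to be estimated.

```python
import numpy as np

def weighted_logistic_fit(X, y, w, lr=0.3, n_iter=3000):
    """Logistic regression fitted by weighted gradient descent."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        beta -= lr * (X.T @ (w * (p - y))) / w.sum()
    return beta

rng = np.random.default_rng(0)
beta_true = np.array([1.5, -2.0])

# Small labelled target sample, larger covariate-shifted source sample.
n_tgt, n_src = 40, 400
X_tgt = rng.normal(0.0, 1.0, (n_tgt, 2))
X_src = rng.normal(0.5, 1.0, (n_src, 2))

def draw_labels(X):
    return (rng.random(len(X)) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)

y_tgt, y_src = draw_labels(X_tgt), draw_labels(X_src)

# Individual-level importance weights: target/source covariate density ratio.
# Both Gaussian densities are known in this toy; in practice the ratio would
# itself be estimated from data.
def log_gauss(X, mu):
    return -0.5 * ((X - mu) ** 2).sum(axis=1)

w_src = np.exp(log_gauss(X_src, 0.0) - log_gauss(X_src, 0.5))

X_all = np.vstack([X_tgt, X_src])
y_all = np.concatenate([y_tgt, y_src])
w_all = np.concatenate([np.ones(n_tgt), w_src])

beta_hat = weighted_logistic_fit(X_all, y_all, w_all)  # close to beta_true
```

Because the logistic model is correctly specified, the weighted fit on the pooled sample recovers the target coefficients far more precisely than the 40 target observations alone could.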
Title: "Transfer learning for high dimensional data with discrete responses" (Computational Statistics & Data Analysis, Vol. 215, Article 108292)
Pub Date: 2025-10-15 | DOI: 10.1016/j.csda.2025.108291
Lei Qin , Xiaomei Zhang , Yingqiu Zhu , Yang Chen , Ben-Chang Shia
As time series with matrix structures become increasingly common in finance, economics, and management, modeling matrix-valued time series has emerged as a research hotspot. Spatial effects arising from different locations play an important role in the analysis of time series. Although the matrix autoregressive (MAR) model provides a promising solution for modeling matrix-valued time series, it only models the dynamic effects in the temporal dimension, without capturing the spatial effects. In this paper, we propose a bilateral matrix spatiotemporal autoregressive model (BMSAR), which fully considers the pure spatial effects, pure dynamic effects, and time-delayed spatial effects while maintaining and utilizing the matrix structure. To address the endogeneity problem, the estimation procedure for BMSAR combines the least squares method with the Yule-Walker equation in an iterative scheme. The simulation results show that, compared with the MAR, the BMSAR model effectively reflects the impact of spatial structure on the sequence observations. The estimator for BMSAR proposed in this paper is consistent and achieves promising performance when the sample size is relatively large. The proposed model and algorithm are also verified using trade and macroeconomic indicator datasets for the G7 countries, and the prediction accuracy is significantly improved compared with existing models.
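For intuition about the matrix-structured dynamics underlying BMSAR, a plain MAR(1) recursion X_t = A X_{t-1} B' + E_t can be simulated and fitted by alternating least squares. This sketch omits the spatial and time-delayed spatial terms that BMSAR adds, and (A, B) are only identified up to a scale factor, so recovery is checked on the Kronecker product B ⊗ A.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, T = 3, 4, 2000
A = 0.6 * np.eye(m) + 0.1 * rng.normal(size=(m, m))   # row-direction dynamics
B = 0.6 * np.eye(n) + 0.1 * rng.normal(size=(n, n))   # column-direction dynamics

# Simulate X_t = A X_{t-1} B' + E_t, keeping the matrix structure throughout.
X = np.zeros((T, m, n))
for t in range(1, T):
    X[t] = A @ X[t - 1] @ B.T + 0.1 * rng.normal(size=(m, n))

# Alternating least squares: solve for A given B, then B given A.
A_hat, B_hat = np.eye(m), np.eye(n)
for _ in range(50):
    Z = X[:-1] @ B_hat.T                               # Z_t = X_{t-1} B'
    S1 = np.einsum('tij,tkj->ik', X[1:], Z)            # sum_t X_t Z_t'
    S2 = np.einsum('tij,tkj->ik', Z, Z)                # sum_t Z_t Z_t'
    A_hat = S1 @ np.linalg.inv(S2)
    W = A_hat @ X[:-1]                                 # W_t = A X_{t-1}
    S3 = np.einsum('tji,tjk->ik', X[1:], W)            # sum_t X_t' W_t
    S4 = np.einsum('tji,tjk->ik', W, W)                # sum_t W_t' W_t
    B_hat = S3 @ np.linalg.inv(S4)
```

Since vec(A X B') = (B ⊗ A) vec(X), the scale ambiguity cancels in the Kronecker product, which is what the fitted pair should reproduce.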
Title: "Bilateral matrix spatiotemporal autoregressive model" (Computational Statistics & Data Analysis, Vol. 215, Article 108291)
Pub Date: 2025-10-10 | DOI: 10.1016/j.csda.2025.108276
Claire Donnat , Olga Klopp , Nicolas Verzelen
A novel method is introduced for denoising partially observed signals over networks using graph total variation (TV) regularization, a technique adapted from signal processing to handle binary data. This approach extends existing results derived for Gaussian data to the discrete, binary case — a method hereafter referred to as “one-bit TV denoising.” The framework considers a network represented as a set of nodes with binary observations, where edges encode pairwise relationships between nodes. A key theoretical contribution is the establishment of consistency guarantees of graph TV denoising for the recovery of underlying node-level probabilities. The method is well suited for settings with missing data, enabling robust inference from incomplete observations. Extensive numerical experiments and real-world applications further highlight its effectiveness, underscoring its potential in various practical scenarios that require denoising and prediction on networks with binary-valued data. Finally, applications to two real-world epidemic scenarios demonstrate that one-bit total variation denoising significantly enhances the accuracy of network-based nowcasting and forecasting.
Title: "Denoising over networks with applications to partially observed epidemics" (Computational Statistics & Data Analysis, Vol. 215, Article 108276)
Pub Date: 2025-10-03 | DOI: 10.1016/j.csda.2025.108288
Jia-Han Shih , Yi-Hau Chen
A regression association measure is proposed for capturing the predictability of a multivariate outcome Y = (Y1, …, Yd) from a multivariate covariate X = (X1, …, Xp). Motivated by existing measures, the conventional Kendall’s tau is first generalized to measure multivariate association between two random vectors. Then the predictability of Y from X is measured by the generalized multivariate Kendall’s tau between Y and Y′, where Y and Y′ share the same conditional distribution and are conditionally independent given X. The proposed regression association measure can be expressed as the proportion of the variance of a function of Y that can be explained by X, indicating that the measure has a direct interpretation in terms of predictability. Based on the proposed measure, a conditional regression association measure is further proposed, which can be utilized to perform variable selection. Since the proposed measures are based on Y and Y′, a simple nonparametric estimation method based on nearest neighbors is available. An R package, MRAM, has been developed for implementation. Simulation studies are carried out to assess the performance of the proposed methods and real data examples are analyzed for illustration.
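The nearest-neighbor construction can be illustrated as follows: each Y′_i is proxied by the response of the nearest covariate neighbor of X_i, and a spatial-sign Kendall-type statistic is averaged over pairs. The exact form of the generalized tau follows the paper; the variant below is an assumption meant only to show why such a statistic is large when Y is predictable from X and near zero otherwise.

```python
import numpy as np

def spatial_sign(U):
    norms = np.linalg.norm(U, axis=-1, keepdims=True)
    return np.where(norms > 0, U / np.maximum(norms, 1e-12), 0.0)

def nn_association(X, Y):
    """Spatial-sign Kendall-type statistic, with Y' proxied by the response
    of each point's nearest covariate neighbour."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    Yp = Y[D.argmin(axis=1)]                      # nearest-neighbour proxy for Y'
    i, k = np.triu_indices(n, 1)
    s1 = spatial_sign(Y[i] - Y[k])
    s2 = spatial_sign(Yp[i] - Yp[k])
    return np.mean(np.sum(s1 * s2, axis=-1))

rng = np.random.default_rng(3)
n = 300
X = rng.normal(size=(n, 2))
Y_dep = np.stack([X[:, 0] + X[:, 1], X[:, 0] - X[:, 1]], axis=1)
Y_dep = Y_dep + 0.1 * rng.normal(size=(n, 2))     # Y strongly predictable from X
Y_ind = rng.normal(size=(n, 2))                   # Y independent of X
```

For the dependent pair the statistic is close to 1 (Y and its neighbor proxy nearly coincide), while for the independent pair it hovers around 0.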
Title: "Measuring multivariate regression association via spatial sign" (Computational Statistics & Data Analysis, Vol. 215, Article 108288)
Pub Date: 2025-09-29 | DOI: 10.1016/j.csda.2025.108280
Hwiyoung Lee , Zhenyao Ye , Chixiang Chen , Peter Kochunov , L. Elliot Hong , Shuo Chen
Association analysis of multivariate omics outcomes is challenging due to the high dimensionality and inter-correlation among outcome variables. In practice, the classic multi-univariate analysis approaches are commonly employed, utilizing linear regression models for each individual outcome followed by adjustments for multiplicity through control of the false discovery rate (FDR) or family-wise error rate (FWER). While straightforward, these multi-univariate methods overlook dependencies between outcome variables. This oversight leads to less accurate statistical inferences, characterized by lower power and an increased false discovery rate, ultimately resulting in reduced replicability across studies. Recently, advanced frequentist and Bayesian methods have been developed to account for these dependencies. However, these methods often pose significant computational challenges for researchers in the field. To bridge this gap, a computationally efficient autoregressive multivariate regression model is proposed that explicitly accounts for the dependence structure among outcome variables. Through extensive simulations, it is demonstrated that the approach provides more accurate multivariate inferences than traditional methods and remains robust even under model misspecification. Additionally, the proposed method is applied to investigate whether the associations between serum lipidomics outcomes and Alzheimer’s disease differ between ε4 allele carriers and non-carriers of the apolipoprotein E (APOE) gene.
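One simple way to exploit dependence among ordered outcomes is to regress each outcome on the covariates plus the preceding outcome. Whether this matches the paper's autoregressive specification is an assumption; the sketch below only illustrates the general idea that conditioning on neighboring outcomes captures their correlation at negligible computational cost.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 2000, 3                                   # subjects, ordered outcomes
X = rng.normal(size=(n, 2))
beta = np.array([[1.0, -0.5], [0.5, 1.0], [-1.0, 0.5]])  # per-outcome effects
rho = 0.6                                        # carry-over from previous outcome

Y = np.zeros((n, d))
Y[:, 0] = X @ beta[0] + rng.normal(size=n)
for j in range(1, d):
    Y[:, j] = X @ beta[j] + rho * Y[:, j - 1] + rng.normal(size=n)

# Fit outcome j by OLS on the covariates plus the preceding outcome.
est = []
for j in range(d):
    Z = X if j == 0 else np.column_stack([X, Y[:, j - 1]])
    coef, *_ = np.linalg.lstsq(Z, Y[:, j], rcond=None)
    est.append(coef)
```

Each fit is an ordinary least-squares problem, so the whole procedure scales linearly in the number of outcomes while still recovering the dependence parameter.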
Title: "Fast autoregressive model for multivariate dependent outcomes with application to lipidomics analysis for Alzheimer’s disease and APOE-ε4" (Computational Statistics & Data Analysis, Vol. 215, Article 108280)
Pub Date: 2025-09-27 | DOI: 10.1016/j.csda.2025.108289
Gitte Kremling, Gerhard Dikta
A consistent goodness-of-fit test for distributional regression is introduced. The test statistic is based on a process that traces the difference between a nonparametric and a semi-parametric estimate of the marginal distribution function of Y. As its asymptotic null distribution is not distribution-free, a parametric bootstrap method is used to determine critical values. Empirical results suggest that, in certain scenarios, the test outperforms existing specification tests by achieving higher power and thereby offering greater sensitivity to deviations from the assumed parametric distribution family. Notably, the proposed test does not involve any hyperparameters and can easily be applied to individual datasets using the gofreg package in R.
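The parametric-bootstrap logic can be sketched for a Gaussian linear model: compare the empirical marginal CDF of Y with the semi-parametric estimate implied by the fitted model, then calibrate the statistic by refitting on data simulated from that fitted model. The KS-type distance and the Gaussian family are illustrative choices, not the process-based statistic of the paper.

```python
import numpy as np
from math import erf

def norm_cdf(z):
    return 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))

def gof_stat(X, y):
    """KS-type distance between the empirical marginal CDF of y and the
    marginal CDF implied by a fitted Gaussian linear model."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma = (y - X @ beta).std()
    ys = np.sort(y)
    F_emp = np.arange(1, len(y) + 1) / len(y)
    F_mod = norm_cdf((ys[:, None] - (X @ beta)[None, :]) / sigma).mean(axis=1)
    return np.max(np.abs(F_emp - F_mod)), beta, sigma

def bootstrap_pvalue(X, y, B=99, seed=0):
    rng = np.random.default_rng(seed)
    stat, beta, sigma = gof_stat(X, y)
    exceed = 0
    for _ in range(B):                 # refit on data simulated under H0
        y_b = X @ beta + sigma * rng.normal(size=len(y))
        exceed += gof_stat(X, y_b)[0] >= stat
    return (1 + exceed) / (B + 1)

rng = np.random.default_rng(5)
n = 120
X = np.column_stack([np.ones(n), rng.normal(size=n)])
mean = X @ np.array([1.0, 0.3])
y_null = mean + rng.normal(size=n)                          # model holds
s = rng.choice([-1.0, 1.0], size=n)
y_alt = mean + 1.2 * s + 0.4 * rng.normal(size=n)           # bimodal errors
p_null = bootstrap_pvalue(X, y_null)
p_alt = bootstrap_pvalue(X, y_alt)
```

Simulating from the fitted model mimics the estimation effect in the null distribution, which is exactly what makes the bootstrap calibration valid despite the statistic not being distribution-free.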
Title: "Bootstrap-based goodness-of-fit test for parametric families of conditional distributions" (Computational Statistics & Data Analysis, Vol. 215, Article 108289)
Pub Date: 2025-09-26 | DOI: 10.1016/j.csda.2025.108290
Konstantin Emil Thiel , Paavo Sattler , Arne C. Bathke , Georg Zimmermann
Analysis of covariance is a crucial method for improving the precision of statistical tests for factor effects in randomized experiments. However, existing solutions suffer from one or more of the following limitations: (i) they are not suitable for ordinal data (as endpoints or explanatory variables); (ii) they require semiparametric model assumptions; (iii) they are inapplicable to small data scenarios due to often poor type-I error control; or (iv) they provide only approximate testing procedures and (asymptotically) exact tests are missing. A resampling approach to the NANCOVA framework is investigated. NANCOVA is a fully nonparametric model based on relative effects that allows for an arbitrary number of covariates and groups, where both the outcome variable (endpoint) and the covariates can be metric or ordinal. Novel NANCOVA tests and a nonparametric competitor test without covariate adjustment were evaluated in extensive simulations. Unlike approximate tests in the NANCOVA framework, the proposed resampling version showed good performance in small-sample scenarios and maintained the nominal type-I error well. Resampling NANCOVA also provided consistently high power: up to 26% higher than the test without covariate adjustment in a small-sample scenario with 4 groups and two covariates. Moreover, it is shown that resampling NANCOVA provides an asymptotically exact testing procedure, which makes it the first one with good finite-sample performance in the present NANCOVA framework. In summary, resampling NANCOVA can be considered a viable tool for analysis of covariance overcoming issues (i)-(iv).
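Relative effects, the building block of NANCOVA, are rank-based and therefore well defined for ordinal data. Below is the standard two-sample midrank estimator of p = P(X < Y) + 0.5 P(X = Y); this is only the elementary ingredient, not the covariate-adjusted NANCOVA procedure or its resampling calibration.

```python
import numpy as np

def midranks(z):
    """Ranks with ties replaced by the average (mid) rank."""
    order = np.argsort(z, kind="mergesort")
    zs = z[order]
    ranks = np.empty(len(z))
    i = 0
    while i < len(z):
        j = i
        while j + 1 < len(z) and zs[j + 1] == zs[i]:
            j += 1                                  # extend the tie run
        ranks[order[i:j + 1]] = (i + j) / 2 + 1     # average rank for the run
        i = j + 1
    return ranks

def relative_effect(x, y):
    """Estimate p = P(X < Y) + 0.5 P(X = Y) from pooled midranks."""
    r = midranks(np.concatenate([x, y]))
    m, n = len(x), len(y)
    return (r[m:].mean() - (n + 1) / 2) / m

x = np.array([1, 1, 2, 3])     # e.g. ordinal scores in group 1
y = np.array([2, 3, 3, 4])     # ordinal scores in group 2
p_hat = relative_effect(x, y)  # 13.5/16 = 0.84375: y tends to be larger
```

The hand count of the 16 pairs gives 13 wins and one tie for y, i.e. 13.5/16, matching the midrank formula; swapping the groups gives the complementary value 1 - p.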
Title: "Resampling NANCOVA: Nonparametric analysis of covariance in small samples" (Computational Statistics & Data Analysis, Vol. 215, Article 108290)
Pub Date: 2025-09-24 | DOI: 10.1016/j.csda.2025.108278
Modibo Diabaté , Grégory Nuel , Olivier Bouaziz
The problem of breakpoint detection is considered within a regression modeling framework. A novel method, the max-EM algorithm, is introduced, combining a constrained Hidden Markov Model with the Classification-EM algorithm. This algorithm has linear complexity and provides accurate detection of breakpoints and estimation of parameters. A theoretical result is derived, showing that the likelihood of the data, as a function of the regression parameters and the breakpoint locations, increases at each step of the algorithm. Two initialization methods for the breakpoint locations are also presented to address local maxima issues. Finally, a statistical test for the single-breakpoint situation is developed. Simulation experiments based on linear, logistic, Poisson and Accelerated Failure Time regression models show that the final method, which includes the initialization procedure and the max-EM algorithm, performs strongly in terms of both parameter estimation and breakpoint detection. The statistical test is also evaluated and exhibits a correct rejection rate under the null hypothesis and strong power under various alternatives. Two real datasets are analyzed, the UCI bike sharing and the health disease data, illustrating the method's ability to detect heterogeneity in the distribution of the data.
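For the single-breakpoint case, the max-EM machinery can be contrasted with the classical exhaustive profile search: fit the regression on each candidate split and keep the split minimizing the total residual sum of squares. The code below uses this simple quadratic-cost baseline, not the linear-complexity max-EM algorithm of the paper.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k_true = 200, 120
x = np.linspace(0.0, 1.0, n)
# Two linear regimes with a break after observation k_true.
y = np.where(np.arange(n) < k_true, 1.0 + 2.0 * x, 4.0 - 1.0 * x)
y = y + 0.3 * rng.normal(size=n)

def rss(xs, ys):
    """Residual sum of squares of a simple linear fit."""
    Z = np.column_stack([np.ones(len(xs)), xs])
    coef, *_ = np.linalg.lstsq(Z, ys, rcond=None)
    r = ys - Z @ coef
    return r @ r

# Exhaustive profile search with a minimum segment length of 10.
best_k = min(range(10, n - 10),
             key=lambda k: rss(x[:k], y[:k]) + rss(x[k:], y[k:]))
```

With a clear mean jump relative to the noise level, the profile search pins the breakpoint to within a few observations; the appeal of max-EM is achieving comparable accuracy with linear rather than quadratic work and with multiple breakpoints.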
Title: "Change-point detection in regression models via the max-EM algorithm" (Computational Statistics & Data Analysis, Vol. 215, Article 108278)
Pub Date: 2025-09-22 | DOI: 10.1016/j.csda.2025.108281
Miaomiao Su
Estimating the average treatment effect in large-scale datasets faces significant computational and storage challenges. Subsampling has emerged as a critical strategy to mitigate these issues. This paper proposes a novel subsampling method that builds on the G-estimation method offering the double robustness property. The proposed method uses a small subset of data to estimate computationally complex nuisance parameters, while leveraging the full dataset for the computationally simple final estimation. To ensure that the resulting estimator remains first-order insensitive to variations in nuisance parameters, a projection approach is introduced to optimize the estimation of the outcome regression function and treatment regression function such that the Neyman orthogonality conditions are satisfied. It is shown that the resulting estimator is asymptotically normal and achieves the same convergence rate as the full data-based estimator when either the treatment model or the outcome model is correctly specified. Additionally, when both models are correctly specified, the proposed estimator achieves the same asymptotic variance as the full data-based estimator. The finite sample performance of the proposed method is demonstrated through simulation studies and an application to birth data, comprising over 30 million observations collected over the past eight years. Numerical results indicate that the proposed estimator is nearly as computationally efficient as the uniform subsampling estimator, while achieving similar estimation efficiency to the full data-based G-estimator.
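The two-stage structure, cheap nuisance estimation on a small subsample followed by a simple full-data pass, can be sketched with a doubly robust AIPW-type estimator. The paper builds on G-estimation with a projection calibration; the plain AIPW form below is a simplification standing in for that final estimating step.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 100_000
X = rng.normal(size=(N, 2))
e_true = 1.0 / (1.0 + np.exp(-(0.5 * X[:, 0] - 0.5 * X[:, 1])))
A = (rng.random(N) < e_true).astype(float)      # treatment indicator
tau = 2.0                                       # true average treatment effect
Y = 1.0 + X @ np.array([1.0, -1.0]) + tau * A + rng.normal(size=N)

def fit_logistic(X, y, lr=0.5, n_iter=2000):
    Z = np.column_stack([np.ones(len(X)), X])
    b = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ b)))
        b -= lr * Z.T @ (p - y) / len(y)
    return b

# Stage 1: estimate both nuisance models on a small uniform subsample.
idx = rng.choice(N, size=2000, replace=False)
b_ps = fit_logistic(X[idx], A[idx])             # treatment (propensity) model
Zo = np.column_stack([np.ones(len(idx)), A[idx], X[idx]])
b_out, *_ = np.linalg.lstsq(Zo, Y[idx], rcond=None)   # outcome model

# Stage 2: one cheap doubly robust (AIPW) pass over the full data.
e = 1.0 / (1.0 + np.exp(-(np.column_stack([np.ones(N), X]) @ b_ps)))
mu1 = np.column_stack([np.ones(N), np.ones(N), X]) @ b_out
mu0 = np.column_stack([np.ones(N), np.zeros(N), X]) @ b_out
ate = np.mean(mu1 - mu0 + A * (Y - mu1) / e - (1 - A) * (Y - mu0) / (1 - e))
```

Only the two small regressions touch iterative optimization; the full-data step is a single vectorized average, which is what makes the subsampling design nearly as cheap as uniform subsampling while using every observation.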
{"title":"Fast and efficient causal inference in large-scale data via subsampling and projection calibration","authors":"Miaomiao Su","doi":"10.1016/j.csda.2025.108281","DOIUrl":"10.1016/j.csda.2025.108281","url":null,"abstract":"<div><div>Estimating the average treatment effect in large-scale datasets faces significant computational and storage challenges. Subsampling has emerged as a critical strategy to mitigate these issues. This paper proposes a novel subsampling method that builds on the G-estimation method, which offers the double robustness property. The proposed method uses a small subset of data to estimate computationally complex nuisance parameters, while leveraging the full dataset for the computationally simple final estimation. To ensure that the resulting estimator remains first-order insensitive to variations in nuisance parameters, a projection approach is introduced to optimize the estimation of the outcome regression function and treatment regression function such that the Neyman orthogonality conditions are satisfied. It is shown that the resulting estimator is asymptotically normal and achieves the same convergence rate as the full data-based estimator when either the treatment model or the outcome model is correctly specified. Additionally, when both models are correctly specified, the proposed estimator achieves the same asymptotic variance as the full data-based estimator. The finite sample performance of the proposed method is demonstrated through simulation studies and an application to birth data, comprising over 30 million observations collected over the past eight years. Numerical results indicate that the proposed estimator is nearly as computationally efficient as the uniform subsampling estimator, while achieving similar estimation efficiency to the full data-based G-estimator.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"214 ","pages":"Article 108281"},"PeriodicalIF":1.6,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145158570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
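The two-stage design the abstract describes — costly nuisance estimation on a small subsample, cheap final estimation on the full data — can be sketched as follows. This is a minimal illustration only: it substitutes a plain AIPW (doubly robust) score for the paper's projection-calibrated G-estimation, and the function name and all details are hypothetical, not taken from the paper.

```python
import numpy as np

def subsample_dr_ate(X, A, Y, m, rng=None):
    """Two-stage ATE estimate: fit nuisance models on a uniform
    subsample of size m, then evaluate a doubly robust (AIPW-type)
    score on the full data set. X should include an intercept column."""
    rng = np.random.default_rng(rng)
    n = len(Y)
    idx = rng.choice(n, size=m, replace=False)

    # Stage 1 (costly, subsample only): propensity model by a few
    # Newton-Raphson steps of logistic regression, plus linear outcome
    # regressions fit separately within each treatment arm.
    Xs, As, Ys = X[idx], A[idx], Y[idx]
    b = np.zeros(X.shape[1])
    for _ in range(25):
        p = 1.0 / (1.0 + np.exp(-Xs @ b))
        W = p * (1 - p)
        b += np.linalg.solve(
            Xs.T @ (W[:, None] * Xs) + 1e-8 * np.eye(X.shape[1]),
            Xs.T @ (As - p),
        )
    g1 = np.linalg.lstsq(Xs[As == 1], Ys[As == 1], rcond=None)[0]
    g0 = np.linalg.lstsq(Xs[As == 0], Ys[As == 0], rcond=None)[0]

    # Stage 2 (cheap, full data): plug the nuisance fits into the
    # AIPW score and average over all n observations.
    e = 1.0 / (1.0 + np.exp(-X @ b))
    mu1, mu0 = X @ g1, X @ g0
    psi = (mu1 - mu0
           + A * (Y - mu1) / e
           - (1 - A) * (Y - mu0) / (1 - e))
    return psi.mean()
```

Because stage 2 involves only matrix-vector products over the full data, its cost stays linear in n, while the iterative model fitting is confined to the m-point subsample — the same cost split the abstract claims for the proposed estimator.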
Pub Date : 2025-09-21 DOI: 10.1016/j.csda.2025.108277
Alexandre Wendling, Clovis Galiez
The analysis of binary outcomes and features, such as the effect of vaccination on health, often relies on 2 × 2 contingency tables. However, confounding factors such as age or gender call for stratified analysis based on sub-tables, which is common in bioscience, epidemiological, and social research, as well as in meta-analyses. Traditional methods for testing associations across strata, such as the Cochran-Mantel-Haenszel (CMH) test, struggle with small sample sizes and heterogeneity of effects between strata. Exact tests can address these issues but are computationally expensive. To address these challenges, the Gamma Approximation of Stratified Truncated Exact (GASTE) test is proposed. It approximates the exact statistic of the combination of p-values with discrete support, leveraging the gamma distribution to approximate the distribution of the test statistic under stratification, providing fast and accurate p-value calculations even when effects vary between strata. The GASTE method maintains high statistical power and low type I error rates, outperforming traditional methods by offering more sensitive and reliable detection. It is computationally efficient and broadens the applicability of exact tests in research fields with stratified binary data. The GASTE method is demonstrated through two applications: an ecological study of Alpine plant associations and a 1973 case study on admissions at the University of California, Berkeley. In both applications, it offers substantial improvements over traditional approaches. GASTE is available as an open-source package at https://github.com/AlexandreWen/gaste, and a Python package is available on PyPI at https://pypi.org/project/gaste-test/
{"title":"Gamma approximation of stratified truncated exact test (GASTE-test) & application","authors":"Alexandre Wendling, Clovis Galiez","doi":"10.1016/j.csda.2025.108277","DOIUrl":"10.1016/j.csda.2025.108277","url":null,"abstract":"<div><div>The analysis of binary outcomes and features, such as the effect of vaccination on health, often relies on 2 <span><math><mo>×</mo></math></span> 2 contingency tables. However, confounding factors such as age or gender call for stratified analysis based on sub-tables, which is common in bioscience, epidemiological, and social research, as well as in meta-analyses. Traditional methods for testing associations across strata, such as the Cochran-Mantel-Haenszel (CMH) test, struggle with small sample sizes and heterogeneity of effects between strata. Exact tests can address these issues, but are computationally expensive. To address these challenges, the Gamma Approximation of Stratified Truncated Exact (GASTE) test is proposed. It approximates the exact statistic of the combination of p-values with discrete support, leveraging the gamma distribution to approximate the distribution of the test statistic under stratification, providing fast and accurate p-value calculations, even when effects vary between strata. The GASTE method maintains high statistical power and low type I error rates, outperforming traditional methods by offering more sensitive and reliable detection. It is computationally efficient and broadens the applicability of exact tests in research fields with stratified binary data. The GASTE method is demonstrated through two applications: an ecological study of Alpine plant associations and a 1973 case study on admissions at the University of California, Berkeley. The GASTE method offers substantial improvements over traditional approaches. The GASTE method is available as an open-source package at <span><span>https://github.com/AlexandreWen/gaste</span><svg><path></path></svg></span>. A Python package is available on PyPI at <span><span>https://pypi.org/project/gaste-test/</span><svg><path></path></svg></span></div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"214 ","pages":"Article 108277"},"PeriodicalIF":1.6,"publicationDate":"2025-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145221243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
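The core computation the abstract describes — combining per-stratum exact p-values and approximating the discrete-support null distribution of the combined statistic with a moment-matched gamma distribution — can be sketched as follows. This is an illustrative reconstruction, not the authors' gaste package (linked above): it covers only the one-sided Fisher-combination case without truncation, and all function names are hypothetical.

```python
import numpy as np
from scipy.stats import hypergeom, gamma

def stratum_support(n1, n0, t):
    """Null distribution of the one-sided exact p-value for a 2x2 table
    with row totals n1 (treated), n0 (control) and t total successes:
    returns the attainable p-values and their hypergeometric masses."""
    lo, hi = max(0, t - n0), min(n1, t)
    ks = np.arange(lo, hi + 1)
    pmf = hypergeom.pmf(ks, n1 + n0, t, n1)
    pvals = hypergeom.sf(ks - 1, n1 + n0, t, n1)  # P(K >= k)
    return pvals, pmf

def gaste_like_pvalue(tables):
    """Gamma moment-matched approximation to the exact distribution of
    T = -2 * sum log p_k across independent strata, respecting the
    discrete support of each stratum's p-value."""
    T = 0.0
    mean = var = 0.0
    for (a, b, c, d) in tables:   # table [[a, b], [c, d]]
        n1, n0, t = a + b, c + d, a + c
        pvals, pmf = stratum_support(n1, n0, t)
        obs = hypergeom.sf(a - 1, n1 + n0, t, n1)  # observed p-value
        x = -2.0 * np.log(pvals)
        mean += np.sum(pmf * x)                    # exact E[-2 log P]
        var += np.sum(pmf * x**2) - np.sum(pmf * x)**2
        T += -2.0 * np.log(obs)
    # Match a gamma distribution to the exact mean and variance of T.
    shape, scale = mean**2 / var, var / mean
    return gamma.sf(T, a=shape, scale=scale)
```

Under continuous uniform p-values this reduces to the usual chi-square calibration of Fisher's method (mean 2, variance 4 per stratum); the moment matching over the discrete supports is what keeps the approximation accurate for small strata.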