Computational Statistics & Data Analysis: Latest Articles

Transfer learning for high dimensional data with discrete responses
IF 1.6 | Zone 3, Mathematics | Q3 | COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-10-16 | DOI: 10.1016/j.csda.2025.108292
Zejing Zheng, Shengbing Zheng, Junlong Zhao
Discrete responses are frequently encountered in applications, particularly in classification problems. However, the high cost of collecting responses or labels often leads to a scarcity of samples, which significantly diminishes the accuracy of statistical inferences, particularly in high-dimensional settings. To address this limitation, transfer learning can be utilized for high-dimensional data with discrete responses by incorporating relevant source data into the target study of interest. Within the framework of generalized linear models, the case where responses are bounded is first considered, and an importance-weighted transfer learning method, referred to as IWTL-DR, is proposed. This method selects data at the individual level, thereby utilizing the source data more efficiently. Subsequently, this approach is extended to scenarios involving unbounded responses. Theoretical properties of the IWTL-DR method are established and compared with existing techniques. Extensive simulations and analyses of real data show the advantages of our approach.
Computational Statistics & Data Analysis, Volume 215, Article 108292
Citations: 0
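To make the importance-weighting idea in the abstract above concrete, here is a minimal sketch of a per-observation weighted logistic negative log-likelihood evaluated over pooled target and source rows. The function name and the weights are hypothetical inputs chosen for illustration; the abstract does not say how IWTL-DR computes its weights, so this is not the authors' procedure.

```python
import math

def weighted_logistic_nll(beta, X, y, w):
    """Weighted negative log-likelihood of a logistic GLM.

    beta: list of coefficients; X: list of feature rows; y: 0/1 labels;
    w: nonnegative per-observation weights (hypothetical importance weights,
    e.g. 1 for target rows and a value in [0, 1] for each source row).
    Assumes sum(w) > 0.
    """
    total = 0.0
    for xi, yi, wi in zip(X, y, w):
        eta = sum(b * x for b, x in zip(beta, xi))
        # Numerically stable evaluation of log(1 + exp(eta)).
        log1pexp = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        total += wi * (log1pexp - yi * eta)
    return total / sum(w)
```

Setting a source row's weight to 0 removes it from the fit entirely, which is the sense in which weighting at the individual level can select data row by row rather than dataset by dataset.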
Bilateral matrix spatiotemporal autoregressive model
IF 1.6 | Zone 3, Mathematics | Q3 | COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-10-15 | DOI: 10.1016/j.csda.2025.108291
Lei Qin, Xiaomei Zhang, Yingqiu Zhu, Yang Chen, Ben-Chang Shia
As time series with matrix structures become increasingly common in finance, economics, and management, modeling matrix-valued time series has become an emerging research hotspot. Spatial effects induced by different locations play an important role in the analysis of such time series. Although the matrix autoregressive (MAR) model provides a promising solution for modeling matrix-valued time series, it models only the dynamic effects in the temporal dimension and does not capture spatial effects. In this paper, we propose a bilateral matrix spatiotemporal autoregressive (BMSAR) model, which fully considers pure spatial effects, pure dynamic effects, and time-delay spatial effects while maintaining and utilizing the matrix structure. To address the endogeneity problem, BMSAR is estimated iteratively using the least squares method and the Yule-Walker equation. Simulation results show that, compared with MAR, the BMSAR model effectively reflects the impact of spatial structure on the sequence observations. The proposed estimator for BMSAR is consistent and achieves promising performance when the sample size is relatively large. The proposed model and algorithm are also validated on trade and macroeconomic indicator datasets for the seven G7 countries, and prediction accuracy is significantly improved compared with existing models.
Computational Statistics & Data Analysis, Volume 215, Article 108291
Citations: 0
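For readers unfamiliar with the MAR baseline mentioned in the abstract above, the first-order matrix autoregression is commonly written as X_t = A X_{t-1} B^T + E_t, where A acts on rows and B on columns. The sketch below computes only the conditional mean A X_{t-1} B^T in plain Python. The function names are illustrative, and the additional spatial terms of BMSAR are not reproduced because the abstract does not give their form.

```python
def matmul(P, Q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))]
            for i in range(len(P))]

def transpose(M):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def mar1_mean(A, X_prev, B):
    """Conditional mean of X_t given X_{t-1} under MAR(1): A X_{t-1} B^T."""
    return matmul(matmul(A, X_prev), transpose(B))
```

With A and B both equal to the identity, the conditional mean is X_{t-1} itself. Scaling a row of A rescales the matching row of the prediction, and scaling a row of B rescales the matching column.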
Denoising over networks with applications to partially observed epidemics
IF 1.6 | Zone 3, Mathematics | Q3 | COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-10-10 | DOI: 10.1016/j.csda.2025.108276
Claire Donnat, Olga Klopp, Nicolas Verzelen
A novel method is introduced for denoising partially observed signals over networks using graph total variation (TV) regularization, a technique adapted from signal processing to handle binary data. This approach extends existing results derived for Gaussian data to the discrete, binary case — a method hereafter referred to as “one-bit TV denoising.” The framework considers a network represented as a set of nodes with binary observations, where edges encode pairwise relationships between nodes. A key theoretical contribution is the establishment of consistency guarantees of graph TV denoising for the recovery of underlying node-level probabilities. The method is well suited for settings with missing data, enabling robust inference from incomplete observations. Extensive numerical experiments and real-world applications further highlight its effectiveness, underscoring its potential in various practical scenarios that require denoising and prediction on networks with binary-valued data. Finally, applications to two real-world epidemic scenarios demonstrate that one-bit total variation denoising significantly enhances the accuracy of network-based nowcasting and forecasting.
Computational Statistics & Data Analysis, Volume 215, Article 108276
Citations: 0
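The graph total variation penalty named in the abstract above is the sum of absolute differences of a node signal across edges. The sketch below computes it and combines it with a squared-error fit on the observed nodes only. The squared-error fit term, the function names, and the dictionary layout for partial observations are assumptions made for illustration; the abstract does not state the exact loss used for one-bit TV denoising.

```python
def graph_total_variation(values, edges):
    """Sum over edges (i, j) of |values[i] - values[j]| (unweighted graph)."""
    return sum(abs(values[i] - values[j]) for i, j in edges)

def tv_objective(p, y_observed, edges, lam):
    """Illustrative penalized objective for partially observed binary data.

    p: candidate node-level probabilities (list indexed by node);
    y_observed: dict mapping observed node index -> 0/1 observation;
    edges: list of (i, j) pairs; lam: penalty strength.
    """
    fit = sum((y - p[i]) ** 2 for i, y in y_observed.items())
    return fit + lam * graph_total_variation(p, edges)
```

Nodes missing from `y_observed` contribute nothing to the fit term, so their estimates are driven entirely by their neighbours through the penalty.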
Measuring multivariate regression association via spatial sign
IF 1.6 | Zone 3, Mathematics | Q3 | COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-10-03 | DOI: 10.1016/j.csda.2025.108288
Jia-Han Shih, Yi-Hau Chen
A regression association measure is proposed for capturing predictability of a multivariate outcome $Y=(Y_1,\ldots,Y_d)$ from a multivariate covariate $X=(X_1,\ldots,X_p)$. Motivated by existing measures, the conventional Kendall’s tau is first generalized to measure multivariate association between two random vectors. Then the predictability of $Y$ from $X$ is measured by the generalized multivariate Kendall’s tau between $Y$ and $Y'$, where $Y$ and $Y'$ share the same conditional distribution and are conditionally independent given $X$. The proposed regression association measure can be expressed as the proportion of the variance of a function of $Y$ that can be explained by $X$, indicating that the measure has a direct interpretation in terms of predictability. Based on the proposed measure, a conditional regression association measure is further proposed, which can be utilized to perform variable selection. Since the proposed measures are based on $Y$ and $Y'$, a simple nonparametric estimation method based on nearest neighbors is available. An R package, MRAM, has been developed for implementation. Simulation studies are carried out to assess the performance of the proposed methods and real data examples are analyzed for illustration.
Computational Statistics & Data Analysis, Volume 215, Article 108288
Citations: 0
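One natural way to carry Kendall's tau over to vectors, in the spirit of the spatial-sign idea in the abstract above, is to replace the sign of a scalar difference with the unit vector of a vector difference and average the inner products over pairs. The sketch below does this for two samples of equal dimension. It is an assumption-laden illustration: it may differ from the paper's exact definition, and it does not reflect the interface of the MRAM package.

```python
import math
from itertools import combinations

def spatial_sign(v):
    """Unit vector in the direction of v (zero vector maps to zero)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0 else [0.0] * len(v)

def spatial_sign_tau(U, V):
    """Average inner product of spatial signs of pairwise differences.

    U, V: lists of n vectors, all of the same dimension.
    In dimension 1 this equals Kendall's tau (tau-a).
    """
    pairs = list(combinations(range(len(U)), 2))
    total = 0.0
    for i, j in pairs:
        su = spatial_sign([a - b for a, b in zip(U[i], U[j])])
        sv = spatial_sign([a - b for a, b in zip(V[i], V[j])])
        total += sum(a * b for a, b in zip(su, sv))
    return total / len(pairs)
```

In one dimension the spatial sign is the ordinary sign, so perfectly concordant samples give 1 and perfectly discordant samples give -1.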
Fast autoregressive model for multivariate dependent outcomes with application to lipidomics analysis for Alzheimer’s disease and APOE-ε4
IF 1.6 | Zone 3, Mathematics | Q3 | COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-09-29 | DOI: 10.1016/j.csda.2025.108280
Hwiyoung Lee, Zhenyao Ye, Chixiang Chen, Peter Kochunov, L. Elliot Hong, Shuo Chen
Association analysis of multivariate omics outcomes is challenging due to the high dimensionality and inter-correlation among outcome variables. In practice, the classic multi-univariate analysis approaches are commonly employed, utilizing linear regression models for each individual outcome followed by adjustments for multiplicity through control of the false discovery rate (FDR) or family-wise error rate (FWER). While straightforward, these multi-univariate methods overlook dependencies between outcome variables. This oversight leads to less accurate statistical inferences, characterized by lower power and an increased false discovery rate, ultimately resulting in reduced replicability across studies. Recently, advanced frequentist and Bayesian methods have been developed to account for these dependencies. However, these methods often pose significant computational challenges for researchers in the field. To bridge this gap, a computationally efficient autoregressive multivariate regression model is proposed that explicitly accounts for the dependence structure among outcome variables. Through extensive simulations, it is demonstrated that the approach provides more accurate multivariate inferences than traditional methods and remains robust even under model misspecification. Additionally, the proposed method is applied to investigate whether the associations between serum lipidomics outcomes and Alzheimer’s disease differ between ε4 allele carriers and non-carriers of the apolipoprotein E (APOE) gene.
Computational Statistics & Data Analysis, Volume 215, Article 108280
Citations: 0
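The multi-univariate baseline described in the abstract above ends with a multiplicity adjustment. As a reference point, the following sketch implements the standard Benjamini-Hochberg step-up procedure for FDR control on a list of per-outcome p-values. It illustrates the conventional baseline only, not the proposed autoregressive model, and the function name is illustrative.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, True where the hypothesis is rejected
    at FDR level alpha.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k such that p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            k_max = rank
    # Reject every hypothesis whose p-value is among the k_max smallest.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected
```

Because each outcome is tested separately before the adjustment, this baseline ignores correlation between outcomes, which is the limitation the abstract points to.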
Bootstrap-based goodness-of-fit test for parametric families of conditional distributions
IF 1.6 | Zone 3, Mathematics | Q3 | COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-09-27 | DOI: 10.1016/j.csda.2025.108289
Gitte Kremling, Gerhard Dikta
A consistent goodness-of-fit test for distributional regression is introduced. The test statistic is based on a process that traces the difference between a nonparametric and a semi-parametric estimate of the marginal distribution function of Y. As its asymptotic null distribution is not distribution-free, a parametric bootstrap method is used to determine critical values. Empirical results suggest that, in certain scenarios, the test outperforms existing specification tests by achieving a higher power and thereby offering greater sensitivity to deviations from the assumed parametric distribution family. Notably, the proposed test does not involve any hyperparameters and can easily be applied to individual datasets using the gofreg-package in R.
Computational Statistics & Data Analysis, Volume 215, Article 108289
Citations: 0
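The abstract above relies on a parametric bootstrap because the null distribution of the test statistic depends on the fitted model. The sketch below shows the general mechanism in a much simpler setting: a Kolmogorov-Smirnov-type statistic for an exponential family with estimated rate, with the critical value obtained by simulating from the fitted model. The choice of family, statistic, and function names are assumptions for illustration; this is neither the paper's conditional-distribution test nor the interface of the gofreg package.

```python
import math
import random

def ks_statistic_exponential(sample):
    """KS distance between the empirical CDF and Exp(rate_hat), rate_hat = 1/mean."""
    n = len(sample)
    rate = n / sum(sample)
    d = 0.0
    for i, x in enumerate(sorted(sample)):
        cdf = 1.0 - math.exp(-rate * x)
        # Compare the fitted CDF with the empirical CDF just before and at x.
        d = max(d, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return d

def bootstrap_critical_value(sample, n_boot=500, level=0.05, seed=0):
    """Upper critical value of the statistic under the fitted exponential model."""
    rng = random.Random(seed)
    n = len(sample)
    rate = n / sum(sample)
    stats = []
    for _ in range(n_boot):
        # Simulate from the fitted model and re-estimate inside the statistic.
        resample = [rng.expovariate(rate) for _ in range(n)]
        stats.append(ks_statistic_exponential(resample))
    stats.sort()
    return stats[min(n_boot - 1, math.ceil((1 - level) * n_boot) - 1)]
```

The null hypothesis is rejected when the statistic computed on the observed data exceeds the bootstrap critical value. Re-estimating the rate on every simulated sample is what lets the bootstrap account for parameter estimation.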
Resampling NANCOVA: Nonparametric analysis of covariance in small samples
IF 1.6 | Zone 3, Mathematics | Q3 | COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-09-26 | DOI: 10.1016/j.csda.2025.108290
Konstantin Emil Thiel, Paavo Sattler, Arne C. Bathke, Georg Zimmermann
Analysis of covariance is a crucial method for improving the precision of statistical tests for factor effects in randomized experiments. However, existing solutions suffer from one or more of the following limitations: (i) they are not suitable for ordinal data (as endpoints or explanatory variables); (ii) they require semiparametric model assumptions; (iii) they are inapplicable to small data scenarios due to often poor type-I error control; or (iv) they provide only approximate testing procedures, and (asymptotically) exact tests are missing. A resampling approach to the NANCOVA framework is investigated. NANCOVA is a fully nonparametric model based on relative effects that allows for an arbitrary number of covariates and groups, where both outcome variable (endpoint) and covariates can be metric or ordinal. Novel NANCOVA tests and a nonparametric competitor test without covariate adjustment were evaluated in extensive simulations. Unlike approximate tests in the NANCOVA framework, the proposed resampling version showed good performance in small sample scenarios and maintained the nominal type-I error well. Resampling NANCOVA also provided consistently high power: up to 26% higher than the test without covariate adjustment in a small sample scenario with four groups and two covariates. Moreover, it is shown that resampling NANCOVA provides an asymptotically exact testing procedure, which makes it the first one with good finite sample performance in the present NANCOVA framework. In summary, resampling NANCOVA can be considered a viable tool for analysis of covariance that overcomes issues (i) to (iv).
Computational Statistics & Data Analysis, Volume 215, Article 108290
Citations: 0
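The relative effects on which the abstract above builds are usually defined, for two samples, as p = P(X < Y) + 0.5 P(X = Y), which stays meaningful for ordinal data with ties. The sketch below computes the empirical two-sample version. It is a simplified illustration: multi-group frameworks typically define each group's effect against a reference distribution, and no covariate adjustment is shown here.

```python
def relative_effect(x, y):
    """Empirical estimate of p = P(X < Y) + 0.5 * P(X = Y).

    x, y: samples of comparable (metric or ordinal, numerically coded) values.
    """
    total = 0.0
    for xi in x:
        for yj in y:
            if xi < yj:
                total += 1.0
            elif xi == yj:
                total += 0.5  # ties count as half
    return total / (len(x) * len(y))
```

A value of 0.5 means neither sample tends to be larger. Only the ordering of the observations enters the estimate, which is why the quantity suits ordinal endpoints.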
Change-point detection in regression models via the max-EM algorithm
IF 1.6 | Zone 3, Mathematics | Q3 | COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-09-24 | DOI: 10.1016/j.csda.2025.108278
Modibo Diabaté, Grégory Nuel, Olivier Bouaziz
The problem of breakpoint detection is considered within a regression modeling framework. A novel method, the max-EM algorithm, is introduced, combining a constrained Hidden Markov Model with the Classification-EM algorithm. This algorithm has linear complexity and provides accurate detection of breakpoints and estimation of parameters. A theoretical result is derived, showing that the likelihood of the data, as a function of the regression parameters and the breakpoint locations, increases at each step of the algorithm. Two initialization methods for the breakpoint locations are also presented to address local maxima issues. Finally, a statistical test for the single-breakpoint situation is developed. Simulation experiments based on linear, logistic, Poisson and Accelerated Failure Time regression models show that the final method, which includes the initialization procedure and the max-EM algorithm, performs strongly in terms of both parameter estimation and breakpoint detection. The statistical test is also evaluated and exhibits a correct rejection rate under the null hypothesis and strong power under various alternatives. Two real datasets are analyzed, the UCI bike sharing and the health disease data, illustrating the value of the method for detecting heterogeneity in the distribution of the data.
断点检测问题是在回归建模框架内考虑的。将约束隐马尔可夫模型与分类- em算法相结合,提出了一种新的分类- em算法。该算法具有线性复杂度,并提供了准确的断点检测和参数估计。理论结果表明,作为回归参数和断点位置的函数,数据的似然在算法的每一步都增加。为解决局部最大值问题,还提出了断点位置的两种初始化方法。最后,给出了单断点情况下的统计检验方法。基于线性、逻辑、泊松和加速失效时间回归模型的仿真实验表明,包含初始化过程和max-EM算法的最终方法在参数估计和断点检测方面都具有较强的性能。对统计检验也进行了评估,并在零假设下显示出正确的拒绝率,在各种替代方案下显示出强大的力量。分析了两个真实数据集,UCI自行车共享数据和健康疾病数据,其中说明了该方法检测数据分布异质性的兴趣。
Modibo Diabaté, Grégory Nuel, Olivier Bouaziz. "Change-point detection in regression models via the max-EM algorithm." Computational Statistics & Data Analysis, vol. 215, Article 108278. Published 2025-09-24. DOI: 10.1016/j.csda.2025.108278
Citations: 0
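For the change-point entry above, the abstract describes an alternation whose likelihood never decreases. The sketch below illustrates that general idea only, under loud assumptions: Gaussian linear regression with a fixed number of segments, an equal-width initialization, and a left-to-right dynamic program standing in for the constrained Hidden Markov Model step. It is not the authors' max-EM implementation, and all function names are hypothetical.

```python
import numpy as np

def fit_segments(x, y, bounds):
    """Least-squares (intercept, slope) for each segment [bounds[k], bounds[k+1])."""
    params = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        X = np.column_stack([np.ones(b - a), x[a:b]])
        beta, *_ = np.linalg.lstsq(X, y[a:b], rcond=None)
        params.append(beta)
    return np.array(params)

def best_segmentation(x, y, params):
    """Best ordered segmentation for fixed parameters.

    Viterbi-style recursion on a left-to-right chain: at each observation
    the segment index either stays the same or increases by one.
    """
    n, K = len(y), len(params)
    cost = (y[:, None] - (params[:, 0] + np.outer(x, params[:, 1]))) ** 2
    dp = np.full((n, K), np.inf)
    back = np.zeros((n, K), dtype=int)
    dp[0, 0] = cost[0, 0]
    for i in range(1, n):
        for k in range(K):
            stay = dp[i - 1, k]
            move = dp[i - 1, k - 1] if k > 0 else np.inf
            if move < stay:
                dp[i, k], back[i, k] = move + cost[i, k], k - 1
            else:
                dp[i, k], back[i, k] = stay + cost[i, k], k
    states = np.empty(n, dtype=int)
    states[-1] = K - 1  # the path must end in the last segment
    for i in range(n - 1, 0, -1):
        states[i - 1] = back[i, states[i]]
    return [0] + [i for i in range(1, n) if states[i] != states[i - 1]] + [n]

def hard_em_breakpoints(x, y, n_segments, n_iter=20):
    """Alternate parameter refits and re-segmentation until the bounds stop moving."""
    n = len(y)
    bounds = [round(j * n / n_segments) for j in range(n_segments + 1)]
    for _ in range(n_iter):
        params = fit_segments(x, y, bounds)
        new_bounds = best_segmentation(x, y, params)
        if new_bounds == bounds:
            break
        bounds = new_bounds
    return bounds[1:-1], fit_segments(x, y, bounds)

# Simulated toy data (not from the paper): one break at index 80.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 120)
y = np.where(np.arange(120) < 80, 1 + 2 * x, 5 - x) + rng.normal(0, 0.1, 120)
breaks, params = hard_em_breakpoints(x, y, n_segments=2)
print(breaks)
```

Each half-step minimizes the total squared error with the other block held fixed, so the Gaussian likelihood (for a fixed variance) cannot decrease. Each pass costs O(nK) for n observations and K segments. Like any such alternation, the result depends on the starting segmentation, which is the local-maximum issue the abstract's initialization methods are meant to address.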
Fast and efficient causal inference in large-scale data via subsampling and projection calibration
IF 1.6 Zone 3 (Mathematics) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-09-22 DOI: 10.1016/j.csda.2025.108281
Miaomiao Su
Estimating the average treatment effect in large-scale datasets faces significant computational and storage challenges. Subsampling has emerged as a critical strategy to mitigate these issues. This paper proposes a novel subsampling method that builds on the G-estimation method, which offers the double robustness property. The proposed method uses a small subset of the data to estimate computationally complex nuisance parameters, while leveraging the full dataset for the computationally simple final estimation. To ensure that the resulting estimator remains first-order insensitive to variations in the nuisance parameters, a projection approach is introduced to optimize the estimation of the outcome regression function and treatment regression function such that the Neyman orthogonality conditions are satisfied. It is shown that the resulting estimator is asymptotically normal and achieves the same convergence rate as the full data-based estimator when either the treatment model or the outcome model is correctly specified. Additionally, when both models are correctly specified, the proposed estimator achieves the same asymptotic variance as the full data-based estimator. The finite sample performance of the proposed method is demonstrated through simulation studies and an application to birth data comprising over 30 million observations collected over the past eight years. Numerical results indicate that the proposed estimator is nearly as computationally efficient as the uniform subsampling estimator, while achieving similar estimation efficiency to the full data-based G-estimator.
Miaomiao Su. "Fast and efficient causal inference in large-scale data via subsampling and projection calibration." Computational Statistics & Data Analysis, vol. 214, Article 108281. Published 2025-09-22. DOI: 10.1016/j.csda.2025.108281
Citations: 0
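The division of labor described in the subsampling entry above (nuisance models on a small subsample, final averaging on the full data) can be sketched as follows. As a stand-in, the sketch uses the familiar augmented inverse probability weighting (AIPW) estimator, which is a different doubly robust estimator from the paper's G-estimation-based one, and it omits the projection calibration step entirely. The function names, the simulated data, and the choice of a logistic propensity model with linear outcome models are all assumptions made for illustration.

```python
import numpy as np

def fit_logistic(X, t, n_iter=25):
    """Newton-Raphson logistic regression; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        hessian = X.T @ (X * w[:, None]) + 1e-8 * np.eye(X.shape[1])
        beta += np.linalg.solve(hessian, X.T @ (t - p))
    return beta

def aipw_subsampled(X, t, y, n_sub, rng):
    """AIPW estimate of the average treatment effect.

    Nuisance models are fitted on a uniform subsample of size n_sub;
    the final average is taken over all rows.
    """
    idx = rng.choice(len(y), size=n_sub, replace=False)
    Xs, ts, ys = X[idx], t[idx], y[idx]
    gamma = fit_logistic(Xs, ts)                                    # propensity
    b1, *_ = np.linalg.lstsq(Xs[ts == 1], ys[ts == 1], rcond=None)  # treated outcome
    b0, *_ = np.linalg.lstsq(Xs[ts == 0], ys[ts == 0], rcond=None)  # control outcome
    e = 1.0 / (1.0 + np.exp(-X @ gamma))
    m1, m0 = X @ b1, X @ b0
    psi = m1 - m0 + t * (y - m1) / e - (1.0 - t) * (y - m0) / (1.0 - e)
    return psi.mean()

# Simulated toy data (not the birth data from the paper); the true effect is 2.
rng = np.random.default_rng(1)
n = 50_000
Z = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), Z])
propensity = 1.0 / (1.0 + np.exp(-(0.5 * Z[:, 0] - 0.5 * Z[:, 1])))
t = rng.binomial(1, propensity).astype(float)
y = 1.0 + Z[:, 0] + Z[:, 1] + 2.0 * t + rng.normal(size=n)
ate_hat = aipw_subsampled(X, t, y, n_sub=2_000, rng=rng)
print(ate_hat)
```

The expensive steps (iterative model fitting) touch only `n_sub` rows, while the full-data step is a single pass of matrix-vector products and an average.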
Gamma approximation of stratified truncated exact test (GASTE-test) & application
IF 1.6 Zone 3 (Mathematics) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-09-21 DOI: 10.1016/j.csda.2025.108277
Alexandre Wendling, Clovis Galiez
The analysis of binary outcomes and features, such as the effect of vaccination on health, often relies on 2 × 2 contingency tables. However, confounding factors such as age or gender call for stratified analysis, in which the data are split into sub-tables; this is common in bioscience, epidemiological, and social research, as well as in meta-analyses. Traditional methods for testing associations across strata, such as the Cochran-Mantel-Haenszel (CMH) test, struggle with small sample sizes and heterogeneity of effects between strata. Exact tests can address these issues, but are computationally expensive. To address these challenges, the Gamma Approximation of Stratified Truncated Exact (GASTE) test is proposed. It approximates the exact statistic of the combination of p-values with discrete support, leveraging the gamma distribution to approximate the distribution of the test statistic under stratification, providing fast and accurate p-value calculations, even when effects vary between strata. The GASTE method maintains high statistical power and low type I error rates, outperforming traditional methods by offering more sensitive and reliable detection. It is computationally efficient and broadens the applicability of exact tests in research fields with stratified binary data. The GASTE method is demonstrated through two applications: an ecological study of Alpine plant associations and a 1973 case study on admissions at the University of California, Berkeley. The GASTE method offers substantial improvements over traditional approaches. The GASTE method is available as an open-source package at https://github.com/AlexandreWen/gaste. A Python package is available on PyPI at https://pypi.org/project/gaste-test/
Alexandre Wendling, Clovis Galiez. "Gamma approximation of stratified truncated exact test (GASTE-test) & application." Computational Statistics & Data Analysis, vol. 214, Article 108277. Published 2025-09-21. DOI: 10.1016/j.csda.2025.108277
Citations: 0
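To make the idea in the GASTE entry above concrete, the following sketch combines one-sided Fisher exact p-values across strata and approximates the null distribution of the combined statistic with a gamma distribution. The moment-matching step is an assumption made here for illustration, and truncation is left out; the sketch does not use the `gaste-test` package, whose actual approximation may differ. Function names and the toy tables are hypothetical.

```python
import numpy as np
from scipy.stats import gamma, hypergeom

def stratum_terms(table):
    """Observed -log(p) for one 2x2 table, plus its exact null mean and variance.

    The null is the hypergeometric distribution of the top-left cell given
    fixed margins; p is the one-sided (over-association) exact p-value.
    """
    (a, b), (c, d) = table
    total, col1, row1 = a + b + c + d, a + c, a + b
    support = np.arange(max(0, row1 + col1 - total), min(row1, col1) + 1)
    pmf = hypergeom.pmf(support, total, col1, row1)
    stat = -np.log(hypergeom.sf(support - 1, total, col1, row1))
    observed = -np.log(hypergeom.sf(a - 1, total, col1, row1))
    mean = np.sum(pmf * stat)
    var = np.sum(pmf * (stat - mean) ** 2)
    return observed, mean, var

def gamma_combined_pvalue(tables):
    """Combined p-value from a moment-matched gamma null for sum of -log(p_k).

    Assumes at least one stratum has a non-degenerate support (var > 0).
    """
    obs, mean, var = map(sum, zip(*(stratum_terms(tab) for tab in tables)))
    shape, scale = mean ** 2 / var, var / mean
    return gamma.sf(obs, a=shape, scale=scale)

# Toy counts invented for illustration (not data from the paper).
associated = [[[8, 2], [3, 7]], [[6, 4], [2, 8]], [[5, 5], [4, 6]]]
balanced = [[[5, 5], [5, 5]]] * 3
p_assoc = gamma_combined_pvalue(associated)
p_null = gamma_combined_pvalue(balanced)
print(p_assoc, p_null)
```

For continuous uniform p-values, each -log(p) is Exp(1) and the sum over K strata is exactly Gamma(K, 1), which is Fisher's combination method. With small 2 × 2 tables the p-values are discrete, so the sketch instead derives the gamma parameters from the exact per-stratum null moments.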