
Computational Statistics & Data Analysis: Latest Articles

A goodness-of-fit test for geometric Brownian motion
IF 1.5 | Zone 3 (Mathematics) | Q3 Computer Science, Interdisciplinary Applications | Pub Date: 2025-04-23 | DOI: 10.1016/j.csda.2025.108196
Daniel Gaigall, Philipp Wübbolding
A new goodness-of-fit test for the composite null hypothesis that data originate from a geometric Brownian motion is studied in the functional data setting. This is equivalent to testing if the data are from a scaled Brownian motion with linear drift. Critical values for the test are obtained, ensuring that the specified significance level is achieved in finite samples. The asymptotic behavior of the test statistic under the null distribution and alternatives is studied, and it is also demonstrated that the test is consistent. Furthermore, the proposed approach offers advantages in terms of fast and simple implementation. A comprehensive simulation study shows that the power of the new test compares favorably to that of existing methods. A key application is the assessment of financial time series for the suitability of the Black-Scholes model. Examples relating to various stock and interest rate time series are presented in order to illustrate the proposed test.
Computational Statistics & Data Analysis, Volume 210, Article 108196.
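The equivalence stated in the geometric Brownian motion abstract above (a GBM is, after a log transform, a scaled Brownian motion with linear drift) can be illustrated numerically. The following is a minimal sketch under assumed parameter values; the function name `simulate_gbm` and all numbers are illustrative, and this is not the authors' test statistic.

```python
import numpy as np

def simulate_gbm(s0, mu, sigma, T, n_steps, rng):
    """Simulate S_t = s0 * exp((mu - sigma^2/2) t + sigma W_t) on a regular grid."""
    dt = T / n_steps
    # Brownian increments are independent N(0, dt)
    dw = rng.normal(0.0, np.sqrt(dt), size=n_steps)
    w = np.concatenate(([0.0], np.cumsum(dw)))
    t = np.linspace(0.0, T, n_steps + 1)
    return t, s0 * np.exp((mu - 0.5 * sigma**2) * t + sigma * w)

rng = np.random.default_rng(0)
t, s = simulate_gbm(s0=100.0, mu=0.05, sigma=0.2, T=1.0, n_steps=100_000, rng=rng)

# log S_t = log s0 + (mu - sigma^2/2) t + sigma W_t, so the log increments
# are i.i.d. normal with mean (mu - sigma^2/2) dt and variance sigma^2 dt.
log_increments = np.diff(np.log(s))
dt = t[1] - t[0]
drift_hat = log_increments.mean() / dt                 # estimates mu - sigma^2/2 (noisy)
sigma_hat = log_increments.std(ddof=1) / np.sqrt(dt)   # estimates sigma
```

On a single path the scale `sigma_hat` is recovered accurately, while `drift_hat` stays noisy because it depends only on the endpoints of the log path.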
A simultaneous confidence-bounded true discovery proportion perspective on localizing differences in smooth terms in regression models
IF 1.5 | Zone 3 (Mathematics) | Q3 Computer Science, Interdisciplinary Applications | Pub Date: 2025-04-23 | DOI: 10.1016/j.csda.2025.108197
David Swanson
A method is demonstrated for localizing where two spline terms, or smooths, differ using a true discovery proportion (TDP)-based interpretation. The procedure yields a statement on the proportion of some region where true differences exist between two smooths. The methodology avoids ad hoc approaches to making such statements, like subsetting the data and performing hypothesis tests on the truncated spline terms. TDP estimates are 1-α confidence-bounded simultaneously, which means that a region's TDP estimate is a lower bound on the proportion of actual differences, or true discoveries, in that region, with high confidence regardless of the number of estimates made. The procedure is based on closed testing using the Simes local test. This local test requires that the multivariate χ² test statistics of generalized Wishart type underlying the method be positive regression dependent on subsets (PRDS), a result for which evidence is presented suggesting that the condition holds. Consistency of the procedure is demonstrated for generalized additive models with the tuning parameter chosen by REML or GCV, and the achievement of confidence-bounded TDP is shown in simulation and in an analysis of walking gait.
Computational Statistics & Data Analysis, Volume 211, Article 108197.
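The Simes local test named in the abstract above combines m p-values into a single p-value for their intersection hypothesis, min over i of m·p₍ᵢ₎/i, where p₍ᵢ₎ are the sorted p-values. A minimal sketch follows; the function name and the example p-values are illustrative, and the closed-testing and TDP-bound steps of the paper are not reproduced here.

```python
import numpy as np

def simes_pvalue(pvalues):
    """Simes p-value for the intersection of m hypotheses: min_i m * p_(i) / i."""
    p = np.sort(np.asarray(pvalues, dtype=float))
    m = p.size
    ranks = np.arange(1, m + 1)
    return float(np.min(m * p / ranks))

# Illustrative p-values (made up for this example)
example = [0.01, 0.04, 0.03, 0.20]
p_simes = simes_pvalue(example)
```

For these four values the candidates are 0.04, 0.06, 0.0533 and 0.20, so the Simes p-value is 0.04.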
Co-clustering multi-view data using the Latent Block Model
IF 1.5 | Zone 3 (Mathematics) | Q3 Computer Science, Interdisciplinary Applications | Pub Date: 2025-04-10 | DOI: 10.1016/j.csda.2025.108188
Joshua Tobin, Michaela Black, James Ng, Debbie Rankin, Jonathan Wallace, Catherine Hughes, Leane Hoey, Adrian Moore, Jinling Wang, Geraldine Horigan, Paul Carlin, Helene McNulty, Anne M. Molloy, Mimi Zhang
The Latent Block Model (LBM) is a prominent model-based co-clustering method, returning parametric representations of each block-cluster and allowing the use of well-grounded model selection methods. Although the LBM has been adapted to accommodate various feature types, it cannot be applied to datasets consisting of multiple distinct sets of features, termed views, for a common set of observations. The multi-view LBM is introduced herein, extending the LBM method to multi-view data, where each view marginally follows an LBM. For any pair of two views, the dependence between them is captured by a row-cluster membership matrix. A likelihood-based approach is formulated for parameter estimation, harnessing a stochastic EM algorithm merged with a Gibbs sampler, while an ICL criterion is formulated to determine the number of row- and column-clusters in each view. To justify the application of the multi-view approach, hypothesis tests are formulated to evaluate the independence of row-clusters across views, with the testing procedure seamlessly integrated into the estimation framework. A penalty scheme is also introduced to induce sparsity in row-clusterings. The algorithm's performance is validated using synthetic and real-world datasets, accompanied by recommendations for optimal parameter selection. Finally, the multi-view co-clustering method is applied to a complex genomics dataset, and is shown to provide new insights for high-dimension multi-view problems.
Computational Statistics & Data Analysis, Volume 210, Article 108188.
Non-parametric tests for cross-dependence based on multivariate extensions of ordinal patterns
IF 1.5 | Zone 3 (Mathematics) | Q3 Computer Science, Interdisciplinary Applications | Pub Date: 2025-04-10 | DOI: 10.1016/j.csda.2025.108189
Angelika Silbernagel, Christian H. Weiß, Alexander Schnurr
Analyzing the cross-dependence within sequentially observed pairs of random variables is an interesting mathematical problem that also has several practical applications. Most of the time, classical dependence measures like Pearson's correlation are used to this end. This quantity, however, only measures linear dependence and has other drawbacks as well. Different concepts for measuring cross-dependence in sequentially observed random vectors, which are based on so-called ordinal patterns or multivariate generalizations of them, are described. In all cases, limiting distributions of the corresponding test statistics are derived. In a simulation study, the performance of these statistics is compared with three competitors, namely, classical Pearson's and Spearman's correlation as well as the rank-based Chatterjee's correlation coefficient. The applicability of the test statistics is illustrated by using them on two real-world data examples.
Computational Statistics & Data Analysis, Volume 210, Article 108189.
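To make the notion of an ordinal pattern in the abstract above concrete: the pattern of a length-d window is the rank order of its values, so it is unchanged by any strictly increasing transformation of the series. The sketch below computes patterns and the share of time points at which two series show the same pattern. This simple coincidence rate is one illustrative way to use patterns, assuming no ties in the data; it is not claimed to be any of the authors' test statistics, and both function names are hypothetical.

```python
import numpy as np

def ordinal_patterns(x, d=3):
    """Return the ordinal pattern (rank vector) of each length-d window of x."""
    x = np.asarray(x, dtype=float)
    windows = np.lib.stride_tricks.sliding_window_view(x, d)
    # argsort of argsort gives the rank of each entry within its window
    return np.argsort(np.argsort(windows, axis=1), axis=1)

def pattern_coincidence_rate(x, y, d=3):
    """Fraction of windows in which x and y show the same ordinal pattern."""
    px, py = ordinal_patterns(x, d), ordinal_patterns(y, d)
    return float(np.mean(np.all(px == py, axis=1)))

# Illustrative series (made up for this example)
x = [1.0, 3.0, 2.0, 5.0, 4.0]
y = [10.0, 30.0, 20.0, 25.0, 40.0]
rate = pattern_coincidence_rate(x, y, d=3)
```

Here only the first of the three windows has matching patterns, so the rate is 1/3, whereas comparing `x` with `np.exp(x)` gives a rate of 1 because the transformation is monotone.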
A flexible mixed-membership model for community and enterotype detection for microbiome data
IF 1.5 | Zone 3 (Mathematics) | Q3 Computer Science, Interdisciplinary Applications | Pub Date: 2025-04-04 | DOI: 10.1016/j.csda.2025.108181
Alice Giampino, Roberto Ascari, Sonia Migliorati
Understanding how the human gut microbiome affects host health is challenging due to the wide interindividual variability, sparsity, and high dimensionality of microbiome data. Mixed-membership models have been previously applied to these data to detect latent communities of bacterial taxa that are expected to co-occur. The most widely used mixed-membership model is latent Dirichlet allocation (LDA). However, LDA is limited by the rigidity of the Dirichlet distribution imposed on the community proportions, which hinders its ability to model dependencies and account for overdispersion. To address this limitation, a generalization of LDA is proposed that introduces greater flexibility into the covariance matrix by incorporating the flexible Dirichlet (FD), a specific identifiable mixture with Dirichlet components. In addition to identifying communities, the new model enables the detection of enterotypes, i.e., clusters of samples with similar microbe composition. For inferential purposes, a computationally efficient collapsed Gibbs sampler that exploits the conjugacy of the FD distribution with respect to the multinomial model is proposed. A simulation study demonstrates the model's ability to accurately recover true parameter values by minimizing appropriate compositional discrepancy measures between the true and estimated values. Additionally, the model correctly identifies the number of communities, as evidenced by perplexity scores. Moreover, an application to the COMBO dataset highlights its effectiveness in detecting biologically significant and coherent communities and enterotypes, revealing a broader range of correlations between community abundances. These results underscore the new model as a definite improvement over LDA.
Computational Statistics & Data Analysis, Volume 210, Article 108181.
Multiply robust estimation of causal effects using linked data
IF 1.5 | Zone 3 (Mathematics) | Q3 Computer Science, Interdisciplinary Applications | Pub Date: 2025-04-02 | DOI: 10.1016/j.csda.2025.108175
Shanshan Luo, Yechi Zhang, Wei Li, Zhi Geng
Unmeasured confounding presents a common challenge in observational studies, potentially making standard causal parameters unidentifiable without additional assumptions. Given the increasing availability of diverse data sources, exploiting data linkage offers a potential solution to mitigate unmeasured confounding within a primary study of interest. However, this approach often introduces selection bias, as data linkage is feasible only for a subset of the study population. To address such a concern, this paper explores three nonparametric identification strategies assuming that a unit's inclusion in the linked cohort is determined solely by the observed confounders, while acknowledging that the ignorability assumption may depend on some partially unobserved covariates. The existence of multiple identification strategies motivates the development of estimators that effectively capture distinct components of the observed data distribution. Appropriately combining these estimators yields triply robust estimators for the average treatment effect. These estimators remain consistent if at least one of the three distinct parts of the observed data law is correct. Moreover, they are locally efficient if all the models are correctly specified. The proposed estimators are evaluated using simulation studies and real data analysis.
Computational Statistics & Data Analysis, Volume 209, Article 108175.
Eliciting prior information from clinical trials via calibrated Bayes factor
IF 1.5 | Zone 3 (Mathematics) | Q3 Computer Science, Interdisciplinary Applications | Pub Date: 2025-03-31 | DOI: 10.1016/j.csda.2025.108180
Roberto Macrì Demartino, Leonardo Egidi, Nicola Torelli, Ioannis Ntzoufras
In the Bayesian framework, power prior distributions are increasingly adopted in clinical trials and similar studies to incorporate external and past information, typically to inform the parameter associated with a treatment effect. Their use is particularly effective in scenarios with small sample sizes and where robust prior information is available. A crucial component of this methodology is represented by its weight parameter, which controls the volume of historical information incorporated into the current analysis. Although this parameter can be modeled as either fixed or random, eliciting its prior distribution via a full Bayesian approach remains challenging. In general, this parameter should be carefully selected to accurately reflect the available historical information without dominating the posterior inferential conclusions. A novel simulation-based calibrated Bayes factor procedure is proposed to elicit the prior distribution of the weight parameter, allowing it to be updated according to the strength of the evidence in the data. The goal is to facilitate the integration of historical data when there is agreement with current information and to limit it when discrepancies arise in terms, for instance, of prior-data conflicts. The performance of the proposed method is tested through simulation studies and applied to real data from clinical trials.
Computational Statistics & Data Analysis, Volume 209, Article 108180.
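The role of the weight parameter described in the power prior abstract above can be seen in the simplest conjugate case. With a fixed weight a0 in [0, 1], the historical likelihood is raised to the power a0 before being combined with the current data. The sketch below assumes a normal mean with known standard deviation and a flat initial prior; these modelling choices, the function name, and the data are all illustrative, and the calibrated Bayes factor procedure of the paper is not implemented here.

```python
import numpy as np

def power_prior_posterior(y, y0, sigma, a0):
    """Posterior mean and variance of a normal mean (known sigma, flat initial prior)
    when the historical likelihood of y0 is raised to the power a0."""
    y, y0 = np.asarray(y, dtype=float), np.asarray(y0, dtype=float)
    n, n0 = y.size, y0.size
    # The historical sample acts like a0 * n0 extra observations
    effective_n = n + a0 * n0
    mean = (n * y.mean() + a0 * n0 * y0.mean()) / effective_n
    variance = sigma**2 / effective_n
    return float(mean), float(variance)

# Illustrative current and historical data (made up for this example)
current = [1.0, 2.0, 3.0]
historical = [4.0, 4.0, 4.0, 4.0]
no_borrowing = power_prior_posterior(current, historical, sigma=1.0, a0=0.0)
half_borrowing = power_prior_posterior(current, historical, sigma=1.0, a0=0.5)
full_borrowing = power_prior_posterior(current, historical, sigma=1.0, a0=1.0)
```

With a0 = 0 the historical data are ignored (mean 2, variance 1/3), with a0 = 1 they are pooled fully (mean 22/7, variance 1/7), and a0 = 0.5 lies in between (mean 2.8, variance 0.2).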
Discretization: Privacy-preserving data publishing for causal discovery
IF 1.5 | Zone 3 (Mathematics) | Q3 Computer Science, Interdisciplinary Applications | Pub Date: 2025-03-27 | DOI: 10.1016/j.csda.2025.108174
Youngmin Ahn, Woongjoon Park, Gunwoong Park
As the importance of data privacy continues to grow, data masking has emerged as a crucial method. Notably, data masking techniques aim to protect individual privacy, while enabling data analysts to derive meaningful statistical results, such as the identification of directional or causal relationships between variables. Hence, this study demonstrates the advantages of a quantile-based discretization for protecting privacy and uncovering the relationships between variables in Gaussian directed acyclic graphical (DAG) models. Specifically, it introduces quantile-discretized Gaussian DAG models where each node variable is discretized based on the quantiles. Additionally, it proposes the bi-partition process, which aids in recovering the covariance matrix; hence, the models can be identifiable. Furthermore, a consistent algorithm is developed for learning the underlying structure using the quantile-based discretized data. Finally, through numerical experiments and the application of DAG learning algorithms to discretized MLB data, the proposed algorithm is demonstrated to significantly outperform the state-of-the-art DAG model learning algorithms.
随着数据隐私的重要性不断提高,数据屏蔽已经成为一种至关重要的方法。值得注意的是,数据屏蔽技术旨在保护个人隐私,同时使数据分析师能够得出有意义的统计结果,例如识别变量之间的方向或因果关系。因此,本研究证明了基于分位数的离散化在保护隐私和揭示高斯有向无环图(DAG)模型中变量之间关系方面的优势。具体来说,它引入了分位数离散化的高斯DAG模型,其中每个节点变量都是基于分位数离散化的。此外,它提出了双分割过程,这有助于恢复协方差矩阵;因此,模型可以被识别。在此基础上,提出了一种基于分位数的离散化数据学习底层结构的一致性算法。最后,通过数值实验和DAG学习算法在离散化MLB数据中的应用,证明了该算法明显优于目前最先进的DAG模型学习算法。
Youngmin Ahn, Woongjoon Park, Gunwoong Park. "Discretization: Privacy-preserving data publishing for causal discovery." Computational Statistics & Data Analysis, vol. 209, Article 108174. Published 2025-03-27. DOI: 10.1016/j.csda.2025.108174
Citations: 0
Efficient regularized estimation of graphical proportional hazards model with interval-censored data
IF 1.5; Zone 3 (Mathematics); Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2025-03-27; DOI: 10.1016/j.csda.2025.108178
Huimin Lu, Yilong Wang, Heming Bing, Shuying Wang, Niya Li
Variable selection is a recurring topic in survival analysis, and a large body of literature addresses the use of proportional hazards (PH) models for censored survival data. Working with interval-censored data, this paper considers the case in which the covariates have a complex network structure. To address this, a more flexible and versatile PH model is developed by combining probabilistic graphical models with PH models to describe the correlation between covariates. Based on the block coordinate descent method, a penalized estimation method is proposed that performs variable selection and parameter estimation simultaneously. The proposed model and its parameter estimation method are evaluated through simulation studies and an analysis of clinical trial data related to Alzheimer's disease, which confirm the reliability and accuracy of the proposed model and method.
Huimin Lu, Yilong Wang, Heming Bing, Shuying Wang, Niya Li. "Efficient regularized estimation of graphical proportional hazards model with interval-censored data." Computational Statistics & Data Analysis, vol. 209, Article 108178. Published 2025-03-27. DOI: 10.1016/j.csda.2025.108178
Citations: 0
Linear covariance selection model via ℓ1-penalization
IF 1.5; Zone 3 (Mathematics); Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2025-03-27; DOI: 10.1016/j.csda.2025.108176
Kwan-Young Bak, Seongoh Park
This paper presents a study on an ℓ1-penalized covariance regression method. Conventional approaches in high-dimensional covariance estimation often lack the flexibility to integrate external information. As a remedy, we adopt the regression-based covariance modeling framework and introduce a linear covariance selection model (LCSM) to encompass a broader spectrum of covariance structures when covariate information is available. Unlike existing methods, we do not assume that the true covariance matrix can be exactly represented by a linear combination of known basis matrices. Instead, we adopt additional basis matrices for a portion of the covariance patterns not captured by the given bases. To estimate high-dimensional regression coefficients, we exploit the sparsity-inducing ℓ1-penalization scheme. Our theoretical analyses are based on the (symmetric) matrix regression model with additive random error matrix, which allows us to establish new non-asymptotic convergence rates of the proposed covariance estimator. The proposed method is implemented with the coordinate descent algorithm. We conduct empirical evaluation on simulated data to complement theoretical findings and underscore the efficacy of our approach. To show a practical applicability of our method, we further apply it to the co-expression analysis of liver gene expression data where the given basis corresponds to the adjacency matrix of the co-expression network.
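To make the idea of ℓ1-penalized covariance regression with coordinate descent concrete, here is a minimal sketch. It assumes a plain lasso-type objective, 0.5·‖S − Σ_k β_k B_k‖_F² + λ‖β‖₁, over a covariance matrix S and basis matrices B_k; the paper's actual estimator, error model, and choice of additional bases may differ. The function names and the 3×3 toy bases are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Scalar soft-thresholding operator used by lasso coordinate updates."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def l1_cov_regression(S, bases, lam, n_iter=200):
    """Coordinate descent for 0.5*||S - sum_k beta_k B_k||_F^2 + lam*||beta||_1.

    Assumed objective for illustration only, not the authors' exact estimator.
    """
    beta = np.zeros(len(bases))
    norms = np.array([np.sum(B * B) for B in bases])  # squared Frobenius norms
    resid = np.asarray(S, dtype=float).copy()         # S - sum_k beta_k B_k
    for _ in range(n_iter):
        for k, B in enumerate(bases):
            # Inner product of B_k with the residual that excludes B_k's own term.
            rho = np.sum(B * resid) + norms[k] * beta[k]
            new = soft_threshold(rho, lam) / norms[k]
            resid += (beta[k] - new) * B              # keep the residual in sync
            beta[k] = new
    return beta

# Toy bases: identity, adjacency matrix of a 3-node chain, and one extra edge.
I3 = np.eye(3)
A = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
C = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
S = 2.0 * I3 + 0.5 * A

beta_hat = l1_cov_regression(S, [I3, A, C], lam=0.1)
```

In this toy setup the coefficient on the unused basis `C` is shrunk exactly to zero, which is the selection behaviour the ℓ1 penalty is meant to provide.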
Kwan-Young Bak, Seongoh Park. "Linear covariance selection model via ℓ1-penalization." Computational Statistics & Data Analysis, vol. 209, Article 108176. Published 2025-03-27. DOI: 10.1016/j.csda.2025.108176
Citations: 0