arXiv - STAT - Methodology: Latest Papers

Statistical Inference for Chi-square Statistics or F-Statistics Based on Multiple Imputation
Pub Date : 2024-09-17 DOI: arxiv-2409.10812
Binhuan Wang, Yixin Fang, Man Jin
Missing data is a common issue in medical, psychiatric, and social studies. In the literature, Multiple Imputation (MI) was proposed to multiply impute datasets and combine analysis results from imputed datasets for statistical inference using Rubin's rule. However, Rubin's rule only works for combined inference on statistical tests with point and variance estimates and is not applicable to combining general F-statistics or Chi-square statistics. In this manuscript, we provide a solution to combine F-test statistics from multiply imputed datasets, when the F-statistic has an explicit fractional form (that is, both the numerator and denominator of the F-statistic are reported). Then we extend the method to combine Chi-square statistics from multiply imputed datasets. Furthermore, we develop methods for two commonly applied F-tests, Welch's ANOVA and Type-III tests of fixed effects in mixed effects models, which do not have the explicit fractional form. SAS macros are also developed to facilitate applications.
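For readers unfamiliar with the baseline that the abstract contrasts against, the standard Rubin's rule for pooling point and variance estimates can be sketched as follows. This is a minimal illustration of the classical rule only, not the authors' method for F or Chi-square statistics and not their SAS macros; the function name `rubin_pool` and the example numbers are assumptions made for illustration.

```python
from statistics import mean, variance


def rubin_pool(estimates, variances):
    """Pool m point estimates and their variances with Rubin's rule.

    Hypothetical helper for illustration; it is not taken from the paper.
    """
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    w = mean(variances)            # within-imputation variance
    b = variance(estimates)        # between-imputation variance (m - 1 denominator)
    t = w + (1 + 1 / m) * b        # total variance
    # Rubin's (1987) reference degrees of freedom; infinite if b is zero
    if b > 0:
        df = (m - 1) * (1 + w / ((1 + 1 / m) * b)) ** 2
    else:
        df = float("inf")
    return q_bar, t, df


# Made-up estimates from m = 3 imputed datasets
q_bar, t, df = rubin_pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(q_bar, t, df)
```

The rule needs a point estimate and a variance from each imputed dataset, which is why, as the abstract notes, it does not directly apply to a reported F or Chi-square statistic.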
Decomposing Gaussians with Unknown Covariance
Pub Date : 2024-09-17 DOI: arxiv-2409.11497
Ameer Dharamshi, Anna Neufeld, Lucy L. Gao, Jacob Bien, Daniela Witten
Common workflows in machine learning and statistics rely on the ability to partition the information in a data set into independent portions. Recent work has shown that this may be possible even when conventional sample splitting is not (e.g., when the number of samples $n=1$, or when observations are not independent and identically distributed). However, the approaches that are currently available to decompose multivariate Gaussian data require knowledge of the covariance matrix. In many important problems (such as in spatial or longitudinal data analysis, and graphical modeling), the covariance matrix may be unknown and even of primary interest. Thus, in this work we develop new approaches to decompose Gaussians with unknown covariance. First, we present a general algorithm that encompasses all previous decomposition approaches for Gaussian data as special cases, and can further handle the case of an unknown covariance. It yields a new and more flexible alternative to sample splitting when $n>1$. When $n=1$, we prove that it is impossible to partition the information in a multivariate Gaussian into independent portions without knowing the covariance matrix. Thus, we use the general algorithm to decompose a single multivariate Gaussian with unknown covariance into dependent parts with tractable conditional distributions, and demonstrate their use for inference and validation. The proposed decomposition strategy extends naturally to Gaussian processes. In simulation and on electroencephalography data, we apply these decompositions to the tasks of model selection and post-selection inference in settings where alternative strategies are unavailable.
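To make concrete why existing decompositions need the covariance, here is a small simulation of the well-known known-variance construction in one dimension: given $X \sim N(\mu, \sigma^2)$, draw $Z \sim N(0, \sigma^2)$ and form $X_1 = X + \tau Z$ and $X_2 = X - Z/\tau$, so that $\mathrm{Cov}(X_1, X_2) = \sigma^2 - \sigma^2 = 0$. This is a sketch of that prior known-covariance idea only, not the new algorithm of the paper; the names `fission` and `sample_cov` and all numeric settings are assumptions.

```python
import random


def fission(x, sigma, tau, rng):
    """Split one observation x ~ N(mu, sigma^2) into two parts.

    Requires the true sigma: the added noise must match the variance of x.
    """
    z = rng.gauss(0.0, sigma)
    return x + tau * z, x - z / tau


def sample_cov(a, b):
    """Unbiased sample covariance of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (n - 1)


rng = random.Random(0)
mu, sigma, tau = 3.0, 2.0, 1.0  # illustrative values
pairs = [fission(rng.gauss(mu, sigma), sigma, tau, rng) for _ in range(20000)]
x1 = [p[0] for p in pairs]
x2 = [p[1] for p in pairs]
cov12 = sample_cov(x1, x2)
print("empirical Cov(X1, X2):", cov12)  # close to zero when sigma is correct
```

Because the two parts are jointly Gaussian, zero covariance implies independence. If the noise were drawn with a wrong variance, the cancellation would fail, which is the obstacle the abstract describes for the unknown-covariance case.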
Interpretability Indices and Soft Constraints for Factor Models
Pub Date : 2024-09-17 DOI: arxiv-2409.11525
Justin Philip Tuazon, Gia Mizrane Abubo, Joemari Olea
Factor analysis is a way to characterize the relationships between many (observable) variables in terms of a smaller number of unobservable random variables which are called factors. However, the application of factor models and its success can be subjective or difficult to gauge, since infinitely many factor models that produce the same correlation matrix can be fit given sample data. Thus, there is a need to operationalize a criterion that measures how meaningful or "interpretable" a factor model is in order to select the best among many factor models. While there are already techniques that aim to measure and enhance interpretability, new indices, as well as rotation methods via mathematical optimization based on them, are proposed to measure interpretability. The proposed methods directly incorporate semantics with the help of natural language processing and are generalized to incorporate any "prior information". Moreover, the indices allow for complete or partial specification of relationships at a pairwise level. Aside from these, two other main benefits of the proposed methods are that they do not require the estimation of factor scores, which avoids the factor score indeterminacy problem, and that no additional explanatory variables are necessary. The implementation of the proposed methods is written in Python 3 and is made available together with several helper functions through the package interpretablefa on the Python Package Index. The methods' application is demonstrated here using data on the Experiences in Close Relationships Scale, obtained from the Open-Source Psychometrics Project.
Estimation and imputation of missing data in longitudinal models with Zero-Inflated Poisson response variable
Pub Date : 2024-09-17 DOI: arxiv-2409.11040
D. S. Martinez-Lobo, O. O. Melo, N. A. Cruz
This research deals with the estimation and imputation of missing data in longitudinal models with a Poisson response variable inflated with zeros. A methodology is proposed that is based on the use of maximum likelihood, assuming that data is missing at random and that there is a correlation between the response variables. In each of the times, the expectation maximization (EM) algorithm is used: in step E, a weighted regression is carried out, conditioned on the previous times that are taken as covariates. In step M, the estimation and imputation of the missing data are performed. The good performance of the methodology in different loss scenarios is demonstrated in a simulation study comparing the model only with complete data, and estimating missing data using the mode of the data of each individual. Furthermore, in a study related to the growth of corn, it is tested on real data to develop the algorithm in a practical scenario.
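As background for the response distribution named in the title, the standard zero-inflated Poisson probability mass function mixes a point mass at zero with an ordinary Poisson count. The sketch below shows only that textbook distribution, not the authors' longitudinal model or EM procedure; the function name `zip_pmf` and the parameter values are assumptions.

```python
import math


def zip_pmf(k, lam, pi_zero):
    """P(Y = k) for a zero-inflated Poisson.

    lam is the Poisson mean; pi_zero is the probability of a structural zero.
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # A zero arises either structurally or from the Poisson part
        return pi_zero + (1 - pi_zero) * poisson
    return (1 - pi_zero) * poisson


# Illustrative parameters: mean 2 with 30% structural zeros
probs = [zip_pmf(k, 2.0, 0.3) for k in range(60)]
print(round(probs[0], 4))  # 0.3947
```

With these values the probability of a zero is well above the plain Poisson value of about 0.135, which is the excess of zeros that the model is meant to capture.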
Probability-scale residuals for event-time data
Pub Date : 2024-09-17 DOI: arxiv-2409.11385
Eric S. Kawaguchi, Bryan E. Shepherd, Chun Li
The probability-scale residual (PSR) is defined as $E\{\mathrm{sign}(y, Y^*)\}$, where $y$ is the observed outcome and $Y^*$ is a random variable from the fitted distribution. The PSR is particularly useful for ordinal and censored outcomes for which fitted values are not available without additional assumptions. Previous work has defined the PSR for continuous, binary, ordinal, right-censored, and current status outcomes; however, development of the PSR has not yet been considered for data subject to general interval censoring. We develop extensions of the PSR, first to mixed-case interval-censored data, and then to data subject to several types of common censoring schemes. We derive the statistical properties of the PSR and show that our more general PSR encompasses several previously defined PSR for continuous and censored outcomes as special cases. The performance of the residual is illustrated in real data from the Caribbean, Central, and South American Network for HIV Epidemiology.
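The definition above can be made concrete for uncensored outcomes. Taking $\mathrm{sign}(y, Y^*)$ to be $-1$, $0$, or $1$ according to whether $y$ is below, equal to, or above $Y^*$, the expectation reduces to $P(Y^* < y) - P(Y^* > y)$, which equals $2F(y) - 1$ for a continuous fitted distribution $F$. The sketch below covers only these simple cases and not the interval-censored extensions developed in the paper; the function names and example values are assumptions.

```python
import math


def psr_normal(y, mu, sigma):
    """PSR when the fitted distribution is N(mu, sigma^2): 2 * F(y) - 1."""
    cdf = 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))
    return 2.0 * cdf - 1.0


def psr_discrete(y, pmf):
    """PSR for a discrete or ordinal fit given as {value: probability}."""
    below = sum(p for value, p in pmf.items() if value < y)
    above = sum(p for value, p in pmf.items() if value > y)
    return below - above


# An observation at the fitted mean has residual 0
print(psr_normal(0.0, 0.0, 1.0))
# Ordinal outcome with three levels and made-up fitted probabilities
print(psr_discrete(2, {1: 0.2, 2: 0.5, 3: 0.3}))
```

The residual always lies in $[-1, 1]$ and requires only the fitted distribution, which is why it remains usable for ordinal outcomes that have no natural fitted value.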
BMRMM: An R Package for Bayesian Markov (Renewal) Mixed Models
Pub Date : 2024-09-17 DOI: arxiv-2409.10835
Yutong Wu, Abhra Sarkar
We introduce the BMRMM package implementing Bayesian inference for a class of Markov renewal mixed models which can characterize the stochastic dynamics of a collection of sequences, each comprising alternative instances of categorical states and associated continuous duration times, while being influenced by a set of exogenous factors as well as a 'random' individual. The default setting flexibly models the state transition probabilities using mixtures of Dirichlet distributions and the duration times using mixtures of gamma kernels while also allowing variable selection for both. Modeling such data using simpler Markov mixed models also remains an option, either by ignoring the duration times altogether or by replacing them with instances of an additional category obtained by discretizing them by a user-specified unit. The option is also useful when data on duration times may not be available in the first place. We demonstrate the package's utility using two data sets.
Performance of Cross-Validated Targeted Maximum Likelihood Estimation
Pub Date : 2024-09-17 DOI: arxiv-2409.11265
Matthew J. Smith, Rachael V. Phillips, Camille Maringe, Miguel Angel Luque Fernandez
Background: Advanced methods for causal inference, such as targeted maximum likelihood estimation (TMLE), require certain conditions for statistical inference. However, in situations where there is not differentiability due to data sparsity or near-positivity violations, the Donsker class condition is violated. In such situations, TMLE variance can suffer from inflation of the type I error and poor coverage, leading to conservative confidence intervals. Cross-validation of the TMLE algorithm (CVTMLE) has been suggested to improve on performance compared to TMLE in settings of positivity or Donsker class violations. We aim to investigate the performance of CVTMLE compared to TMLE in various settings. Methods: We utilised the data-generating mechanism as described in Leger et al. (2022) to run a Monte Carlo experiment under different Donsker class violations. Then, we evaluated the respective statistical performances of TMLE and CVTMLE with different super learner libraries, with and without regression tree methods. Results: We found that CVTMLE vastly improves confidence interval coverage without adversely affecting bias, particularly in settings with small sample sizes and near-positivity violations. Furthermore, incorporating regression trees using standard TMLE with ensemble super learner-based initial estimates increases bias and variance leading to invalid statistical inference. Conclusions: It has been shown that when using CVTMLE the Donsker class condition is no longer necessary to obtain valid statistical inference when using regression trees and under either data sparsity or near-positivity violations. We show through simulations that CVTMLE is much less sensitive to the choice of the super learner library and thereby provides better estimation and inference in cases where the super learner library uses more flexible candidates and is prone to overfitting.
Flexible survival regression with variable selection for heterogeneous population
Pub Date : 2024-09-16 DOI: arxiv-2409.10771
Abhishek Mandal, Abhisek Chakraborty
Survival regression is widely used to model time-to-event data, to explore how covariates may influence the occurrence of events. Modern datasets often encompass a vast number of covariates across many subjects, with only a subset of the covariates significantly affecting survival. Additionally, subjects often belong to an unknown number of latent groups, where covariate effects on survival differ significantly across groups. The proposed methodology addresses both challenges by simultaneously identifying the latent sub-groups in the heterogeneous population and evaluating covariate significance within each sub-group. This approach is shown to enhance the predictive accuracy for time-to-event outcomes, via uncovering varying risk profiles within the underlying heterogeneous population, and is thereby helpful to devise targeted disease management strategies.
bayesCureRateModel: Bayesian Cure Rate Modeling for Time to Event Data in R
Pub Date : 2024-09-16 DOI: arxiv-2409.10221
Panagiotis Papastamoulis, Fotios Milienos
The family of cure models provides a unique opportunity to simultaneously model both the proportion of cured subjects (those not facing the event of interest) and the distribution function of time-to-event for susceptibles (those facing the event). In practice, the application of cure models is mainly facilitated by the availability of various R packages. However, most of these packages primarily focus on the mixture or promotion time cure rate model. This article presents a fully Bayesian approach implemented in R to estimate a general family of cure rate models in the presence of covariates. It builds upon the work by Papastamoulis and Milienos (2024) by additionally considering various options for describing the promotion time, including the Weibull, exponential, Gompertz, log-logistic and finite mixtures of gamma distributions, among others. Moreover, the user can choose any proper distribution function for modeling the promotion time (provided that some specific conditions are met). Posterior inference is carried out by constructing a Metropolis-coupled Markov chain Monte Carlo (MCMC) sampler, which combines Gibbs sampling for the latent cure indicators and Metropolis-Hastings steps with Langevin diffusion dynamics for parameter updates. The main MCMC algorithm is embedded within a parallel tempering scheme by considering heated versions of the target posterior distribution. The package is illustrated on a real dataset analyzing the duration of the first marriage under the presence of various covariates such as the race, age and the presence of kids.
Citations: 0
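The bayesCureRateModel abstract above names the mixture and promotion time cure rate models as the two cases most packages cover. As a rough illustration only, the sketch below writes out the standard textbook population survival functions for those two cases with a Weibull distribution. It is not the package's API or its general family of models, and all function and parameter names (`weibull_cdf`, `mixture_cure_survival`, `promotion_time_survival`, `cure_prob`, `theta`) are hypothetical.

```python
import math


def weibull_cdf(t, shape, scale):
    """Weibull distribution function F(t); zero for non-positive times."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / scale) ** shape))


def mixture_cure_survival(t, cure_prob, shape, scale):
    """Mixture cure model: S_pop(t) = pi + (1 - pi) * S_u(t).

    cure_prob (pi) is the proportion of cured subjects; S_u is the
    survival function of the susceptibles (Weibull here, by assumption).
    """
    susceptible_survival = 1.0 - weibull_cdf(t, shape, scale)
    return cure_prob + (1.0 - cure_prob) * susceptible_survival


def promotion_time_survival(t, theta, shape, scale):
    """Promotion time cure model: S_pop(t) = exp(-theta * F(t)).

    As t grows, F(t) tends to 1, so the cured fraction is exp(-theta).
    """
    return math.exp(-theta * weibull_cdf(t, shape, scale))


if __name__ == "__main__":
    # Print both population survival curves at a few time points.
    for t in (0.0, 1.0, 5.0, 50.0):
        s_mix = mixture_cure_survival(t, cure_prob=0.3, shape=1.5, scale=2.0)
        s_pt = promotion_time_survival(t, theta=1.2, shape=1.5, scale=2.0)
        print(t, s_mix, s_pt)
```

In both cases the population survival curve levels off at a positive value instead of decaying to zero, and that plateau is the cured proportion.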
Generalized Matrix Factor Model
Pub Date : 2024-09-16 DOI: arxiv-2409.10001
Xinbing Kong, Tong Zhang
This article introduces a nonlinear generalized matrix factor model (GMFM) that allows for mixed-type variables, extending the scope of linear matrix factor models (LMFM) that are so far limited to handling continuous variables. We introduce a novel augmented Lagrange multiplier method, equivalent to the constraint maximum likelihood estimation, and carefully tailored to be locally concave around the true factor and loading parameters. This statistically guarantees the local convexity of the negative Hessian matrix around the true parameters of the factors and loadings, which is nontrivial in the matrix factor modeling and leads to feasible central limit theorems of the estimated factors and loadings. We also theoretically establish the convergence rates of the estimated factor and loading matrices for the GMFM under general conditions that allow for correlations across samples, rows, and columns. Moreover, we provide a model selection criterion to determine the numbers of row and column factors consistently. To numerically compute the constraint maximum likelihood estimator, we provide two algorithms: two-stage alternating maximization and minorization maximization. Extensive simulation studies demonstrate GMFM's superiority in handling discrete and mixed-type variables. An empirical data analysis of the company's operating performance shows that GMFM does clustering and reconstruction well in the presence of discontinuous entries in the data matrix.
Citations: 0
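To make the idea in the Generalized Matrix Factor Model abstract above more concrete, here is a minimal sketch of how a matrix factor structure might drive a non-continuous (binary) data matrix. It assumes a bilinear signal M = R F C^T, the usual form in linear matrix factor models, passed through a logistic link. The link choice, the helper names (`matmul`, `transpose`, `simulate_binary_matrix`) and the toy numbers are all assumptions for illustration; this is not the authors' specification or their estimation algorithm.

```python
import math
import random


def matmul(A, B):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    """Transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*A)]


def simulate_binary_matrix(R, F, C, rng):
    """Draw one binary data matrix from a bilinear factor structure.

    R: p1 x k1 row loadings, F: k1 x k2 factor matrix, C: p2 x k2 column
    loadings. Entry (i, j) is Bernoulli with success probability
    sigmoid(M[i][j]), where M = R F C^T (logistic link assumed here).
    """
    M = matmul(matmul(R, F), transpose(C))
    X = [[1 if rng.random() < 1.0 / (1.0 + math.exp(-m)) else 0
          for m in row]
         for row in M]
    return X, M


if __name__ == "__main__":
    rng = random.Random(0)
    R = [[1.0], [2.0]]           # 2 rows, 1 row factor
    F = [[0.5]]                  # 1 x 1 factor matrix
    C = [[1.0], [-1.0], [0.0]]   # 3 columns, 1 column factor
    X, M = simulate_binary_matrix(R, F, C, rng)
    print(M)
    print(X)
```

A linear matrix factor model would instead observe M plus continuous noise directly; here only the binary draws X are observed, which is the kind of discrete entry the abstract says the linear model is not designed for.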