Missing data is a common issue in medical, psychiatric, and social studies. In the literature, Multiple Imputation (MI) has been proposed: datasets are imputed multiple times, and the analysis results from the imputed datasets are combined for statistical inference using Rubin's rule. However, Rubin's rule applies only to combined inference for statistical tests with point and variance estimates and cannot be used to combine general F-statistics or Chi-square statistics. In this manuscript, we provide a solution for combining F-test statistics from multiply imputed datasets when the F-statistic has an explicit fractional form (that is, both the numerator and denominator of the F-statistic are reported). We then extend the method to combine Chi-square statistics from multiply imputed datasets. Furthermore, we develop methods for two commonly applied F-tests, Welch's ANOVA and Type-III tests of fixed effects in mixed effects models, which do not have the explicit fractional form. SAS macros are also developed to facilitate applications.
{"title":"Statistical Inference for Chi-square Statistics or F-Statistics Based on Multiple Imputation","authors":"Binhuan Wang, Yixin Fang, Man Jin","doi":"arxiv-2409.10812","DOIUrl":"https://doi.org/arxiv-2409.10812","url":null,"abstract":"Missing data is a common issue in medical, psychiatry, and social studies. In\u0000literature, Multiple Imputation (MI) was proposed to multiply impute datasets\u0000and combine analysis results from imputed datasets for statistical inference\u0000using Rubin's rule. However, Rubin's rule only works for combined inference on\u0000statistical tests with point and variance estimates and is not applicable to\u0000combine general F-statistics or Chi-square statistics. In this manuscript, we\u0000provide a solution to combine F-test statistics from multiply imputed datasets,\u0000when the F-statistic has an explicit fractional form (that is, both the\u0000numerator and denominator of the F-statistic are reported). Then we extend the\u0000method to combine Chi-square statistics from multiply imputed datasets.\u0000Furthermore, we develop methods for two commonly applied F-tests, Welch's ANOVA\u0000and Type-III tests of fixed effects in mixed effects models, which do not have\u0000the explicit fractional form. SAS macros are also developed to facilitate\u0000applications.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ameer Dharamshi, Anna Neufeld, Lucy L. Gao, Jacob Bien, Daniela Witten
Common workflows in machine learning and statistics rely on the ability to partition the information in a data set into independent portions. Recent work has shown that this may be possible even when conventional sample splitting is not (e.g., when the number of samples $n=1$, or when observations are not independent and identically distributed). However, the approaches that are currently available to decompose multivariate Gaussian data require knowledge of the covariance matrix. In many important problems (such as in spatial or longitudinal data analysis, and graphical modeling), the covariance matrix may be unknown and even of primary interest. Thus, in this work we develop new approaches to decompose Gaussians with unknown covariance. First, we present a general algorithm that encompasses all previous decomposition approaches for Gaussian data as special cases, and can further handle the case of an unknown covariance. It yields a new and more flexible alternative to sample splitting when $n>1$. When $n=1$, we prove that it is impossible to partition the information in a multivariate Gaussian into independent portions without knowing the covariance matrix. Thus, we use the general algorithm to decompose a single multivariate Gaussian with unknown covariance into dependent parts with tractable conditional distributions, and demonstrate their use for inference and validation. The proposed decomposition strategy extends naturally to Gaussian processes. In simulation and on electroencephalography data, we apply these decompositions to the tasks of model selection and post-selection inference in settings where alternative strategies are unavailable.
{"title":"Decomposing Gaussians with Unknown Covariance","authors":"Ameer Dharamshi, Anna Neufeld, Lucy L. Gao, Jacob Bien, Daniela Witten","doi":"arxiv-2409.11497","DOIUrl":"https://doi.org/arxiv-2409.11497","url":null,"abstract":"Common workflows in machine learning and statistics rely on the ability to\u0000partition the information in a data set into independent portions. Recent work\u0000has shown that this may be possible even when conventional sample splitting is\u0000not (e.g., when the number of samples $n=1$, or when observations are not\u0000independent and identically distributed). However, the approaches that are\u0000currently available to decompose multivariate Gaussian data require knowledge\u0000of the covariance matrix. In many important problems (such as in spatial or\u0000longitudinal data analysis, and graphical modeling), the covariance matrix may\u0000be unknown and even of primary interest. Thus, in this work we develop new\u0000approaches to decompose Gaussians with unknown covariance. First, we present a\u0000general algorithm that encompasses all previous decomposition approaches for\u0000Gaussian data as special cases, and can further handle the case of an unknown\u0000covariance. It yields a new and more flexible alternative to sample splitting\u0000when $n>1$. When $n=1$, we prove that it is impossible to partition the\u0000information in a multivariate Gaussian into independent portions without\u0000knowing the covariance matrix. Thus, we use the general algorithm to decompose\u0000a single multivariate Gaussian with unknown covariance into dependent parts\u0000with tractable conditional distributions, and demonstrate their use for\u0000inference and validation. The proposed decomposition strategy extends naturally\u0000to Gaussian processes. In simulation and on electroencephalography data, we\u0000apply these decompositions to the tasks of model selection and post-selection\u0000inference in settings where alternative strategies are unavailable.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"77 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Justin Philip Tuazon, Gia Mizrane Abubo, Joemari Olea
Factor analysis is a way to characterize the relationships among many (observable) variables in terms of a smaller number of unobservable random variables called factors. However, the application of factor models, and their success, can be subjective or difficult to gauge, since infinitely many factor models that produce the same correlation matrix can be fit to given sample data. Thus, there is a need to operationalize a criterion that measures how meaningful or "interpretable" a factor model is, in order to select the best among many candidate models. While techniques that aim to measure and enhance interpretability already exist, this work proposes new interpretability indices, together with rotation methods based on them via mathematical optimization. The proposed methods directly incorporate semantics with the help of natural language processing and are generalized to incorporate any "prior information". Moreover, the indices allow for complete or partial specification of relationships at a pairwise level. Two further benefits of the proposed methods are that they do not require the estimation of factor scores, which avoids the factor score indeterminacy problem, and that no additional explanatory variables are necessary. The implementation of the proposed methods is written in Python 3 and is made available, together with several helper functions, through the package interpretablefa on the Python Package Index. The methods' application is demonstrated using data on the Experiences in Close Relationships Scale, obtained from the Open-Source Psychometrics Project.
{"title":"Interpretability Indices and Soft Constraints for Factor Models","authors":"Justin Philip Tuazon, Gia Mizrane Abubo, Joemari Olea","doi":"arxiv-2409.11525","DOIUrl":"https://doi.org/arxiv-2409.11525","url":null,"abstract":"Factor analysis is a way to characterize the relationships between many\u0000(observable) variables in terms of a smaller number of unobservable random\u0000variables which are called factors. However, the application of factor models\u0000and its success can be subjective or difficult to gauge, since infinitely many\u0000factor models that produce the same correlation matrix can be fit given sample\u0000data. Thus, there is a need to operationalize a criterion that measures how\u0000meaningful or \"interpretable\" a factor model is in order to select the best\u0000among many factor models. While there are already techniques that aim to measure and enhance\u0000interpretability, new indices, as well as rotation methods via mathematical\u0000optimization based on them, are proposed to measure interpretability. The\u0000proposed methods directly incorporate semantics with the help of natural\u0000language processing and are generalized to incorporate any \"prior information\".\u0000Moreover, the indices allow for complete or partial specification of\u0000relationships at a pairwise level. Aside from these, two other main benefits of\u0000the proposed methods are that they do not require the estimation of factor\u0000scores, which avoids the factor score indeterminacy problem, and that no\u0000additional explanatory variables are necessary. The implementation of the proposed methods is written in Python 3 and is made\u0000available together with several helper functions through the package\u0000interpretablefa on the Python Package Index. The methods' application is\u0000demonstrated here using data on the Experiences in Close Relationships Scale,\u0000obtained from the Open-Source Psychometrics Project.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"104 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This research deals with the estimation and imputation of missing data in longitudinal models with a zero-inflated Poisson response variable. A methodology based on maximum likelihood is proposed, assuming that data are missing at random and that the response variables are correlated over time. At each time point, the expectation-maximization (EM) algorithm is used: in the E-step, a weighted regression is carried out, conditioned on the previous time points, which are taken as covariates; in the M-step, the missing data are estimated and imputed. A simulation study demonstrates the good performance of the methodology under different missingness scenarios, comparing it against a model fit only to complete data and against imputing missing values with each individual's mode. Furthermore, the algorithm is tested on real data from a study of corn growth to illustrate its use in a practical scenario.
{"title":"Estimation and imputation of missing data in longitudinal models with Zero-Inflated Poisson response variable","authors":"D. S. Martinez-Lobo, O. O. Melo, N. A. Cruz","doi":"arxiv-2409.11040","DOIUrl":"https://doi.org/arxiv-2409.11040","url":null,"abstract":"This research deals with the estimation and imputation of missing data in\u0000longitudinal models with a Poisson response variable inflated with zeros. A\u0000methodology is proposed that is based on the use of maximum likelihood,\u0000assuming that data is missing at random and that there is a correlation between\u0000the response variables. In each of the times, the expectation maximization (EM)\u0000algorithm is used: in step E, a weighted regression is carried out, conditioned\u0000on the previous times that are taken as covariates. In step M, the estimation\u0000and imputation of the missing data are performed. The good performance of the\u0000methodology in different loss scenarios is demonstrated in a simulation study\u0000comparing the model only with complete data, and estimating missing data using\u0000the mode of the data of each individual. Furthermore, in a study related to the\u0000growth of corn, it is tested on real data to develop the algorithm in a\u0000practical scenario.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"203 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The probability-scale residual (PSR) is defined as $E\{\mathrm{sign}(y, Y^*)\}$, where $y$ is the observed outcome, $Y^*$ is a random variable from the fitted distribution, and $\mathrm{sign}(y, Y^*)$ equals $-1$, $0$, or $1$ according to whether $y$ is less than, equal to, or greater than $Y^*$. The PSR is particularly useful for ordinal and censored outcomes for which fitted values are not available without additional assumptions. Previous work has defined the PSR for continuous, binary, ordinal, right-censored, and current status outcomes; however, development of the PSR has not yet been considered for data subject to general interval censoring. We develop extensions of the PSR, first to mixed-case interval-censored data, and then to data subject to several types of common censoring schemes. We derive the statistical properties of the PSR and show that our more general PSR encompasses several previously defined PSRs for continuous and censored outcomes as special cases. The performance of the residual is illustrated in real data from the Caribbean, Central, and South American Network for HIV Epidemiology.
{"title":"Probability-scale residuals for event-time data","authors":"Eric S. Kawaguchi, Bryan E. Shepherd, Chun Li","doi":"arxiv-2409.11385","DOIUrl":"https://doi.org/arxiv-2409.11385","url":null,"abstract":"The probability-scale residual (PSR) is defined as $E{sign(y, Y^*)}$, where\u0000$y$ is the observed outcome and $Y^*$ is a random variable from the fitted\u0000distribution. The PSR is particularly useful for ordinal and censored outcomes\u0000for which fitted values are not available without additional assumptions.\u0000Previous work has defined the PSR for continuous, binary, ordinal,\u0000right-censored, and current status outcomes; however, development of the PSR\u0000has not yet been considered for data subject to general interval censoring. We\u0000develop extensions of the PSR, first to mixed-case interval-censored data, and\u0000then to data subject to several types of common censoring schemes. We derive\u0000the statistical properties of the PSR and show that our more general PSR\u0000encompasses several previously defined PSR for continuous and censored outcomes\u0000as special cases. The performance of the residual is illustrated in real data\u0000from the Caribbean, Central, and South American Network for HIV Epidemiology.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We introduce the BMRMM package implementing Bayesian inference for a class of Markov renewal mixed models which can characterize the stochastic dynamics of a collection of sequences, each comprising alternative instances of categorical states and associated continuous duration times, while being influenced by a set of exogenous factors as well as a 'random' individual. The default setting flexibly models the state transition probabilities using mixtures of Dirichlet distributions and the duration times using mixtures of gamma kernels while also allowing variable selection for both. Modeling such data using simpler Markov mixed models also remains an option, either by ignoring the duration times altogether or by replacing them with instances of an additional category obtained by discretizing them by a user-specified unit. The option is also useful when data on duration times may not be available in the first place. We demonstrate the package's utility using two data sets.
{"title":"BMRMM: An R Package for Bayesian Markov (Renewal) Mixed Models","authors":"Yutong Wu, Abhra Sarkar","doi":"arxiv-2409.10835","DOIUrl":"https://doi.org/arxiv-2409.10835","url":null,"abstract":"We introduce the BMRMM package implementing Bayesian inference for a class of\u0000Markov renewal mixed models which can characterize the stochastic dynamics of a\u0000collection of sequences, each comprising alternative instances of categorical\u0000states and associated continuous duration times, while being influenced by a\u0000set of exogenous factors as well as a 'random' individual. The default setting\u0000flexibly models the state transition probabilities using mixtures of Dirichlet\u0000distributions and the duration times using mixtures of gamma kernels while also\u0000allowing variable selection for both. Modeling such data using simpler Markov\u0000mixed models also remains an option, either by ignoring the duration times\u0000altogether or by replacing them with instances of an additional category\u0000obtained by discretizing them by a user-specified unit. The option is also\u0000useful when data on duration times may not be available in the first place. We\u0000demonstrate the package's utility using two data sets.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Matthew J. Smith, Rachael V. Phillips, Camille Maringe, Miguel Angel Luque Fernandez
Background: Advanced methods for causal inference, such as targeted maximum likelihood estimation (TMLE), require certain conditions for valid statistical inference. In situations where differentiability fails because of data sparsity or near-positivity violations, the Donsker class condition is violated. In such situations, the TMLE variance can suffer from type I error inflation and poor coverage, leading to conservative confidence intervals. Cross-validation of the TMLE algorithm (CVTMLE) has been suggested to improve performance relative to TMLE under positivity or Donsker class violations. We aim to investigate the performance of CVTMLE compared to TMLE in various settings.
Methods: We used the data-generating mechanism described in Leger et al. (2022) to run a Monte Carlo experiment under different Donsker class violations. We then evaluated the statistical performance of TMLE and CVTMLE with different super learner libraries, with and without regression tree methods.
Results: We found that CVTMLE vastly improves confidence interval coverage without adversely affecting bias, particularly in settings with small sample sizes and near-positivity violations. Furthermore, incorporating regression trees in standard TMLE with ensemble super learner-based initial estimates increases bias and variance, leading to invalid statistical inference.
Conclusions: When using CVTMLE with regression trees, the Donsker class condition is no longer necessary to obtain valid statistical inference under either data sparsity or near-positivity violations. We show through simulations that CVTMLE is much less sensitive to the choice of the super learner library and thereby provides better estimation and inference when the super learner library includes more flexible candidates and is prone to overfitting.
{"title":"Performance of Cross-Validated Targeted Maximum Likelihood Estimation","authors":"Matthew J. Smith, Rachael V. Phillips, Camille Maringe, Miguel Angel Luque Fernandez","doi":"arxiv-2409.11265","DOIUrl":"https://doi.org/arxiv-2409.11265","url":null,"abstract":"Background: Advanced methods for causal inference, such as targeted maximum\u0000likelihood estimation (TMLE), require certain conditions for statistical\u0000inference. However, in situations where there is not differentiability due to\u0000data sparsity or near-positivity violations, the Donsker class condition is\u0000violated. In such situations, TMLE variance can suffer from inflation of the\u0000type I error and poor coverage, leading to conservative confidence intervals.\u0000Cross-validation of the TMLE algorithm (CVTMLE) has been suggested to improve\u0000on performance compared to TMLE in settings of positivity or Donsker class\u0000violations. We aim to investigate the performance of CVTMLE compared to TMLE in\u0000various settings. Methods: We utilised the data-generating mechanism as described in Leger et\u0000al. (2022) to run a Monte Carlo experiment under different Donsker class\u0000violations. Then, we evaluated the respective statistical performances of TMLE\u0000and CVTMLE with different super learner libraries, with and without regression\u0000tree methods. Results: We found that CVTMLE vastly improves confidence interval coverage\u0000without adversely affecting bias, particularly in settings with small sample\u0000sizes and near-positivity violations. Furthermore, incorporating regression\u0000trees using standard TMLE with ensemble super learner-based initial estimates\u0000increases bias and variance leading to invalid statistical inference. Conclusions: It has been shown that when using CVTMLE the Donsker class\u0000condition is no longer necessary to obtain valid statistical inference when\u0000using regression trees and under either data sparsity or near-positivity\u0000violations. We show through simulations that CVTMLE is much less sensitive to\u0000the choice of the super learner library and thereby provides better estimation\u0000and inference in cases where the super learner library uses more flexible\u0000candidates and is prone to overfitting.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Survival regression is widely used to model time-to-event data and to explore how covariates may influence the occurrence of events. Modern datasets often encompass a vast number of covariates across many subjects, with only a subset of the covariates significantly affecting survival. Additionally, subjects often belong to an unknown number of latent groups, where covariate effects on survival differ significantly across groups. The proposed methodology addresses both challenges by simultaneously identifying the latent sub-groups in the heterogeneous population and evaluating covariate significance within each sub-group. This approach is shown to enhance predictive accuracy for time-to-event outcomes by uncovering varying risk profiles within the underlying heterogeneous population, and is thereby helpful for devising targeted disease management strategies.
{"title":"Flexible survival regression with variable selection for heterogeneous population","authors":"Abhishek Mandal, Abhisek Chakraborty","doi":"arxiv-2409.10771","DOIUrl":"https://doi.org/arxiv-2409.10771","url":null,"abstract":"Survival regression is widely used to model time-to-events data, to explore\u0000how covariates may influence the occurrence of events. Modern datasets often\u0000encompass a vast number of covariates across many subjects, with only a subset\u0000of the covariates significantly affecting survival. Additionally, subjects\u0000often belong to an unknown number of latent groups, where covariate effects on\u0000survival differ significantly across groups. The proposed methodology addresses\u0000both challenges by simultaneously identifying the latent sub-groups in the\u0000heterogeneous population and evaluating covariate significance within each\u0000sub-group. This approach is shown to enhance the predictive accuracy for\u0000time-to-event outcomes, via uncovering varying risk profiles within the\u0000underlying heterogeneous population and is thereby helpful to device targeted\u0000disease management strategies.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The family of cure models provides a unique opportunity to simultaneously model both the proportion of cured subjects (those not facing the event of interest) and the distribution function of time-to-event for susceptibles (those facing the event). In practice, the application of cure models is mainly facilitated by the availability of various R packages. However, most of these packages primarily focus on the mixture or promotion time cure rate model. This article presents a fully Bayesian approach implemented in R to estimate a general family of cure rate models in the presence of covariates. It builds upon the work by Papastamoulis and Milienos (2024) by additionally considering various options for describing the promotion time, including the Weibull, exponential, Gompertz, log-logistic and finite mixtures of gamma distributions, among others. Moreover, the user can choose any proper distribution function for modeling the promotion time (provided that some specific conditions are met). Posterior inference is carried out by constructing a Metropolis-coupled Markov chain Monte Carlo (MCMC) sampler, which combines Gibbs sampling for the latent cure indicators and Metropolis-Hastings steps with Langevin diffusion dynamics for parameter updates. The main MCMC algorithm is embedded within a parallel tempering scheme by considering heated versions of the target posterior distribution. The package is illustrated on a real dataset analyzing the duration of first marriage in the presence of various covariates such as race, age, and the presence of children.
{"title":"bayesCureRateModel: Bayesian Cure Rate Modeling for Time to Event Data in R","authors":"Panagiotis Papastamoulis, Fotios Milienos","doi":"arxiv-2409.10221","DOIUrl":"https://doi.org/arxiv-2409.10221","url":null,"abstract":"The family of cure models provides a unique opportunity to simultaneously\u0000model both the proportion of cured subjects (those not facing the event of\u0000interest) and the distribution function of time-to-event for susceptibles\u0000(those facing the event). In practice, the application of cure models is mainly\u0000facilitated by the availability of various R packages. However, most of these\u0000packages primarily focus on the mixture or promotion time cure rate model. This\u0000article presents a fully Bayesian approach implemented in R to estimate a\u0000general family of cure rate models in the presence of covariates. It builds\u0000upon the work by Papastamoulis and Milienos (2024) by additionally considering\u0000various options for describing the promotion time, including the Weibull,\u0000exponential, Gompertz, log-logistic and finite mixtures of gamma distributions,\u0000among others. Moreover, the user can choose any proper distribution function\u0000for modeling the promotion time (provided that some specific conditions are\u0000met). Posterior inference is carried out by constructing a Metropolis-coupled\u0000Markov chain Monte Carlo (MCMC) sampler, which combines Gibbs sampling for the\u0000latent cure indicators and Metropolis-Hastings steps with Langevin diffusion\u0000dynamics for parameter updates. The main MCMC algorithm is embedded within a\u0000parallel tempering scheme by considering heated versions of the target\u0000posterior distribution. The package is illustrated on a real dataset analyzing\u0000the duration of the first marriage under the presence of various covariates\u0000such as the race, age and the presence of kids.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"183 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This article introduces a nonlinear generalized matrix factor model (GMFM) that allows for mixed-type variables, extending the scope of linear matrix factor models (LMFM), which are so far limited to handling continuous variables. We introduce a novel augmented Lagrange multiplier method, equivalent to constrained maximum likelihood estimation, and carefully tailored to be locally concave around the true factor and loading parameters. This statistically guarantees the local convexity of the negative Hessian matrix around the true parameters of the factors and loadings, which is nontrivial in matrix factor modeling and leads to feasible central limit theorems for the estimated factors and loadings. We also theoretically establish the convergence rates of the estimated factor and loading matrices for the GMFM under general conditions that allow for correlations across samples, rows, and columns. Moreover, we provide a model selection criterion to determine the numbers of row and column factors consistently. To numerically compute the constrained maximum likelihood estimator, we provide two algorithms: two-stage alternating maximization and minorization maximization. Extensive simulation studies demonstrate GMFM's superiority in handling discrete and mixed-type variables. An empirical analysis of company operating performance data shows that GMFM does clustering and reconstruction well in the presence of discontinuous entries in the data matrix.
{"title":"Generalized Matrix Factor Model","authors":"Xinbing Kong, Tong Zhang","doi":"arxiv-2409.10001","DOIUrl":"https://doi.org/arxiv-2409.10001","url":null,"abstract":"This article introduces a nonlinear generalized matrix factor model (GMFM)\u0000that allows for mixed-type variables, extending the scope of linear matrix\u0000factor models (LMFM) that are so far limited to handling continuous variables.\u0000We introduce a novel augmented Lagrange multiplier method, equivalent to the\u0000constraint maximum likelihood estimation, and carefully tailored to be locally\u0000concave around the true factor and loading parameters. This statistically\u0000guarantees the local convexity of the negative Hessian matrix around the true\u0000parameters of the factors and loadings, which is nontrivial in the matrix\u0000factor modeling and leads to feasible central limit theorems of the estimated\u0000factors and loadings. We also theoretically establish the convergence rates of\u0000the estimated factor and loading matrices for the GMFM under general conditions\u0000that allow for correlations across samples, rows, and columns. Moreover, we\u0000provide a model selection criterion to determine the numbers of row and column\u0000factors consistently. To numerically compute the constraint maximum likelihood\u0000estimator, we provide two algorithms: two-stage alternating maximization and\u0000minorization maximization. Extensive simulation studies demonstrate GMFM's\u0000superiority in handling discrete and mixed-type variables. An empirical data\u0000analysis of the company's operating performance shows that GMFM does clustering\u0000and reconstruction well in the presence of discontinuous entries in the data\u0000matrix.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"29 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}