Pub Date: 2025-12-31; DOI: 10.1093/biostatistics/kxaf024
Phillip B Nicol, Jeffrey W Miller
Dimensionality reduction is a critical step in the analysis of single-cell RNA-seq (scRNA-seq) data. The standard approach is to apply a transformation to the count matrix followed by principal components analysis (PCA). However, this approach can induce spurious heterogeneity and mask true biological variability. An alternative approach is to directly model the counts, but existing methods tend to be computationally intractable on large datasets and do not quantify uncertainty in the low-dimensional representation. To address these problems, we develop scGBM, a novel method for model-based dimensionality reduction of scRNA-seq data using a Poisson bilinear model. We introduce a fast estimation algorithm to fit the model using iteratively reweighted singular value decompositions, enabling the method to scale to datasets with millions of cells. Furthermore, scGBM quantifies the uncertainty in each cell's latent position and leverages these uncertainties to assess the confidence associated with a given cell clustering. On real and simulated single-cell data, we find that scGBM produces low-dimensional embeddings that better capture relevant biological information while removing unwanted variation.
Title: Model-based dimensionality reduction for single-cell RNA-seq using generalized bilinear models.
Journal: Biostatistics. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12342792/pdf/
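To make the idea of a Poisson bilinear model concrete, the following is a minimal simulation sketch rather than the scGBM implementation. It assumes, for illustration only, that the log-mean of each count is a gene intercept plus a cell intercept plus a low-rank term with orthonormal factors and positive scales; all variable names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n_genes, n_cells, n_factors = 200, 100, 2

# Assumed additive intercepts for genes (rows) and cells (columns).
gene_intercepts = rng.normal(0.0, 1.0, size=n_genes)
cell_intercepts = rng.normal(0.0, 0.5, size=n_cells)

# Orthonormal factor matrices and positive scales, mirroring an SVD.
U, _ = np.linalg.qr(rng.normal(size=(n_genes, n_factors)))
V, _ = np.linalg.qr(rng.normal(size=(n_cells, n_factors)))
scales = np.array([30.0, 15.0])

# Bilinear predictor on the log scale, then Poisson counts.
log_mean = (gene_intercepts[:, None]
            + cell_intercepts[None, :]
            + (U * scales) @ V.T)
counts = rng.poisson(np.exp(log_mean))

# Low-dimensional cell representation implied by the simulated factors.
cell_scores = V * scales
```

Here the counts are modeled directly, so no log or normalization transform is applied before the low-rank structure enters; fitting such a model (for example by iteratively reweighted SVDs, as the abstract describes) is not shown.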
Pub Date: 2025-12-31; DOI: 10.1093/biostatistics/kxaf034
Dirk Douwes-Schultz, Alexandra M Schmidt, Laís Picinini Freitas, Marilia Sá Carvalho
Univariate zero-inflated models are increasingly being used to account for excess zeros in spatio-temporal infectious disease counts. However, the multivariate case is challenging due to the need to account for correlations across space, time and disease in both the count and zero-inflated components of the model. We are interested in comparing the transmission dynamics of several co-circulating infectious diseases across space and time, where some of the diseases can be absent for long periods. We first assume there is a baseline disease that is well-established and always present in the region. The other diseases switch between periods of presence and absence in each area through a series of coupled Markov chains, which account for long periods of disease absence, disease interactions and disease spread from neighboring areas. Since we are mainly interested in comparing the diseases, we assume the cases of the present diseases in an area jointly follow an autoregressive multinomial model. We use the multinomial model to investigate whether there are associations between certain factors, such as temperature, and differences in the transmission intensity of the diseases. Inference is performed using efficient Bayesian Markov chain Monte Carlo methods based on jointly sampling all unknown presence indicators. We apply the model to spatio-temporal counts of dengue, Zika, and chikungunya cases in Rio de Janeiro, during the first triple epidemic there.
Title: Markov switching zero-inflated space-time multinomial models for comparing multiple infectious diseases.
Journal: Biostatistics. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12596980/pdf/
Pub Date: 2025-12-31; DOI: 10.1093/biostatistics/kxaf035
Luka Kovačević, Weishi Chen, Helen Barnett, Thomas Jaki, Pavel Mozgunov
Phase I clinical trials are essential to bringing novel therapies from chemical development to widespread use. Traditional approaches to dose-finding in Phase I trials, such as the '3 + 3' method and the continual reassessment method (CRM), provide a principled approach for escalating across dose levels. However, these methods lack the ability to incorporate uncertainty regarding the dose-toxicity ordering as found in combination drug trials. Under this setting, dose levels vary across multiple drugs simultaneously, leading to multiple possible dose-toxicity orderings. The CRM for partial ordering (POCRM) extends to these settings by allowing for multiple dose-toxicity orderings. In this work, it is shown that the POCRM is vulnerable to 'estimation incoherency' whereby toxicity estimates shift in an illogical way, threatening patient safety and undermining clinician trust in dose-finding models. To this end, the Bayesian model averaged POCRM (BMA-POCRM) is formalized. BMA-POCRM uses Bayesian model averaging to take into account all possible orderings simultaneously, reducing the frequency of estimation incoherencies. We derive novel theoretical guarantees on the estimation coherency of the POCRM and BMA-POCRM. The effectiveness of BMA-POCRM in drug combination settings is demonstrated through a specific instance of estimate incoherency of POCRM and simulation studies. The results highlight the improved safety, accuracy, and reduced occurrence of estimate incoherency in trials applying the BMA-POCRM relative to the POCRM model.
Title: Bayesian model averaging for partial ordering continual reassessment methods.
Journal: Biostatistics. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12538209/pdf/
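The contrast between conditioning on one ordering and averaging over all orderings can be sketched numerically. The example below is an illustration under stated assumptions, not the authors' implementation: it assumes a one-parameter power model with a normal prior, a hypothetical skeleton, two candidate orderings of four combinations, and made-up trial data, and it integrates over the model parameter on a simple grid.

```python
import numpy as np

# Hypothetical skeleton (prior toxicity guesses by rank) and trial data.
skeleton = np.array([0.10, 0.20, 0.30, 0.45])
# orderings[m][k] = index of the combination ranked k-th least toxic.
orderings = [np.array([0, 1, 2, 3]), np.array([0, 2, 1, 3])]
prior_weights = np.array([0.5, 0.5])
n_treated = np.array([3, 6, 3, 0])
n_toxic = np.array([0, 1, 2, 0])

# Grid over the model parameter a with an unnormalised N(0, 1.34) prior;
# the missing constant and the grid spacing cancel in every ratio below.
a_grid = np.linspace(-4.0, 4.0, 2001)
prior_density = np.exp(-0.5 * a_grid ** 2 / 1.34)

def fit_ordering(order):
    # Assign skeleton values to combinations according to this ordering.
    skel = np.empty_like(skeleton)
    skel[order] = skeleton
    # Power model: p_j(a) = skel_j ** exp(a), evaluated on the grid.
    p = skel[None, :] ** np.exp(a_grid)[:, None]
    lik = np.prod(p ** n_toxic * (1 - p) ** (n_treated - n_toxic), axis=1)
    w = lik * prior_density
    marginal = w.sum()
    post_mean_tox = (p * w[:, None]).sum(axis=0) / marginal
    return marginal, post_mean_tox

fits = [fit_ordering(o) for o in orderings]
marginals = np.array([m for m, _ in fits])
post_weights = prior_weights * marginals / np.sum(prior_weights * marginals)

# Single-ordering estimate: condition on the most probable ordering.
pocrm_estimate = fits[int(np.argmax(post_weights))][1]
# Model-averaged estimate: weight each ordering's estimate by its posterior.
bma_estimate = sum(w * est for w, (_, est) in zip(post_weights, fits))
```

Because the single-ordering estimate can jump when the most probable ordering changes between cohorts, while the averaged estimate moves continuously with the posterior weights, this sketch gives some intuition for why averaging may reduce abrupt shifts in toxicity estimates.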
Pub Date: 2025-12-31; DOI: 10.1093/biostatistics/kxaf026
Hiroe Seto, Shuji Kitora, Asuka Oyama, Hiroshi Toki, Ryohei Yamamoto, Michio Yamamoto
In developing risk prediction models for specific diseases, it is essential to evaluate the calibration performance of the prediction model. Various methods have been proposed to assess the calibration of prediction models, but it has been pointed out that conventional methods based on the predicted probability of the model are insufficient to detect miscalibration. Another problem is that a method for evaluating calibration for continuous variables of interest has not yet been established. We therefore propose two methods to evaluate the calibration of the variable of interest: the variable-based probabilistic calibration plot (VPC-Plot), which is a visual assessment, and the variable-based probabilistic calibration error (VPCE), which is a corresponding evaluation metric. We conducted theoretical and simulation studies to investigate the properties and effectiveness of the proposed methods. These studies demonstrated that the proposed methods can detect miscalibration by evaluating the calibration based on the variable of interest, even when conventional methods fail to detect it. To show their usefulness in real-world data analysis, we evaluated diabetes prediction models developed using the national health insurance database for Osaka, Japan. We show that the proposed methods can identify miscalibration of a key covariate in a diabetes prediction model.
Title: Variable-based probabilistic calibration with binary outcome.
Journal: Biostatistics.
Pub Date: 2025-12-31; DOI: 10.1093/biostatistics/kxaf033
Emily Somerset, Justin J Slater, Patrick E Brown
We introduce a hierarchical Bayesian framework for reconstructing epidemic curves using under-reported case counts and wastewater data. Our approach models wastewater signals as differentiable Gaussian processes, enabling inference on their relative growth rates, which are used to define a wastewater-based reproduction rate. These estimates are incorporated into a binomially thinned Poisson autoregressive model for case counts using a modular inference strategy. We apply this framework to reconstruct the Covid-19 epidemic curve in Toronto, validating our model through out-of-sample forecasts and comparisons with independent serosurvey-based cumulative incidence estimates. We also apply the framework to New Zealand's Covid-19 data to reconstruct its epidemic curve and demonstrate improvements over an existing joint model for wastewater and case data. A key advantage of our framework, highlighted in this comparison, is that it does not rely on pre-specified constant parameters, allowing the model to better adapt to evolving pandemic conditions.
Title: Wastewater-based reproduction rates for epidemic curve reconstruction.
Journal: Biostatistics. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12533577/pdf/
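A binomially thinned Poisson autoregression can be illustrated with a short simulation. This is a sketch under simplifying assumptions, not the authors' model: the reproduction rate is a hypothetical step function rather than a quantity inferred from wastewater, and the reporting probability is treated as a known constant purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 60

# Hypothetical time-varying reproduction rate: growth, then decline.
R = np.where(np.arange(T) < 30, 1.15, 0.9)
report_prob = 0.3  # assumed probability that a true case is reported

true_cases = np.empty(T, dtype=np.int64)
true_cases[0] = 50
for t in range(1, T):
    # Poisson autoregression: expected new cases scale with the previous count.
    true_cases[t] = rng.poisson(R[t] * true_cases[t - 1])

# Binomial thinning: each true case is reported independently.
reported_cases = rng.binomial(true_cases, report_prob)

# Naive reconstruction that treats the reporting probability as known.
reconstructed = reported_cases / report_prob
```

The reported series is always bounded above by the true series, which is the under-reporting structure that reconstruction has to undo; in practice neither the reproduction rate nor the reporting probability is known.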
Pub Date: 2025-12-31; DOI: 10.1093/biostatistics/kxaf040
Cathy Shyr, Boyu Ren, Prasad Patil, Giovanni Parmigiani
Heterogeneous treatment effect (HTE) refers to the nonrandom, explainable variation in treatment effects for individuals in a population. HTE estimation is central to precision medicine, where accurate effect estimates can inform personalized treatment decisions. In practice, patients can present with covariate profiles that overlap with multiple studies, raising the challenge of optimally informing treatment decisions in a multi-study setting. We proposed a flexible statistical machine learning (ML) framework, the multi-study $R$-learner, that leverages multiple studies to estimate the HTE. Existing multi-study approaches often assume that study-specific (i) conditional average treatment effect (CATE), (ii) expected potential outcome under no treatment given covariates, and (iii) treatment assignment mechanism are identical across studies, but these assumptions may not hold in practice due to differences in study populations, protocols, or designs. To this end, we developed our framework to directly account for these three types of between-study heterogeneity. It builds upon recent advances in cross-study learning and uses a data-adaptive objective function to combine cross-study estimates of nuisance functions with study-specific CATEs via membership probabilities, which enable information to be borrowed across studies. The multi-study $R$-learner extends the $R$-learner to the multi-study setting and is flexible in its ability to incorporate ML techniques. In the series estimation framework, we showed that the proposed method is asymptotically normal and more efficient than the $R$-learner when there is between-study heterogeneity in the treatment assignment mechanisms. We illustrated using cancer data from randomized controlled trials and observational studies that the multi-study $R$-learner performs favorably in the presence of between-study heterogeneity.
Title: Multi-study R-learner for estimating heterogeneous treatment effects across studies using statistical machine learning.
Journal: Biostatistics. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12713001/pdf/
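For readers unfamiliar with the single-study R-learner that the multi-study version extends, the following is a minimal sketch of its residual-on-residual regression. It is not the multi-study method: it assumes one simulated randomized study with a known propensity of 0.5, a linear CATE, and linear nuisance models, and it omits the cross-fitting that the R-learner normally uses. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
X = rng.normal(size=(n, 2))
propensity = 0.5  # known treatment probability in this simulated trial
W = rng.binomial(1, propensity, size=n)
tau_true = 1.0 + 2.0 * X[:, 1]          # true CATE, linear in X
Y = X[:, 0] + tau_true * W + rng.normal(size=n)

# Nuisance step: estimate m(x) = E[Y | X = x] with a linear model.
design = np.column_stack([np.ones(n), X])
m_coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
m_hat = design @ m_coef

# R-learner step: regress outcome residuals on treatment residuals
# multiplied by the CATE basis, which minimises the R-loss for a linear CATE.
Y_res = Y - m_hat
W_res = W - propensity
tau_coef, *_ = np.linalg.lstsq(W_res[:, None] * design, Y_res, rcond=None)
```

With this setup `tau_coef` should land close to the true coefficients (intercept 1, slope 0 on the first covariate, slope 2 on the second); the multi-study extension described in the abstract additionally combines study-specific CATEs and nuisance estimates via membership probabilities, which is not attempted here.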
Pub Date: 2025-12-31; DOI: 10.1093/biostatistics/kxaf029
Daniel A Spencer, Rene Gutierrez, Rajarshi Guhaniyogi, Russell T Shinohara, Raquel Prado
Modeling with multidimensional arrays, or tensors, often presents a problem due to high dimensionality. In addition, these structures typically exhibit inherent sparsity, requiring the use of regularization methods to properly characterize an association between a tensor covariate and a scalar response. We propose a Bayesian method to efficiently model a scalar response with a tensor covariate using the Tucker tensor decomposition in order to retain the spatial relationship within a tensor coefficient, while reducing the number of parameters varying within the model and applying regularization methods. Simulated data are analyzed to compare the model to recently proposed methods. A neuroimaging analysis using data from the Alzheimer's Disease Neuroimaging Initiative shows improved inferential performance compared with other tensor regression methods.
Keywords: Bayesian analysis; tensor decomposition; image analysis; spatial statistics; statistical modeling.
Title: Bayesian scalar-on-tensor regression using the Tucker decomposition for sparse spatial modeling.
Journal: Biostatistics.
Pub Date: 2025-12-31; DOI: 10.1093/biostatistics/kxaf039
Tom Chen, Fan Li, Rui Wang
Estimating parameters corresponding to mean outcomes and their intricate association structures in cluster randomized trials (CRTs) can pose significant methodological challenges. This paper introduces a novel framework that leverages network concepts to represent complex dependency structures and estimate these parameters using generalized estimating equations (GEE). We focus on modeling complex correlation structures by partitioning observations into potentially overlapping groups of interrelated data, where observations are assumed locally exchangeable within each group. This network GEE framework is inherently flexible, and we demonstrate its application to multiple exchangeable structures (simple, nested, block), moving average structures, and exponential decay structures. Furthermore, to address computational challenges arising in GEEs with large cluster sizes, we present the networkGEE R package, enabling the fitting of models beyond the capabilities of existing statistical software. The proposed methods are evaluated through extensive simulation studies. To illustrate their practical application, we analyze data from the Washington State Expedited Partners Therapy trial, a stepped-wedge CRT designed to assess the impact of a public health intervention aimed at reducing sexually transmitted infections through free patient-delivered partner therapy.
Title: Network generalized estimating equations for complexly correlated data with applications to cluster randomized trials.
Journal: Biostatistics. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12766921/pdf/
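One of the structures named above, the nested exchangeable correlation, can be written down directly as a working correlation matrix. The sketch below is illustrative only and does not use or reproduce the networkGEE package; the function name, its arguments, and the correlation values are hypothetical.

```python
import numpy as np

def nested_exchangeable(subcluster_ids, alpha_within, alpha_between):
    """Working correlation for one cluster: alpha_within for pairs in the
    same subcluster, alpha_between for pairs in different subclusters."""
    ids = np.asarray(subcluster_ids)
    same_group = ids[:, None] == ids[None, :]
    R = np.where(same_group, alpha_within, alpha_between).astype(float)
    np.fill_diagonal(R, 1.0)
    return R

# A cluster of five observations: three in subcluster 0, two in subcluster 1.
R = nested_exchangeable([0, 0, 0, 1, 1], alpha_within=0.10, alpha_between=0.04)
```

Each distinct correlation value corresponds to a group of observation pairs that are treated as exchangeable, which is the sense in which such structures can be viewed as overlapping groups of interrelated data.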
Pub Date: 2025-12-31; DOI: 10.1093/biostatistics/kxaf004
Yixi Xu, Yi Zhao
This study introduces a mediation analysis framework when the mediator is a graph. A Gaussian covariance graph model is assumed for graph representation. Causal estimands and assumptions are discussed under this representation. With a covariance matrix as the mediator, a low-rank representation is introduced and parametric mediation models are considered under the structural equation modeling framework. Assuming Gaussian random errors, likelihood-based estimators are introduced to simultaneously identify the low-rank representation and causal parameters. An efficient computational algorithm is proposed and asymptotic properties of the estimators are investigated. The performance of the proposed approach is evaluated via simulation studies. In an application to a resting-state fMRI study, a brain network is identified within which functional connectivity mediates the sex difference in the performance of a motor task.
Title: Mediation analysis with graph mediator.
Journal: Biostatistics. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11979487/pdf/
Pub Date: 2025-12-31 DOI: 10.1093/biostatistics/kxaf001
Sandra E Safo, Han Lu
There is still more to learn about the pathobiology of coronavirus disease (COVID-19) despite 4 years of the pandemic. A multiomics approach offers a comprehensive view of the disease and has the potential to yield deeper insight into its pathogenesis. Previous multiomics integrative analysis and prediction studies of COVID-19 severity and status have assumed simple (i.e., linear) relationships between omics data and between omics data and COVID-19 outcomes. However, these linear methods do not account for the inherent nonlinear structure of these different types of data. The motivation behind this work is to model nonlinear relationships in multiomics and COVID-19 outcomes, and to determine key multidimensional molecules associated with the disease. Toward this goal, we develop scalable randomized kernel methods for jointly associating data from multiple sources or views and simultaneously predicting an outcome or classifying a unit into one of 2 or more classes. We also determine variables or groups of variables that best contribute to the relationships among the views. We use the idea that random Fourier bases can approximate shift-invariant kernel functions to construct nonlinear mappings of each view, and we use these mappings and the outcome variable to learn view-independent low-dimensional representations. We demonstrate the effectiveness of the proposed methods through extensive simulations. When the proposed methods were applied to gene expression, metabolomics, proteomics, and lipidomics data pertaining to COVID-19, we identified several molecular signatures for COVID-19 status and severity. Our results agree with previous findings and suggest potential avenues for future research. Our algorithms are implemented in PyTorch, interfaced in R, and available at: https://github.com/lasandrall/RandMVLearn.
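The random Fourier idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the RandMVLearn implementation: the Gaussian (RBF) kernel, the function names `rbf_kernel` and `make_rff_map`, and the parameters `gamma` and `n_features` are assumptions chosen for illustration. It shows only that an inner product of randomized cosine features approximates a shift-invariant kernel.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """Exact Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def make_rff_map(dim, n_features, gamma, seed=0):
    """Return a map z such that dot(z(x), z(y)) approximates rbf_kernel(x, y, gamma).

    Frequencies are drawn from the kernel's spectral density, which for
    exp(-gamma * ||d||^2) is Gaussian with standard deviation sqrt(2 * gamma).
    All names here are hypothetical and used only for this illustration.
    """
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 * gamma)
    # One random frequency vector and one random phase per output feature.
    freqs = [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [
            scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(freqs, phases)
        ]

    return feature_map


if __name__ == "__main__":
    x = [0.3, -1.2, 0.5]
    y = [0.1, -0.7, 1.0]
    z = make_rff_map(dim=3, n_features=5000, gamma=0.5, seed=1)
    approx = sum(a * b for a, b in zip(z(x), z(y)))
    print("exact kernel:", rbf_kernel(x, y, 0.5))
    print("random-feature approximation:", approx)
```

Because the feature map is explicit and finite-dimensional, downstream steps can work with ordinary linear algebra on the mapped data instead of an n-by-n kernel matrix, which is the usual motivation for this kind of approximation on large datasets.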
Scalable randomized kernel methods for multiview data integration and prediction with application to Coronavirus disease. Sandra E Safo, Han Lu. Biostatistics 26(1), 2025-12-31. DOI: 10.1093/biostatistics/kxaf001. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11839864/pdf/