Biostatistics: Latest Articles (Page 3)

Model-based dimensionality reduction for single-cell RNA-seq using generalized bilinear models.

IF 2, Zone 3 (Mathematics), Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biostatistics

Pub Date : 2025-12-31 DOI: 10.1093/biostatistics/kxaf024

Phillip B Nicol, Jeffrey W Miller

Dimensionality reduction is a critical step in the analysis of single-cell RNA-seq (scRNA-seq) data. The standard approach is to apply a transformation to the count matrix followed by principal components analysis (PCA). However, this approach can induce spurious heterogeneity and mask true biological variability. An alternative approach is to directly model the counts, but existing methods tend to be computationally intractable on large datasets and do not quantify uncertainty in the low-dimensional representation. To address these problems, we develop scGBM, a novel method for model-based dimensionality reduction of scRNA-seq data using a Poisson bilinear model. We introduce a fast estimation algorithm to fit the model using iteratively reweighted singular value decompositions, enabling the method to scale to datasets with millions of cells. Furthermore, scGBM quantifies the uncertainty in each cell's latent position and leverages these uncertainties to assess the confidence associated with a given cell clustering. On real and simulated single-cell data, we find that scGBM produces low-dimensional embeddings that better capture relevant biological information while removing unwanted variation.
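
The Poisson bilinear model named in the abstract can be sketched as follows. The sketch assumes the usual generalized bilinear form, log mu[i, j] = alpha[i] + beta[j] + sum_m sigma[m] * U[i, m] * V[j, m], with gene intercepts, cell intercepts, and a rank-M latent term. All function and variable names are illustrative and are not taken from the scGBM software, and the iteratively reweighted SVD fitting step is not reproduced here.

```python
import numpy as np

def bilinear_mean(alpha, beta, U, sigma, V):
    """Mean matrix (genes x cells) of a Poisson bilinear model.

    log mu[i, j] = alpha[i] + beta[j] + sum_m sigma[m] * U[i, m] * V[j, m]
    """
    eta = alpha[:, None] + beta[None, :] + (U * sigma) @ V.T
    return np.exp(eta)

def poisson_loglik(Y, mu):
    """Poisson log-likelihood up to the additive constant -sum(log(Y!))."""
    return float(np.sum(Y * np.log(mu) - mu))

rng = np.random.default_rng(0)
n_genes, n_cells, rank = 50, 30, 2
alpha = rng.normal(0.0, 0.5, n_genes)    # gene-level intercepts (assumed values)
beta = rng.normal(0.0, 0.5, n_cells)     # cell-level intercepts, e.g. sequencing depth
U, _ = np.linalg.qr(rng.normal(size=(n_genes, rank)))  # orthonormal gene loadings
V, _ = np.linalg.qr(rng.normal(size=(n_cells, rank)))  # orthonormal cell scores
sigma = np.array([8.0, 4.0])             # strength of each latent factor

mu = bilinear_mean(alpha, beta, U, sigma, V)
Y = rng.poisson(mu)                      # simulated count matrix
embedding = V * sigma                    # low-dimensional coordinates, one row per cell
```

Modeling the counts directly means the embedding is read off the latent term rather than from a PCA of transformed counts, which is the contrast the abstract draws.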

Citations: 0

Markov switching zero-inflated space-time multinomial models for comparing multiple infectious diseases.

IF 2, Zone 3 (Mathematics), Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biostatistics

Pub Date : 2025-12-31 DOI: 10.1093/biostatistics/kxaf034

Dirk Douwes-Schultz, Alexandra M Schmidt, Laís Picinini Freitas, Marilia Sá Carvalho

Univariate zero-inflated models are increasingly being used to account for excess zeros in spatio-temporal infectious disease counts. However, the multivariate case is challenging due to the need to account for correlations across space, time and disease in both the count and zero-inflated components of the model. We are interested in comparing the transmission dynamics of several co-circulating infectious diseases across space and time, where some of the diseases can be absent for long periods. We first assume there is a baseline disease that is well-established and always present in the region. The other diseases switch between periods of presence and absence in each area through a series of coupled Markov chains, which account for long periods of disease absence, disease interactions and disease spread from neighboring areas. Since we are mainly interested in comparing the diseases, we assume the cases of the present diseases in an area jointly follow an autoregressive multinomial model. We use the multinomial model to investigate whether there are associations between certain factors, such as temperature, and differences in the transmission intensity of the diseases. Inference is performed using efficient Bayesian Markov chain Monte Carlo methods based on jointly sampling all unknown presence indicators. We apply the model to spatio-temporal counts of dengue, Zika, and chikungunya cases in Rio de Janeiro, during the first triple epidemic there.
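
The presence/absence switching idea can be illustrated with a deliberately reduced sketch: one disease in one area, a two-state Markov chain for presence, and Poisson counts only while the disease is present. The transition probabilities, the Poisson mean, and the function name are assumptions made for illustration; the coupled chains, neighbor effects, and autoregressive multinomial component of the paper are not modeled here.

```python
import numpy as np

def simulate_switching_counts(T, p_emerge, p_persist, mean_when_present, rng):
    """Simulate presence indicators from a two-state Markov chain, then counts.

    present[t] == 1 means the disease circulates in period t. Counts are
    Poisson when present and exactly zero when absent (the excess zeros).
    """
    present = np.zeros(T, dtype=int)
    counts = np.zeros(T, dtype=int)
    for t in range(1, T):
        # Probability of presence depends on the previous state.
        p = p_persist if present[t - 1] == 1 else p_emerge
        present[t] = rng.random() < p
        if present[t]:
            counts[t] = rng.poisson(mean_when_present)
    return present, counts

rng = np.random.default_rng(1)
present, counts = simulate_switching_counts(
    T=200, p_emerge=0.05, p_persist=0.9, mean_when_present=6.0, rng=rng
)
```

A small emergence probability combined with a high persistence probability produces the long runs of absence and presence that a plain zero-inflated count model does not capture.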

Citations: 0

Bayesian model averaging for partial ordering continual reassessment methods.

IF 2, Zone 3 (Mathematics), Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biostatistics

Pub Date : 2025-12-31 DOI: 10.1093/biostatistics/kxaf035

Luka Kovačević, Weishi Chen, Helen Barnett, Thomas Jaki, Pavel Mozgunov

Phase I clinical trials are essential to bringing novel therapies from chemical development to widespread use. Traditional approaches to dose-finding in Phase I trials, such as the '3 + 3' method and the continual reassessment method (CRM), provide a principled approach for escalating across dose levels. However, these methods lack the ability to incorporate uncertainty regarding the dose-toxicity ordering as found in combination drug trials. Under this setting, dose levels vary across multiple drugs simultaneously, leading to multiple possible dose-toxicity orderings. The CRM for partial ordering (POCRM) extends to these settings by allowing for multiple dose-toxicity orderings. In this work, it is shown that the POCRM is vulnerable to 'estimation incoherency' whereby toxicity estimates shift in an illogical way, threatening patient safety and undermining clinician trust in dose-finding models. To this end, the Bayesian model averaged POCRM (BMA-POCRM) is formalized. BMA-POCRM uses Bayesian model averaging to take into account all possible orderings simultaneously, reducing the frequency of estimation incoherencies. We derive novel theoretical guarantees on the estimation coherency of the POCRM and BMA-POCRM. The effectiveness of BMA-POCRM in drug combination settings is demonstrated through a specific instance of estimate incoherency of POCRM and simulation studies. The results highlight the improved safety, accuracy, and reduced occurrence of estimate incoherency in trials applying the BMA-POCRM relative to the POCRM model.
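
The difference between committing to one ordering and averaging over orderings can be sketched numerically. The example assumes a one-parameter power-model CRM with a normal prior, evaluated on a grid; the skeleton, prior standard deviation, orderings, and trial data are all made up for illustration and are not taken from the paper.

```python
import numpy as np

def crm_fit(skeleton, ranks, dose_idx, tox, prior_sd=1.16, grid=None):
    """One-parameter power-model CRM under a single assumed ordering.

    ranks[d] is the position of combination d in the ordering, so its toxicity
    probability is skeleton[ranks[d]] ** exp(a), with a ~ N(0, prior_sd**2).
    Returns (marginal likelihood, posterior mean toxicity per combination).
    """
    if grid is None:
        grid = np.linspace(-4.0, 4.0, 801)
    p = skeleton[ranks][None, :] ** np.exp(grid)[:, None]    # (grid, combinations)
    p_obs = p[:, dose_idx]                                    # (grid, patients)
    lik = np.prod(np.where(tox == 1, p_obs, 1.0 - p_obs), axis=1)
    prior = np.exp(-0.5 * (grid / prior_sd) ** 2) / (prior_sd * np.sqrt(2 * np.pi))
    w = lik * prior                                           # unnormalized posterior
    marginal = w.sum() * (grid[1] - grid[0])                  # Riemann-sum integral
    post_tox = (w[:, None] * p).sum(axis=0) / w.sum()
    return marginal, post_tox

skeleton = np.array([0.05, 0.15, 0.30, 0.45])
orderings = np.array([[0, 1, 2, 3],      # combination 1 ranked below combination 2
                      [0, 2, 1, 3]])     # combination 2 ranked below combination 1
dose_idx = np.array([0, 0, 1, 1, 2, 2])  # combination given to each patient
tox = np.array([0, 0, 0, 1, 0, 0])       # observed dose-limiting toxicities

fits = [crm_fit(skeleton, r, dose_idx, tox) for r in orderings]
marginals = np.array([m for m, _ in fits])
estimates = np.vstack([e for _, e in fits])

prior_m = np.full(len(orderings), 1.0 / len(orderings))
post_m = prior_m * marginals / np.sum(prior_m * marginals)   # posterior ordering weights

pocrm_estimate = estimates[np.argmax(post_m)]   # POCRM: use the most probable ordering
bma_estimate = post_m @ estimates               # BMA: weight every ordering
```

Because the averaged estimate is a convex combination of the per-ordering estimates, it changes smoothly as the ordering weights change, whereas the selected-ordering estimate can jump when the most probable ordering switches.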

Citations: 0

Variable-based probabilistic calibration with binary outcome.

IF 2, Zone 3 (Mathematics), Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biostatistics

Pub Date : 2025-12-31 DOI: 10.1093/biostatistics/kxaf026

Hiroe Seto, Shuji Kitora, Asuka Oyama, Hiroshi Toki, Ryohei Yamamoto, Michio Yamamoto

In developing risk prediction models for specific diseases, it is essential to evaluate the calibration performance of the prediction model. Various methods have been proposed to assess the calibration of prediction models, but it has been pointed out that conventional methods based on the predicted probability of the model are insufficient to detect miscalibration. Another problem is that a method for evaluating calibration for continuous variables of interest has not yet been established. We therefore propose two methods to evaluate the calibration of the variable of interest: the variable-based probabilistic calibration plot (VPC-Plot), which is a visual assessment, and the variable-based probabilistic calibration error (VPCE), which is a corresponding evaluation metric. We conducted theoretical and simulation studies to investigate the properties and effectiveness of the proposed methods. These studies demonstrated that the proposed methods can detect miscalibration by evaluating the calibration based on the variable of interest, even when conventional methods fail to detect it. To show their usefulness in real-world data analysis, we evaluated diabetes prediction models developed using the national health insurance database for Osaka, Japan. We show that the proposed methods can identify miscalibration of a key covariate in a diabetes prediction model.
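
Why calibration should be checked along a covariate, and not only along the predicted probability, can be shown with a small binned sketch. This is an illustrative analogue only: the binning scheme and the size-weighted absolute gap below are assumptions, and the paper's VPC-Plot and VPCE definitions may differ.

```python
import numpy as np

def binned_variable_calibration(x, y, p_hat, n_bins=10):
    """Compare observed event rates with mean predicted risk across bins of x.

    Returns per-bin tuples (mean x, observed rate, mean predicted risk, count)
    and a size-weighted mean absolute gap between observed and predicted.
    """
    x, y, p_hat = map(np.asarray, (x, y, p_hat))
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))[1:-1]  # interior cut points
    bin_id = np.digitize(x, edges)
    rows, gap = [], 0.0
    for b in np.unique(bin_id):
        mask = bin_id == b
        obs, pred = y[mask].mean(), p_hat[mask].mean()
        rows.append((x[mask].mean(), obs, pred, int(mask.sum())))
        gap += mask.mean() * abs(obs - pred)   # weight by the bin's share of the data
    return rows, gap

# Toy data: a constant prediction of 0.5 matches the overall event rate,
# but the event rate differs sharply between x = 0 and x = 1.
x = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1, 1, 0])
p_hat = np.full(8, 0.5)
rows, gap = binned_variable_calibration(x, y, p_hat, n_bins=2)
```

In this toy case a calibration check based on predicted probability sees a single group with observed rate 0.5 and predicted risk 0.5, while the check along x exposes a gap of 0.25 in each bin.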

Citations: 0

Wastewater-based reproduction rates for epidemic curve reconstruction.

IF 2, Zone 3 (Mathematics), Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biostatistics

Pub Date : 2025-12-31 DOI: 10.1093/biostatistics/kxaf033

Emily Somerset, Justin J Slater, Patrick E Brown

We introduce a hierarchical Bayesian framework for reconstructing epidemic curves using under-reported case counts and wastewater data. Our approach models wastewater signals as differentiable Gaussian processes, enabling inference on their relative growth rates, which are used to define a wastewater-based reproduction rate. These estimates are incorporated into a binomially thinned Poisson autoregressive model for case counts using a modular inference strategy. We apply this framework to reconstruct the Covid-19 epidemic curve in Toronto, validating our model through out-of-sample forecasts and comparisons with independent serosurvey-based cumulative incidence estimates. We also apply the framework to New Zealand's Covid-19 data to reconstruct its epidemic curve and demonstrate improvements over an existing joint model for wastewater and case data. A key advantage of our framework, highlighted in this comparison, is that it does not rely on pre-specified constant parameters, allowing the model to better adapt to evolving pandemic conditions.
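
The phrase "binomially thinned Poisson autoregressive model" can be unpacked with a forward simulation. The sketch assumes a one-lag autoregression and treats the growth multipliers as given; in the paper these would come from the wastewater-based reproduction rate, which is not estimated here. Function names and parameter values are hypothetical.

```python
import numpy as np

def simulate_thinned_epidemic(growth, i0, report_prob, rng):
    """Latent infections follow a Poisson autoregression; reported cases are thinned.

    infections[t] ~ Poisson(growth[t] * infections[t - 1])
    cases[t]      ~ Binomial(infections[t], report_prob[t])
    """
    T = len(growth)
    infections = np.zeros(T, dtype=np.int64)
    infections[0] = i0
    for t in range(1, T):
        infections[t] = rng.poisson(growth[t] * infections[t - 1])
    cases = rng.binomial(infections, report_prob)   # under-reporting step
    return infections, cases

rng = np.random.default_rng(2)
T = 60
growth = np.where(np.arange(T) < 30, 1.08, 0.94)    # growth phase, then decline
report_prob = np.linspace(0.5, 0.2, T)              # reporting that worsens over time
infections, cases = simulate_thinned_epidemic(growth, 200, report_prob, rng)
```

Reconstructing the epidemic curve amounts to inverting this process: inferring the latent `infections` series from `cases` when the reporting probability is unknown and changes over time.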

Citations: 0

Multi-study R-learner for estimating heterogeneous treatment effects across studies using statistical machine learning.

IF 2, Zone 3 (Mathematics), Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biostatistics

Pub Date : 2025-12-31 DOI: 10.1093/biostatistics/kxaf040

Cathy Shyr, Boyu Ren, Prasad Patil, Giovanni Parmigiani

Heterogeneous treatment effect (HTE) refers to the nonrandom, explainable variation in treatment effects for individuals in a population. HTE estimation is central to precision medicine, where accurate effect estimates can inform personalized treatment decisions. In practice, patients can present with covariate profiles that overlap with multiple studies, raising the challenge of optimally informing treatment decisions in a multi-study setting. We proposed a flexible statistical machine learning (ML) framework, the multi-study $R$-learner, that leverages multiple studies to estimate the HTE. Existing multi-study approaches often assume that study-specific (i) conditional average treatment effect (CATE), (ii) expected potential outcome under no treatment given covariates, and (iii) treatment assignment mechanism are identical across studies, but these assumptions may not hold in practice due to differences in study populations, protocols, or designs. To this end, we developed our framework to directly account for these three types of between-study heterogeneity. It builds upon recent advances in cross-study learning and uses a data-adaptive objective function to combine cross-study estimates of nuisance functions with study-specific CATEs via membership probabilities, which enable information to be borrowed across studies. The multi-study $R$-learner extends the $R$-learner to the multi-study setting and is flexible in its ability to incorporate ML techniques. In the series estimation framework, we showed that the proposed method is asymptotically normal and more efficient than the $R$-learner when there is between-study heterogeneity in the treatment assignment mechanisms. We illustrated using cancer data from randomized controlled trials and observational studies that the multi-study $R$-learner performs favorably in the presence of between-study heterogeneity.
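
For orientation, the single-study R-learner that the proposed method extends can be sketched in a few lines. This is not the multi-study estimator: there are no study labels or membership probabilities, the CATE is assumed linear in the covariates, and the nuisance functions (conditional mean outcome and propensity score) are taken as known instead of being estimated by ML. All names are illustrative.

```python
import numpy as np

def r_learner_linear(X, W, Y, m_hat, e_hat):
    """Fit tau(x) = b0 + x @ b by minimizing the R-loss

        sum_i [ (Y_i - m_hat_i) - (W_i - e_hat_i) * tau(X_i) ]^2,

    which reduces to ordinary least squares on residualized quantities.
    """
    y_res = Y - m_hat                         # outcome residual
    w_res = W - e_hat                         # treatment residual
    design = np.column_stack([np.ones(len(X)), X]) * w_res[:, None]
    coef, *_ = np.linalg.lstsq(design, y_res, rcond=None)
    return coef

rng = np.random.default_rng(3)
n = 500
X = rng.normal(size=(n, 2))
e = 1.0 / (1.0 + np.exp(-X[:, 0]))            # propensity score, taken as known
W = rng.binomial(1, e)
tau = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]     # true CATE
mu0 = np.sin(X[:, 1])                         # expected outcome under no treatment
Y = mu0 + W * tau                             # noiseless outcomes for a clean check
m = mu0 + e * tau                             # E[Y | X], taken as known
coef = r_learner_linear(X, W, Y, m, e)        # recovers (1.0, 2.0, -0.5)
```

The three ingredients in the code, `tau`, `mu0`, and `e`, correspond to the three quantities (i) to (iii) that the abstract says may differ between studies.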

Citations: 0

Bayesian scalar-on-tensor regression using the Tucker decomposition for sparse spatial modeling.

IF 2, Zone 3 (Mathematics), Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biostatistics

Pub Date : 2025-12-31 DOI: 10.1093/biostatistics/kxaf029

Daniel A Spencer, Rene Gutierrez, Rajarshi Guhaniyogi, Russell T Shinohara, Raquel Prado

Modeling with multidimensional arrays, or tensors, often presents a problem due to high dimensionality. In addition, these structures typically exhibit inherent sparsity, requiring the use of regularization methods to properly characterize an association between a tensor covariate and a scalar response. We propose a Bayesian method to efficiently model a scalar response with a tensor covariate using the Tucker tensor decomposition in order to retain the spatial relationship within a tensor coefficient, while reducing the number of parameters varying within the model and applying regularization methods. Simulated data are analyzed to compare the model to recently proposed methods. A neuroimaging analysis using data from the Alzheimer's Disease Neuroimaging Initiative shows improved inferential performance compared with other tensor regression methods.

Keywords: Bayesian analysis; tensor decomposition; image analysis; spatial statistics; statistical modeling.
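
How a Tucker decomposition reduces the number of free parameters in a tensor coefficient can be checked directly. The sketch uses a two-way tensor (a matrix-valued covariate) with assumed dimensions and ranks, and it omits the Bayesian priors and regularization that the paper applies.

```python
import numpy as np

def tucker_coefficient(G, U1, U2):
    """Rebuild a 2-way coefficient tensor B = G x_1 U1 x_2 U2, i.e. U1 @ G @ U2.T."""
    return np.einsum("ab,ia,jb->ij", G, U1, U2)

rng = np.random.default_rng(4)
p1, p2, r1, r2 = 20, 15, 3, 2                # tensor dimensions and Tucker ranks (assumed)
G = rng.normal(size=(r1, r2))                # core tensor
U1 = rng.normal(size=(p1, r1))               # factor matrix for mode 1
U2 = rng.normal(size=(p2, r2))               # factor matrix for mode 2
B = tucker_coefficient(G, U1, U2)

X = rng.normal(size=(p1, p2))                # one tensor-valued covariate, e.g. an image slice
linear_predictor = np.sum(X * B)             # inner product <X, B>

n_free_full = p1 * p2                        # 300 entries in an unstructured B
n_free_tucker = r1 * r2 + p1 * r1 + p2 * r2  # 96 entries in core plus factors
```

The scalar response would then be modeled as `linear_predictor` plus noise, with the spatial layout of `B` preserved because the factor matrices act along each mode separately.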

Citations: 0

Network generalized estimating equations for complexly correlated data with applications to cluster randomized trials.

IF 2, Zone 3 (Mathematics), Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biostatistics

Pub Date : 2025-12-31 DOI: 10.1093/biostatistics/kxaf039

Tom Chen, Fan Li, Rui Wang

Estimating parameters corresponding to mean outcomes and their intricate association structures in cluster randomized trials (CRTs) can pose significant methodological challenges. This paper introduces a novel framework that leverages network concepts to represent complex dependency structures and estimate these parameters using generalized estimating equations (GEE). We focus on modeling complex correlation structures by partitioning observations into potentially overlapping groups of interrelated data, where observations are assumed locally exchangeable within each group. This network GEE framework is inherently flexible, and we demonstrate its application to multiple exchangeable structures (simple, nested, block), moving average structures, and exponential decay structures. Furthermore, to address computational challenges arising in GEEs with large cluster sizes, we present the networkGEE R package, enabling the fitting of models beyond the capabilities of existing statistical software. The proposed methods are evaluated through extensive simulation studies. To illustrate their practical application, we analyze data from the Washington State Expedited Partners Therapy trial, a stepped-wedge CRT designed to assess the impact of a public health intervention aimed at reducing sexually transmitted infections through free patient-delivered partner therapy.
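
One of the structures listed in the abstract, the nested exchangeable correlation, can be written down explicitly. In the sketch, every pair of observations in a cluster shares one correlation, and pairs in the same subcluster share a larger one, which matches the idea of overlapping groups that are each locally exchangeable. The function is a plain NumPy illustration with assumed parameter values and does not use the networkGEE package or its interface.

```python
import numpy as np

def nested_exchangeable(subcluster, alpha_within, alpha_between):
    """Working correlation matrix for one cluster with nested exchangeable structure.

    Pairs in the same subcluster get alpha_within; all other pairs in the
    cluster get alpha_between; the diagonal is 1.
    """
    subcluster = np.asarray(subcluster)
    same = subcluster[:, None] == subcluster[None, :]
    R = np.where(same, alpha_within, alpha_between).astype(float)
    np.fill_diagonal(R, 1.0)
    return R

# Five observations in one cluster: three in subcluster 0 and two in subcluster 1.
R = nested_exchangeable([0, 0, 0, 1, 1], alpha_within=0.10, alpha_between=0.03)
```

For large clusters, forming and inverting such matrices explicitly becomes expensive, which is the computational problem the abstract says the package addresses.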

Citations: 0

Mediation analysis with graph mediator.

IF 2, Zone 3 (Mathematics), Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biostatistics

Pub Date : 2025-12-31 DOI: 10.1093/biostatistics/kxaf004

Yixi Xu, Yi Zhao

This study introduces a mediation analysis framework when the mediator is a graph. A Gaussian covariance graph model is assumed for graph presentation. Causal estimands and assumptions are discussed under this presentation. With a covariance matrix as the mediator, a low-rank representation is introduced and parametric mediation models are considered under the structural equation modeling framework. Assuming Gaussian random errors, likelihood-based estimators are introduced to simultaneously identify the low-rank representation and causal parameters. An efficient computational algorithm is proposed and asymptotic properties of the estimators are investigated. Via simulation studies, the performance of the proposed approach is evaluated. Applying to a resting-state fMRI study, a brain network is identified within which functional connectivity mediates the sex difference in the performance of a motor task.


Citations: 0

Scalable randomized kernel methods for multiview data integration and prediction with application to Coronavirus disease.

IF 2 Zone 3 (Mathematics) Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biostatistics

Pub Date : 2025-12-31 DOI: 10.1093/biostatistics/kxaf001

Sandra E Safo, Han Lu

There is still more to learn about the pathobiology of coronavirus disease (COVID-19) despite 4 years of the pandemic. A multiomics approach offers a comprehensive view of the disease and has the potential to yield deeper insight into its pathogenesis. Previous multiomics integrative analysis and prediction studies for COVID-19 severity and status have assumed simple (i.e., linear) relationships between omics data and between omics and COVID-19 outcomes. However, these linear methods do not account for the underlying nonlinear structure associated with these different types of data. The motivation behind this work is to model nonlinear relationships in multiomics and COVID-19 outcomes, and to determine key multidimensional molecules associated with the disease. Toward this goal, we develop scalable randomized kernel methods for jointly associating data from multiple sources or views and simultaneously predicting an outcome or classifying a unit into one of two or more classes. We also determine variables or groups of variables that best contribute to the relationships among the views. We use the idea that random Fourier bases can approximate shift-invariant kernel functions to construct nonlinear mappings of each view, and we use these mappings and the outcome variable to learn view-independent low-dimensional representations. We demonstrate the effectiveness of the proposed methods through extensive simulations. When the proposed methods were applied to gene expression, metabolomics, proteomics, and lipidomics data pertaining to COVID-19, we identified several molecular signatures for COVID-19 status and severity. Our results agree with previous findings and suggest potential avenues for future research. Our algorithms are implemented in PyTorch with an R interface and are available at: https://github.com/lasandrall/RandMVLearn.
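
The idea that random Fourier bases can approximate a shift-invariant kernel can be illustrated with a small, generic sketch. This is not the RandMVLearn implementation: it uses NumPy rather than PyTorch, it assumes a Gaussian (RBF) kernel, and the function names `rbf_kernel` and `random_fourier_features`, the bandwidth `gamma`, and the simulated data are all hypothetical choices made for the example.

```python
import numpy as np


def rbf_kernel(X, Y, gamma):
    """Exact Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def random_fourier_features(X, n_features, gamma, rng):
    """Map X (n x d) to Z (n x n_features) so that Z @ Z.T approximates the RBF kernel."""
    d = X.shape[1]
    # For exp(-gamma * ||delta||^2), frequencies are drawn from N(0, 2 * gamma * I).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    # Random phases, uniform on [0, 2*pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)


# Simulated stand-in for one data view (assumption: 50 samples, 5 variables).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
gamma = 0.1

K_exact = rbf_kernel(X, X, gamma)
Z = random_fourier_features(X, n_features=5000, gamma=gamma, rng=rng)
K_approx = Z @ Z.T  # inner products of random features approximate the kernel

max_abs_error = np.max(np.abs(K_exact - K_approx))
print(f"max abs error: {max_abs_error:.3f}")
```

The practical appeal is that downstream models can work with the explicit n x D feature matrix `Z` instead of the n x n kernel matrix, which is what makes this kind of approximation scale to larger sample sizes.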


Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11839864/pdf/

Citations: 0