DeLIVR: a deep learning approach to IV regression for testing nonlinear causal effects in transcriptome-wide association studies.
Pub Date: 2024-04-15 DOI: 10.1093/biostatistics/kxac051
Ruoyu He, Mingyang Liu, Zhaotong Lin, Zhong Zhuang, Xiaotong Shen, Wei Pan
Transcriptome-wide association studies (TWAS) have been increasingly applied to identify (putative) causal genes for complex traits and diseases. TWAS can be regarded as a two-sample two-stage least squares method for instrumental variable (IV) regression for causal inference. The standard TWAS (called TWAS-L) considers only a linear relationship between a gene's expression and a trait in stage 2, which may lose statistical power when the true relationship is nonlinear. A recent extension (called TWAS-LQ) considers both linear and quadratic effects of a gene on a trait, but its parametric form is still restrictive and may be underpowered for nonquadratic nonlinear effects. On the other hand, a deep learning (DL) approach, called DeepIV, has been proposed to nonparametrically model a nonlinear effect in IV regression. However, it is both slow and unstable because it solves an ill-posed inverse problem, an integral equation, with Monte Carlo approximations. Furthermore, the original DeepIV work did not study statistical inference, that is, hypothesis testing. Here, we propose a novel DL approach, called DeLIVR, that overcomes the major drawbacks of DeepIV by estimating a related but different target function and including a hypothesis testing framework. We show through simulations that DeLIVR is both faster and more stable than DeepIV. Applying both the parametric and DL approaches to GTEx and UK Biobank data, we show that DeLIVR detected an additional 8 and 7 genes nonlinearly associated with high-density lipoprotein (HDL) and low-density lipoprotein (LDL) cholesterol, respectively, all of which would be missed by TWAS-L, TWAS-LQ, and DeepIV; these genes include BUD13 (associated with HDL) and SLC44A2 and GMIP (associated with LDL), all supported by previous studies.
Differential transcript usage analysis incorporating quantification uncertainty via compositional measurement error regression modeling.
Pub Date: 2024-04-15 DOI: 10.1093/biostatistics/kxad008
Amber M Young, Scott Van Buren, Naim U Rashid
Differential transcript usage (DTU) occurs when the relative expression of multiple transcripts arising from the same gene changes between conditions. Existing approaches to detect DTU often rely on computational procedures with speed and scalability issues as the number of samples increases. Here we propose a new method, CompDTU, that uses compositional regression to model the relative abundance proportions of each transcript that are of interest in DTU analyses. This procedure leverages fast matrix-based computations that make it ideally suited for DTU analysis with larger sample sizes, and it allows for the testing of, and adjustment for, multiple categorical or continuous covariates. Additionally, many existing approaches for DTU ignore quantification uncertainty in the expression estimates for each transcript in RNA-seq data. We extend CompDTU to incorporate quantification uncertainty, leveraging common output from RNA-seq expression quantification tools, in a novel method, CompDTUme. Through several power analyses, we show that CompDTU has excellent sensitivity and reduces false positive results relative to existing methods. Additionally, CompDTUme further improves on CompDTU, given sufficient sample size, for genes with high levels of quantification uncertainty, while maintaining favorable speed and scalability. We motivate our methods using the Cancer Genome Atlas Breast Invasive Carcinoma data set, specifically RNA-seq data from primary tumors of 740 patients with breast cancer. We show greatly reduced computation time for our new methods as well as the ability to detect several novel genes with significant DTU across breast cancer subtypes.
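A sketch of the compositional-regression idea: map each gene's transcript proportions to isometric log-ratio (ilr) coordinates, then compare conditions with ordinary multivariate regression on those coordinates. The Helmert-type basis and the toy Dirichlet data below are assumptions for illustration, not CompDTU's exact implementation.

```python
import numpy as np

def ilr(P, eps=1e-8):
    """Map n x K proportions to n x (K-1) ilr coordinates."""
    logP = np.log(np.clip(P, eps, None))
    K = P.shape[1]
    H = np.zeros((K - 1, K))                  # orthonormal contrast basis
    for i in range(K - 1):
        H[i, : i + 1] = 1.0 / (i + 1)
        H[i, i + 1] = -1.0
        H[i] /= np.sqrt((i + 2) / (i + 1))    # normalize rows to unit length
    return logP @ H.T                         # rows of H sum to zero

rng = np.random.default_rng(1)
n, K = 60, 3                                  # samples, transcripts of one gene
P = np.vstack([rng.dirichlet([5.0, 3.0, 2.0], n // 2),   # condition A
               rng.dirichlet([3.0, 3.0, 4.0], n // 2)])  # condition B
Y = ilr(P)
# DTU appears as a mean shift in ilr space between conditions (testable,
# e.g., with a Hotelling T^2 or MANOVA-type statistic):
print(Y[: n // 2].mean(axis=0), Y[n // 2 :].mean(axis=0))
```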
{"title":"Differential transcript usage analysis incorporating quantification uncertainty via compositional measurement error regression modeling.","authors":"Amber M Young, Scott Van Buren, Naim U Rashid","doi":"10.1093/biostatistics/kxad008","DOIUrl":"10.1093/biostatistics/kxad008","url":null,"abstract":"<p><p>Differential transcript usage (DTU) occurs when the relative expression of multiple transcripts arising from the same gene changes between different conditions. Existing approaches to detect DTU often rely on computational procedures that can have speed and scalability issues as the number of samples increases. Here we propose a new method, CompDTU, that uses compositional regression to model the relative abundance proportions of each transcript that are of interest in DTU analyses. This procedure leverages fast matrix-based computations that make it ideally suited for DTU analysis with larger sample sizes. This method also allows for the testing of and adjustment for multiple categorical or continuous covariates. Additionally, many existing approaches for DTU ignore quantification uncertainty in the expression estimates for each transcript in RNA-seq data. We extend our CompDTU method to incorporate quantification uncertainty leveraging common output from RNA-seq expression quantification tool in a novel method CompDTUme. Through several power analyses, we show that CompDTU has excellent sensitivity and reduces false positive results relative to existing methods. Additionally, CompDTUme results in further improvements in performance over CompDTU with sufficient sample size for genes with high levels of quantification uncertainty, while also maintaining favorable speed and scalability. We motivate our methods using data from the Cancer Genome Atlas Breast Invasive Carcinoma data set, specifically using RNA-seq data from primary tumors for 740 patients with breast cancer. We show greatly reduced computation time from our new methods as well as the ability to detect several novel genes with significant DTU across different breast cancer subtypes.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":" ","pages":"559-576"},"PeriodicalIF":1.8,"publicationDate":"2024-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11017126/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9683536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast and flexible inference for joint models of multivariate longitudinal and survival data using integrated nested Laplace approximations.
Pub Date: 2024-04-15 DOI: 10.1093/biostatistics/kxad019
Denis Rustand, Janet van Niekerk, Elias Teixeira Krainski, Håvard Rue, Cécile Proust-Lima
Modeling longitudinal and survival data jointly offers many advantages, such as addressing measurement error and missing data in the longitudinal processes, understanding and quantifying the association between the longitudinal markers and the survival events, and predicting the risk of events based on the longitudinal markers. A joint model involves multiple submodels (one for each longitudinal or survival outcome), usually linked through correlated or shared random effects. Estimation is computationally expensive (particularly because of the multidimensional integration of the likelihood over the random effects distribution), so inference rapidly becomes intractable, restricting applications of joint models to a small number of longitudinal markers and/or random effects. We introduce a Bayesian approximation based on the integrated nested Laplace approximation algorithm, implemented in the R package R-INLA, to alleviate the computational burden and allow the estimation of multivariate joint models with fewer restrictions. Our simulation studies show that R-INLA substantially reduces computation time and the variability of the parameter estimates compared with alternative estimation strategies. We further apply the methodology to analyze five longitudinal markers (3 continuous, 1 count, 1 binary, with 16 random effects) and competing risks of death and transplantation in a clinical trial on primary biliary cholangitis. R-INLA provides a fast and reliable inference technique for applying joint models to the complex multivariate data encountered in health research.
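A toy generative sketch of the shared-random-effect linkage (not the R-INLA estimation itself): a longitudinal marker with a subject-level random intercept that also scales the event hazard. All parameter values are illustrative assumptions.

```python
# Joint model simulation: marker_ij = 2 + b_i + 0.3*t_j + noise, and event
# hazard h_i(t) = h0 * exp(gamma * b_i). Given b_i the hazard is constant
# in t, so event times are exponential with rate h0 * exp(gamma * b_i).
import numpy as np

rng = np.random.default_rng(2)
n, h0, gamma = 500, 0.05, 0.8
b = rng.normal(0.0, 1.0, n)                   # shared random intercepts
visits = np.arange(5.0)                       # annual measurement times
marker = (2.0 + b[:, None] + 0.3 * visits
          + rng.normal(0.0, 0.5, (n, visits.size)))

T = rng.exponential(1.0 / (h0 * np.exp(gamma * b)))   # latent event times
C = rng.uniform(2.0, 8.0, n)                  # censoring times
obs_time, event = np.minimum(T, C), T <= C
print("observed event rate:", event.mean())
```

Fitting such a model requires integrating the likelihood over b, which is exactly the step that INLA approximates deterministically instead of by MCMC.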
{"title":"Fast and flexible inference for joint models of multivariate longitudinal and survival data using integrated nested Laplace approximations.","authors":"Denis Rustand, Janet van Niekerk, Elias Teixeira Krainski, Håvard Rue, Cécile Proust-Lima","doi":"10.1093/biostatistics/kxad019","DOIUrl":"10.1093/biostatistics/kxad019","url":null,"abstract":"<p><p>Modeling longitudinal and survival data jointly offers many advantages such as addressing measurement error and missing data in the longitudinal processes, understanding and quantifying the association between the longitudinal markers and the survival events, and predicting the risk of events based on the longitudinal markers. A joint model involves multiple submodels (one for each longitudinal/survival outcome) usually linked together through correlated or shared random effects. Their estimation is computationally expensive (particularly due to a multidimensional integration of the likelihood over the random effects distribution) so that inference methods become rapidly intractable, and restricts applications of joint models to a small number of longitudinal markers and/or random effects. We introduce a Bayesian approximation based on the integrated nested Laplace approximation algorithm implemented in the R package R-INLA to alleviate the computational burden and allow the estimation of multivariate joint models with fewer restrictions. Our simulation studies show that R-INLA substantially reduces the computation time and the variability of the parameter estimates compared with alternative estimation strategies. We further apply the methodology to analyze five longitudinal markers (3 continuous, 1 count, 1 binary, and 16 random effects) and competing risks of death and transplantation in a clinical trial on primary biliary cholangitis. R-INLA provides a fast and reliable inference technique for applying joint models to the complex multivariate data encountered in health research.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":" ","pages":"429-448"},"PeriodicalIF":2.1,"publicationDate":"2024-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11017128/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10301894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Practical causal mediation analysis: extending nonparametric estimators to accommodate multiple mediators and multiple intermediate confounders.
Pub Date: 2024-04-05 DOI: 10.1093/biostatistics/kxae012
Kara E Rudolph, Nicholas T Williams, Ivan Diaz
Mediation analysis is appealing for its ability to improve understanding of the mechanistic drivers of causal effects, but real-world data complexities challenge its successful implementation, including (i) the existence of post-exposure variables that also affect mediators and outcomes (thus confounding the mediator–outcome relationship), which may themselves be (ii) multivariate, and (iii) the existence of multivariate mediators. All three challenges are present in the mediation analysis we consider here, where our goal is to estimate the indirect effects of receiving a Section 8 housing voucher as a young child on the risk of developing a psychiatric mood disorder in adolescence, operating through mediators related to neighborhood poverty, the school environment, and instability of the neighborhood and school environments, considered together and separately. Interventional direct and indirect effects (IDE/IIE) accommodate post-exposure variables that confound the mediator–outcome relationship, but currently no readily implementable nonparametric estimator of the IDE/IIE exists that allows for both multivariate mediators and multivariate post-exposure intermediate confounders. This absence is a significant limitation for real-world analyses, because when each mediator subgroup is considered separately, the remaining mediator subgroups (or a subset of them) become post-exposure intermediate confounders. We address this gap by extending a recently developed nonparametric estimator of the IDE/IIE to easily incorporate multivariate mediators and multivariate post-exposure confounders simultaneously. We apply the proposed estimation approach to our analysis, including walking through a strategy to account for other, possibly co-occurring intermediate variables when considering each mediator subgroup separately.
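For readers unfamiliar with interventional effects: for an exposure contrast $a$ versus $a^*$, one common formulation (generic notation, not necessarily the exact estimand of this paper) replaces the mediator with a random draw from its counterfactual distribution, marginalized over the intermediate confounders:

$$\mathrm{IIE} = E\{Y(a, G_{M\mid a})\} - E\{Y(a, G_{M\mid a^*})\}, \qquad \mathrm{IDE} = E\{Y(a, G_{M\mid a^*})\} - E\{Y(a^*, G_{M\mid a^*})\},$$

where $G_{M\mid a}$ denotes a random draw from the distribution of the mediator $M$ under exposure $a$. Because $G_{M\mid a}$ is drawn marginally rather than set to the individual counterfactual $M(a)$, these effects remain identifiable even when post-exposure variables confound the mediator–outcome relationship.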
{"title":"Practical causal mediation analysis: extending nonparametric estimators to accommodate multiple mediators and multiple intermediate confounders","authors":"Kara E Rudolph, Nicholas T Williams, Ivan Diaz","doi":"10.1093/biostatistics/kxae012","DOIUrl":"https://doi.org/10.1093/biostatistics/kxae012","url":null,"abstract":"Mediation analysis is appealing for its ability to improve understanding of the mechanistic drivers of causal effects, but real-world data complexities challenge its successful implementation, including (i) the existence of post-exposure variables that also affect mediators and outcomes (thus, confounding the mediator-outcome relationship), that may also be (ii) multivariate, and (iii) the existence of multivariate mediators. All three challenges are present in the mediation analysis we consider here, where our goal is to estimate the indirect effects of receiving a Section 8 housing voucher as a young child on the risk of developing a psychiatric mood disorder in adolescence that operate through mediators related to neighborhood poverty, the school environment, and instability of the neighborhood and school environments, considered together and separately. Interventional direct and indirect effects (IDE/IIE) accommodate post-exposure variables that confound the mediator–outcome relationship, but currently, no readily implementable nonparametric estimator for IDE/IIE exists that allows for both multivariate mediators and multivariate post-exposure intermediate confounders. The absence of such an IDE/IIE estimator that can easily accommodate both multivariate mediators and post-exposure confounders represents a significant limitation for real-world analyses, because when considering each mediator subgroup separately, the remaining mediator subgroups (or a subset of them) become post-exposure intermediate confounders. We address this gap by extending a recently developed nonparametric estimator for the IDE/IIE to allow for easy incorporation of multivariate mediators and multivariate post-exposure confounders simultaneously. We apply the proposed estimation approach to our analysis, including walking through a strategy to account for other, possibly co-occurring intermediate variables when considering each mediator subgroup separately.","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":"98 1","pages":""},"PeriodicalIF":2.1,"publicationDate":"2024-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140579180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scalable kernel balancing weights in a nationwide observational study of hospital profit status and heart attack outcomes.
Pub Date: 2023-12-21 DOI: 10.1093/biostatistics/kxad032
Kwangho Kim, Bijan A Niknam, José R Zubizarreta
Weighting is a general and often-used method for statistical adjustment. Weighting has two objectives: first, to balance covariate distributions, and second, to ensure that the weights have minimal dispersion and thus produce a more stable estimator. A recent, increasingly common approach directly optimizes the weights toward these two objectives. However, this approach has not yet been feasible in large-scale datasets when investigators wish to flexibly balance general basis functions in an extended feature space. To address this practical problem, we describe a scalable and flexible approach to weighting that integrates a basis expansion in a reproducing kernel Hilbert space with state-of-the-art convex optimization techniques. Specifically, we use the rank-restricted Nyström method to efficiently compute a kernel basis for balancing in nearly linear time and space, and then use the specialized first-order alternating direction method of multipliers to rapidly find the optimal weights. In an extensive simulation study, we provide new insights into the performance of weighting estimators in large datasets, showing that the proposed approach substantially outperforms others in terms of accuracy and speed. Finally, we use this weighting approach to conduct a national study of the relationship between hospital profit status and heart attack outcomes in a comprehensive dataset of 1.27 million patients. We find that for-profit hospitals use interventional cardiology to treat heart attacks at similar rates as other hospitals but have higher mortality and readmission rates.
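A sketch of the rank-restricted Nyström step alone (the ADMM solve for the weights is omitted); the landmark count m, retained rank r, and RBF bandwidth are illustrative assumptions.

```python
# Rank-restricted Nystrom features: sample m landmark rows, eigendecompose
# the m x m landmark kernel, and keep the top r eigenpairs. The resulting
# n x r feature matrix Phi satisfies Phi @ Phi.T ~ K (the full kernel),
# so balancing Phi's columns approximately balances the kernel basis.
import numpy as np

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 5))                # covariate matrix
m, r = 100, 20                                # landmarks, retained rank
idx = rng.choice(len(X), size=m, replace=False)
Knm = rbf(X, X[idx])                          # n x m cross-kernel
w, V = np.linalg.eigh(rbf(X[idx], X[idx]))    # landmark kernel eigenpairs
top = np.argsort(w)[-r:]                      # indices of r largest eigenvalues
Phi = Knm @ V[:, top] / np.sqrt(w[top])       # n x r Nystrom feature map
print(Phi.shape)
```

The payoff is cost: the full kernel matrix is n x n, while this construction touches only n x m entries, which is what makes balancing feasible at the 1.27-million-patient scale.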
{"title":"Scalable kernel balancing weights in a nationwide observational study of hospital profit status and heart attack outcomes","authors":"Kwangho Kim, Bijan A Niknam, José R Zubizarreta","doi":"10.1093/biostatistics/kxad032","DOIUrl":"https://doi.org/10.1093/biostatistics/kxad032","url":null,"abstract":"Summary Weighting is a general and often-used method for statistical adjustment. Weighting has two objectives: first, to balance covariate distributions, and second, to ensure that the weights have minimal dispersion and thus produce a more stable estimator. A recent, increasingly common approach directly optimizes the weights toward these two objectives. However, this approach has not yet been feasible in large-scale datasets when investigators wish to flexibly balance general basis functions in an extended feature space. To address this practical problem, we describe a scalable and flexible approach to weighting that integrates a basis expansion in a reproducing kernel Hilbert space with state-of-the-art convex optimization techniques. Specifically, we use the rank-restricted Nyström method to efficiently compute a kernel basis for balancing in nearly linear time and space, and then use the specialized first-order alternating direction method of multipliers to rapidly find the optimal weights. In an extensive simulation study, we provide new insights into the performance of weighting estimators in large datasets, showing that the proposed approach substantially outperforms others in terms of accuracy and speed. Finally, we use this weighting approach to conduct a national study of the relationship between hospital profit status and heart attack outcomes in a comprehensive dataset of 1.27 million patients. We find that for-profit hospitals use interventional cardiology to treat heart attacks at similar rates as other hospitals but have higher mortality and readmission rates.","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":"28 1","pages":""},"PeriodicalIF":2.1,"publicationDate":"2023-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138823499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An imputation approach for a time-to-event analysis subject to missing outcomes due to noncoverage in disease registries.
Pub Date: 2023-12-15 DOI: 10.1093/biostatistics/kxac049
Joanna H Shih, Paul S Albert, Jason Fine, Danping Liu
Disease incidence data in a nationwide cohort study would ideally be obtained through a national disease registry. Unfortunately, no such registry currently exists in the United States. Instead, the results from individual state registries need to be combined to ascertain disease diagnoses across the United States. The National Cancer Institute has initiated a program to assemble all state registries to provide a complete assessment of all cancers in the United States. Unfortunately, not all registries have agreed to participate. In this article, we develop an imputation-based approach that uses self-reported cancer diagnoses from longitudinally collected questionnaires to impute cancer incidence not covered by the combined registry. We propose a two-step procedure. In the first step, a mover-stayer model is used to impute a participant's registry coverage status, which is reported only at the times of the questionnaires (given at 10-year intervals) and at the time of the last known vital status or death. In the second step, we propose a semiparametric working model, fit to an imputed coverage-area sample identified by the mover-stayer model, to impute registry-based survival outcomes for participants in areas not covered by the registry. Simulation studies show that the approach performs well compared with alternative ad hoc approaches to this problem. We illustrate the methodology with an analysis linking the United States Radiologic Technologists study cohort with the combined registry, which includes 32 of the 50 states.
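A toy mover-stayer process, for intuition about the step-one model: "stayers" keep their baseline coverage state at every wave, while "movers" follow a two-state Markov chain. The stayer fraction, transition matrix, and wave count below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
n, waves, pi_stay = 1000, 4, 0.6
stayer = rng.random(n) < pi_stay              # never change coverage state
state = rng.random(n) < 0.7                   # True = covered by registry
P = np.array([[0.9, 0.1],                     # row 0: transitions from uncovered
              [0.2, 0.8]])                    # row 1: transitions from covered
coverage = [state.copy()]
for _ in range(waves - 1):
    movers = ~stayer
    p_next_covered = P[state[movers].astype(int), 1]   # P(covered next wave)
    state = state.copy()
    state[movers] = rng.random(movers.sum()) < p_next_covered
    coverage.append(state.copy())
print([round(c.mean(), 3) for c in coverage]) # coverage rate at each wave
```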
{"title":"An imputation approach for a time-to-event analysis subject to missing outcomes due to noncoverage in disease registries.","authors":"Joanna H Shih, Paul S Albert, Jason Fine, Danping Liu","doi":"10.1093/biostatistics/kxac049","DOIUrl":"10.1093/biostatistics/kxac049","url":null,"abstract":"<p><p>Disease incidence data in a national-based cohort study would ideally be obtained through a national disease registry. Unfortunately, no such registry currently exists in the United States. Instead, the results from individual state registries need to be combined to ascertain certain disease diagnoses in the United States. The National Cancer Institute has initiated a program to assemble all state registries to provide a complete assessment of all cancers in the United States. Unfortunately, not all registries have agreed to participate. In this article, we develop an imputation-based approach that uses self-reported cancer diagnosis from longitudinally collected questionnaires to impute cancer incidence not covered by the combined registry. We propose a two-step procedure, where in the first step a mover-stayer model is used to impute a participant's registry coverage status when it is only reported at the time of the questionnaires given at 10-year intervals and the time of the last-alive vital status and death. In the second step, we propose a semiparametric working model, fit using an imputed coverage area sample identified from the mover-stayer model, to impute registry-based survival outcomes for participants in areas not covered by the registry. The simulation studies show the approach performs well as compared with alternative ad hoc approaches for dealing with this problem. We illustrate the methodology with an analysis that links the United States Radiologic Technologists study cohort with the combined registry that includes 32 of the 50 states.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":" ","pages":"117-133"},"PeriodicalIF":2.1,"publicationDate":"2023-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10939403/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10751930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Inference after latent variable estimation for single-cell RNA sequencing data.
Pub Date: 2023-12-15 DOI: 10.1093/biostatistics/kxac047
Anna Neufeld, Lucy L Gao, Joshua Popp, Alexis Battle, Daniela Witten
In the analysis of single-cell RNA sequencing data, researchers often characterize the variation between cells by estimating a latent variable, such as cell type or pseudotime, representing some aspect of the cell's state. They then test each gene for association with the estimated latent variable. If the same data are used for both of these steps, then standard methods for computing p-values in the second step will fail to achieve statistical guarantees such as Type 1 error control. Furthermore, approaches such as sample splitting that can be applied to solve similar problems in other settings are not applicable in this context. In this article, we introduce count splitting, a flexible framework that allows us to carry out valid inference in this setting, for virtually any latent variable estimation technique and inference approach, under a Poisson assumption. We demonstrate the Type 1 error control and power of count splitting in a simulation study and apply count splitting to a data set of pluripotent stem cells differentiating to cardiomyocytes.
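The thinning fact behind count splitting is easy to verify empirically: if $X \sim$ Poisson$(\mu)$ and $X_{\rm train} \mid X \sim$ Binomial$(X, \varepsilon)$, then $X_{\rm train}$ and $X_{\rm test} = X - X_{\rm train}$ are independent Poisson$(\varepsilon\mu)$ and Poisson$((1-\varepsilon)\mu)$. In the sketch below, $\varepsilon = 0.5$ and the matrix dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)
eps = 0.5
X = rng.poisson(5.0, size=(200, 1000))        # cells x genes count matrix
X_train = rng.binomial(X, eps)                # use to estimate the latent variable
X_test = X - X_train                          # use to test gene associations
print(np.corrcoef(X_train.ravel(), X_test.ravel())[0, 1])   # approx. 0
```

Because the two halves are independent, p-values computed on X_test after estimating cell type or pseudotime on X_train retain Type 1 error control, which naive data reuse does not.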
{"title":"Inference after latent variable estimation for single-cell RNA sequencing data.","authors":"Anna Neufeld, Lucy L Gao, Joshua Popp, Alexis Battle, Daniela Witten","doi":"10.1093/biostatistics/kxac047","DOIUrl":"10.1093/biostatistics/kxac047","url":null,"abstract":"<p><p>In the analysis of single-cell RNA sequencing data, researchers often characterize the variation between cells by estimating a latent variable, such as cell type or pseudotime, representing some aspect of the cell's state. They then test each gene for association with the estimated latent variable. If the same data are used for both of these steps, then standard methods for computing p-values in the second step will fail to achieve statistical guarantees such as Type 1 error control. Furthermore, approaches such as sample splitting that can be applied to solve similar problems in other settings are not applicable in this context. In this article, we introduce count splitting, a flexible framework that allows us to carry out valid inference in this setting, for virtually any latent variable estimation technique and inference approach, under a Poisson assumption. We demonstrate the Type 1 error control and power of count splitting in a simulation study and apply count splitting to a data set of pluripotent stem cells differentiating to cardiomyocytes.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":" ","pages":"270-287"},"PeriodicalIF":1.8,"publicationDate":"2023-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10130652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Differences in set-based tests for sparse alternatives when testing sets of outcomes compared to sets of explanatory factors in genetic association studies.
Pub Date: 2023-12-15 DOI: 10.1093/biostatistics/kxac036
Ryan Sun, Andy Shi, Xihong Lin
Set-based association tests are widely popular in genetic association settings for their ability to aggregate weak signals and reduce multiple testing burdens. In particular, a class of set-based tests including the Higher Criticism, Berk–Jones, and other statistics has recently been popularized for reaching a so-called detection boundary when signals are rare and weak. Such tests have been applied in two subtly different settings: (a) associating a set of genetic variants with a single phenotype and (b) associating a single genetic variant with a set of phenotypes. A significant issue in practice is the choice of test, especially when deciding between innovated-type and generalized-type methods for detection boundary tests; conflicting guidance is present in the literature. This work describes how correlation structures generate marked differences in relative operating characteristics between settings (a) and (b). The implications for study design are significant. We also develop novel power bounds that facilitate the aforementioned calculations and allow for analysis of individual testing settings. In more concrete terms, our investigation is motivated by translational expression quantitative trait loci (eQTL) studies in lung cancer. These studies involve both testing for groups of variants associated with a single gene expression (multiple explanatory factors) and testing whether a single variant is associated with a group of gene expressions (multiple outcomes). Results are supported by a collection of simulation studies and illustrated through lung cancer eQTL examples.
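One member of this class, the Higher Criticism statistic, is shown below in a common "HC+" form; the truncation convention (here $p_{(i)} \ge 1/d$ and $i \le d/2$) varies across papers, so treat this as one standard variant rather than the paper's exact definition.

```python
import numpy as np

def higher_criticism(pvals):
    """HC+ statistic from a vector of p-values (Donoho-Jin style)."""
    d = len(pvals)
    p = np.clip(np.sort(pvals), 1e-12, 1 - 1e-12)   # sorted, guarded
    i = np.arange(1, d + 1)
    hc = np.sqrt(d) * (i / d - p) / np.sqrt(p * (1 - p))
    keep = (p >= 1.0 / d) & (i <= d // 2)           # HC+ truncation range
    return hc[keep].max()

rng = np.random.default_rng(6)
null_p = rng.uniform(size=1000)                     # global null
sparse_p = null_p.copy()
sparse_p[:10] = rng.uniform(0.0, 1e-4, 10)          # rare, strong signals
print(higher_criticism(null_p), higher_criticism(sparse_p))
```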
{"title":"Differences in set-based tests for sparse alternatives when testing sets of outcomes compared to sets of explanatory factors in genetic association studies.","authors":"Ryan Sun, Andy Shi, Xihong Lin","doi":"10.1093/biostatistics/kxac036","DOIUrl":"10.1093/biostatistics/kxac036","url":null,"abstract":"<p><p>Set-based association tests are widely popular in genetic association settings for their ability to aggregate weak signals and reduce multiple testing burdens. In particular, a class of set-based tests including the Higher Criticism, Berk-Jones, and other statistics have recently been popularized for reaching a so-called detection boundary when signals are rare and weak. Such tests have been applied in two subtly different settings: (a) associating a genetic variant set with a single phenotype and (b) associating a single genetic variant with a phenotype set. A significant issue in practice is the choice of test, especially when deciding between innovated and generalized type methods for detection boundary tests. Conflicting guidance is present in the literature. This work describes how correlation structures generate marked differences in relative operating characteristics for settings (a) and (b). The implications for study design are significant. We also develop novel power bounds that facilitate the aforementioned calculations and allow for analysis of individual testing settings. In more concrete terms, our investigation is motivated by translational expression quantitative trait loci (eQTL) studies in lung cancer. These studies involve both testing for groups of variants associated with a single gene expression (multiple explanatory factors) and testing whether a single variant is associated with a group of gene expressions (multiple outcomes). Results are supported by a collection of simulation studies and illustrated through lung cancer eQTL examples.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":" ","pages":"171-187"},"PeriodicalIF":1.8,"publicationDate":"2023-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10724113/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9632716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multilayer Exponential Family Factor models for integrative analysis and learning disease progression.
Pub Date: 2023-12-15 DOI: 10.1093/biostatistics/kxac042
Qinxia Wang, Yuanjia Wang
Current diagnosis of neurological disorders often relies on late-stage clinical symptoms, which poses barriers to developing effective interventions at the premanifest stage. Recent research suggests that biomarkers and subtle changes in clinical markers may occur in a time-ordered fashion and can serve as indicators of early disease. In this article, we tackle the challenge of leveraging multidomain markers to learn the early disease progression of neurological disorders. We propose to integrate heterogeneous types of measures from multiple domains (e.g., discrete clinical symptoms, ordinal cognitive markers, continuous neuroimaging, and blood biomarkers) using a hierarchical Multilayer Exponential Family Factor (MEFF) model, where the observations follow exponential family distributions with lower-dimensional latent factors. The latent factors are decomposed into factors shared across multiple domains and domain-specific factors: the shared factors provide robust information for extensive phenotyping and for partitioning patients into clinically meaningful and biologically homogeneous subgroups, while the domain-specific factors capture the remaining unique variation within each domain. The MEFF model also captures nonlinear trajectories of disease progression and orders the critical events of neurodegeneration measured by each marker. To overcome computational challenges, we fit our model by approximate inference techniques for large-scale data. We apply the developed method to Parkinson's Progression Markers Initiative data to integrate biological, clinical, and cognitive markers arising from heterogeneous distributions. The model learns lower-dimensional representations of Parkinson's disease (PD) and the temporal ordering of the neurodegeneration of PD.
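A generative sketch of the shared/specific factor structure (a Gaussian and a Poisson domain shown; MEFF allows any exponential family per domain): each domain's natural parameter is a shared-factor part plus a domain-specific part. All dimensions, loadings, and the 0.3 link scaling are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n, r_sh, r_sp = 300, 2, 1
Z = rng.normal(size=(n, r_sh))                # factors shared across domains
U1, U2 = rng.normal(size=(n, r_sp)), rng.normal(size=(n, r_sp))
S1, W1 = rng.normal(size=(20, r_sh)), rng.normal(size=(20, r_sp))
S2, W2 = rng.normal(size=(15, r_sh)), rng.normal(size=(15, r_sp))
# Domain 1: continuous markers, identity link (Gaussian noise).
X1 = Z @ S1.T + U1 @ W1.T + rng.normal(0, 0.1, (n, 20))
# Domain 2: count markers, log link (Poisson).
X2 = rng.poisson(np.exp(0.3 * (Z @ S2.T + U2 @ W2.T)))
print(X1.shape, X2.shape)                     # Z drives both domains jointly
```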
{"title":"Multilayer Exponential Family Factor models for integrative analysis and learning disease progression.","authors":"Qinxia Wang, Yuanjia Wang","doi":"10.1093/biostatistics/kxac042","DOIUrl":"10.1093/biostatistics/kxac042","url":null,"abstract":"<p><p>Current diagnosis of neurological disorders often relies on late-stage clinical symptoms, which poses barriers to developing effective interventions at the premanifest stage. Recent research suggests that biomarkers and subtle changes in clinical markers may occur in a time-ordered fashion and can be used as indicators of early disease. In this article, we tackle the challenges to leverage multidomain markers to learn early disease progression of neurological disorders. We propose to integrate heterogeneous types of measures from multiple domains (e.g., discrete clinical symptoms, ordinal cognitive markers, continuous neuroimaging, and blood biomarkers) using a hierarchical Multilayer Exponential Family Factor (MEFF) model, where the observations follow exponential family distributions with lower-dimensional latent factors. The latent factors are decomposed into shared factors across multiple domains and domain-specific factors, where the shared factors provide robust information to perform extensive phenotyping and partition patients into clinically meaningful and biologically homogeneous subgroups. Domain-specific factors capture remaining unique variations for each domain. The MEFF model also captures nonlinear trajectory of disease progression and orders critical events of neurodegeneration measured by each marker. To overcome computational challenges, we fit our model by approximate inference techniques for large-scale data. We apply the developed method to Parkinson's Progression Markers Initiative data to integrate biological, clinical, and cognitive markers arising from heterogeneous distributions. The model learns lower-dimensional representations of Parkinson's disease (PD) and the temporal ordering of the neurodegeneration of PD.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":" ","pages":"203-219"},"PeriodicalIF":2.1,"publicationDate":"2023-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10939400/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10653302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spatiotemporal varying coefficient model for respiratory disease mapping in Taiwan.
Pub Date: 2023-12-15 DOI: 10.1093/biostatistics/kxac046
Feifei Wang, Congyuan Duan, Yang Li, Hui Huang, Ben-Chang Shia
Respiratory diseases have long been a global public health problem. In recent years, air pollutants have drawn much attention as important risk factors. In this study, we investigate the influence of PM$_{2.5}$ (particulate matter less than 2.5 $\mu$m in diameter) on hospital visit rates for respiratory diseases in Taiwan. To reveal the spatiotemporal pattern of the data, we propose a Bayesian disease mapping model with spatially varying coefficients and a parametric temporal trend. Model fitting is conducted using the integrated nested Laplace approximation, a widely applied technique for large-scale data sets due to its high computational efficiency. The finite sample performance of the proposed method is studied through a series of simulations, which demonstrate that the proposed model improves both parameter estimation and prediction. We apply the proposed model to respiratory disease data in 328 third-level administrative regions of Taiwan and find significant associations between hospital visit rates and PM$_{2.5}$.
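A generative sketch of the varying-coefficient structure: region $i$, year $t$, with visits$_{it} \sim$ Poisson$(E_i \exp(a_i + \beta_i \,\mathrm{pm25}_{it} + \delta t))$. The spatial smoothing of $\beta_i$ (e.g., a CAR prior) and the INLA fit are omitted, and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
R, T = 50, 10                                 # regions, years
a = rng.normal(-1.0, 0.2, R)                  # region-level log-rate intercepts
beta = rng.normal(0.02, 0.005, R)             # spatially varying PM2.5 effect
delta = 0.01                                  # parametric (linear) time trend
pm25 = rng.gamma(4.0, 8.0, size=(R, T))       # exposure in ug/m^3
E = rng.uniform(500, 5000, R)                 # expected counts (population offset)
t = np.arange(T)
eta = a[:, None] + beta[:, None] * pm25 + delta * t
visits = rng.poisson(E[:, None] * np.exp(eta))
print(visits.shape)                           # R x T matrix of visit counts
```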
{"title":"Spatiotemporal varying coefficient model for respiratory disease mapping in Taiwan.","authors":"Feifei Wang, Congyuan Duan, Yang Li, Hui Huang, Ben-Chang Shia","doi":"10.1093/biostatistics/kxac046","DOIUrl":"10.1093/biostatistics/kxac046","url":null,"abstract":"<p><p>Respiratory diseases have been global public health problems for a long time. In recent years, air pollutants as important risk factors have drawn lots of attention. In this study, we investigate the influence of $pm2.5$ (particulate matters in diameter less than 2.5 ${rm{mu }} m$) on hospital visit rates for respiratory diseases in Taiwan. To reveal the spatiotemporal pattern of data, we propose a Bayesian disease mapping model with spatially varying coefficients and a parametric temporal trend. Model fitting is conducted using the integrated nested Laplace approximation, which is a widely applied technique for large-scale data sets due to its high computational efficiency. The finite sample performance of the proposed method is studied through a series of simulations. As demonstrated by simulations, the proposed model can improve both the parameter estimation performance and the prediction performance. We apply the proposed model on the respiratory disease data in 328 third-level administrative regions in Taiwan and find significant associations between hospital visit rates and $pm2.5$.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":" ","pages":"40-56"},"PeriodicalIF":2.1,"publicationDate":"2023-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10371272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}