Biostatistics: Latest Articles

Addressing the mean-variance relationship in spatially resolved transcriptomics data with spoon.
IF 1.8; Zone 3 (Mathematics); Q3 Mathematical & Computational Biology; Pub Date: 2024-12-31; DOI: 10.1093/biostatistics/kxaf012
Kinnary Shah, Boyi Guo, Stephanie C Hicks

An important task in the analysis of spatially resolved transcriptomics (SRT) data is to identify spatially variable genes (SVGs), or genes that vary in a 2D space. Current approaches rank SVGs based on either $P$-values or an effect size, such as the proportion of spatial variance. However, previous work in the analysis of RNA-sequencing data identified a technical bias with log-transformation, violating the "mean-variance relationship" of gene counts, where highly expressed genes are more likely to have a higher variance in counts but lower variance after log-transformation. Here, we demonstrate the mean-variance relationship in SRT data. Furthermore, we propose spoon, a statistical framework using empirical Bayes techniques to remove this bias, leading to more accurate prioritization of SVGs. We demonstrate the performance of spoon in both simulated and real SRT data. A software implementation of our method is available at https://bioconductor.org/packages/spoon.
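
The mean-variance relationship described in this abstract can be illustrated with a small simulation. The sketch below is not the spoon method; it assumes Poisson-distributed counts (an illustrative choice, not taken from the paper) and simply compares the variance of raw counts with the variance after a log1p transformation across genes with different mean expression. The function name `mean_variance_table` and all parameter values are hypothetical.

```python
import numpy as np

def mean_variance_table(means, n_spots=20000, seed=0):
    """Return (mean, variance of raw counts, variance of log1p counts) per gene."""
    rng = np.random.default_rng(seed)
    rows = []
    for mu in means:
        # Assumption: counts for one gene across spots are Poisson with mean mu.
        counts = rng.poisson(mu, size=n_spots)
        rows.append((mu, counts.var(), np.log1p(counts).var()))
    return rows

table = mean_variance_table([1, 10, 100, 1000])
for mu, var_raw, var_log in table:
    print(f"mean={mu:5d}  var(counts)={var_raw:9.2f}  var(log1p)={var_log:.4f}")
```

Under this assumed count model, the raw variance grows with the mean while the log-scale variance shrinks, which is the pattern the abstract refers to.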

Citations: 0
Probabilistic clustering using shared latent variable model for assessing Alzheimer's disease biomarkers.
IF 1.8; Zone 3 (Mathematics); Q3 Mathematical & Computational Biology; Pub Date: 2024-12-31; DOI: 10.1093/biostatistics/kxaf010
Yizhen Xu, Scott Zeger, Zheyu Wang

The preclinical stage of many neurodegenerative diseases can span decades before symptoms become apparent. Understanding the sequence of preclinical biomarker changes provides a critical opportunity for early diagnosis and effective intervention prior to significant loss of patients' brain functions. The main challenge to early detection lies in the absence of direct observation of the disease state and the considerable variability in both biomarkers and disease dynamics among individuals. Recent research hypothesized the existence of subgroups with distinct biomarker patterns due to co-morbidities and degrees of brain resilience. Our ability to diagnose early and intervene during the preclinical stage of neurodegenerative diseases will be enhanced by further insights into heterogeneity in the biomarker-disease relationship. In this article, we focus on Alzheimer's disease (AD) and attempt to identify the systematic patterns within the heterogeneous AD biomarker-disease cascade. Specifically, we quantify the disease progression using a dynamic latent variable whose mixture distribution represents patient subgroups. Model estimation uses Hamiltonian Monte Carlo with the number of clusters determined by the Bayesian Information Criterion. We report simulation studies that investigate the performance of the proposed model in finite sample settings that are similar to our motivating application. We apply the proposed model to the Biomarkers of Cognitive Decline Among Normal Individuals data, a longitudinal study that was conducted over 2 decades among individuals who were initially cognitively normal. Our application yields evidence consistent with the hypothetical model of biomarker dynamics presented in Jack Jr et al. In addition, our analysis identified 2 subgroups with distinct disease-onset patterns. Finally, we develop a dynamic prediction approach to improve the precision of prognoses.

Citations: 0
Model-based dimensionality reduction for single-cell RNA-seq using generalized bilinear models.
IF 2; Zone 3 (Mathematics); Q3 Mathematical & Computational Biology; Pub Date: 2024-12-31; DOI: 10.1093/biostatistics/kxaf024
Phillip B Nicol, Jeffrey W Miller

Dimensionality reduction is a critical step in the analysis of single-cell RNA-seq (scRNA-seq) data. The standard approach is to apply a transformation to the count matrix followed by principal components analysis (PCA). However, this approach can induce spurious heterogeneity and mask true biological variability. An alternative approach is to directly model the counts, but existing methods tend to be computationally intractable on large datasets and do not quantify uncertainty in the low-dimensional representation. To address these problems, we develop scGBM, a novel method for model-based dimensionality reduction of scRNA-seq data using a Poisson bilinear model. We introduce a fast estimation algorithm to fit the model using iteratively reweighted singular value decompositions, enabling the method to scale to datasets with millions of cells. Furthermore, scGBM quantifies the uncertainty in each cell's latent position and leverages these uncertainties to assess the confidence associated with a given cell clustering. On real and simulated single-cell data, we find that scGBM produces low-dimensional embeddings that better capture relevant biological information while removing unwanted variation.
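
For orientation, the "standard approach" that this abstract contrasts with scGBM (transform the count matrix, then run PCA) might look roughly like the sketch below. It is not scGBM itself. The log1p transformation, the toy Poisson counts, and the function name `log_pca` are assumptions chosen for illustration.

```python
import numpy as np

def log_pca(counts, n_components=2):
    """Transform-then-PCA baseline: log1p, center each gene, truncated SVD."""
    logged = np.log1p(counts.astype(float))          # rows are cells, columns are genes
    centered = logged - logged.mean(axis=0)          # center every gene (column)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]  # low-dimensional cell embeddings
    loadings = vt[:n_components].T                   # gene loadings (orthonormal columns)
    return scores, loadings

rng = np.random.default_rng(0)
toy_counts = rng.poisson(5.0, size=(50, 20))         # hypothetical 50 cells x 20 genes
scores, loadings = log_pca(toy_counts)
print(scores.shape, loadings.shape)
```

A model-based method such as the one in the abstract would replace the transform step with a likelihood on the counts; that part is not attempted here.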

Citations: 0
Network generalized estimating equations for complexly correlated data with applications to cluster randomized trials.
IF 2; Zone 3 (Mathematics); Q3 Mathematical & Computational Biology; Pub Date: 2024-12-31; DOI: 10.1093/biostatistics/kxaf039
Tom Chen, Fan Li, Rui Wang

Estimating parameters corresponding to mean outcomes and their intricate association structures in cluster randomized trials (CRTs) can pose significant methodological challenges. This paper introduces a novel framework that leverages network concepts to represent complex dependency structures and estimate these parameters using generalized estimating equations (GEE). We focus on modeling complex correlation structures by partitioning observations into potentially overlapping groups of interrelated data, where observations are assumed locally exchangeable within each group. This network GEE framework is inherently flexible, and we demonstrate its application to multiple exchangeable structures (simple, nested, block), moving average structures, and exponential decay structures. Furthermore, to address computational challenges arising in GEEs with large cluster sizes, we present the networkGEE R package, enabling the fitting of models beyond the capabilities of existing statistical software. The proposed methods are evaluated through extensive simulation studies. To illustrate their practical application, we analyze data from the Washington State Expedited Partners Therapy trial, a stepped-wedge CRT designed to assess the impact of a public health intervention aimed at reducing sexually transmitted infections through free patient-delivered partner therapy.
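
Two of the working correlation structures named in this abstract can be written down directly. The sketch below builds a simple exchangeable matrix and an exponential decay matrix for one cluster. It does not use the networkGEE package, and the decay parameterization `rho ** |i - j|` is an assumption made for illustration rather than the paper's definition.

```python
import numpy as np

def exchangeable(n, rho):
    """Simple exchangeable correlation: 1 on the diagonal, rho everywhere else."""
    return (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))

def exponential_decay(n, rho):
    """Correlation rho**|i - j| between positions i and j (assumed parameterization)."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

R_exch = exchangeable(4, 0.3)
R_decay = exponential_decay(4, 0.5)
print(R_exch)
print(R_decay)
```

In a GEE fit, a matrix like these would serve as the working correlation within each cluster; for large clusters the matrix is typically never formed explicitly, which is the computational issue the abstract mentions.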

Citations: 0
Multi-study R-learner for estimating heterogeneous treatment effects across studies using statistical machine learning.
IF 2; Zone 3 (Mathematics); Q3 Mathematical & Computational Biology; Pub Date: 2024-12-31; DOI: 10.1093/biostatistics/kxaf040
Cathy Shyr, Boyu Ren, Prasad Patil, Giovanni Parmigiani

Heterogeneous treatment effect (HTE) refers to the nonrandom, explainable variation in treatment effects for individuals in a population. HTE estimation is central to precision medicine, where accurate effect estimates can inform personalized treatment decisions. In practice, patients can present with covariate profiles that overlap with multiple studies, raising the challenge of optimally informing treatment decisions in a multi-study setting. We proposed a flexible statistical machine learning (ML) framework, the multi-study $R$-learner, that leverages multiple studies to estimate the HTE. Existing multi-study approaches often assume that study-specific (i) conditional average treatment effect (CATE), (ii) expected potential outcome under no treatment given covariates, and (iii) treatment assignment mechanism are identical across studies, but these assumptions may not hold in practice due to differences in study populations, protocols, or designs. To this end, we developed our framework to directly account for these three types of between-study heterogeneity. It builds upon recent advances in cross-study learning and uses a data-adaptive objective function to combine cross-study estimates of nuisance functions with study-specific CATEs via membership probabilities, which enable information to be borrowed across studies. The multi-study $R$-learner extends the $R$-learner to the multi-study setting and is flexible in its ability to incorporate ML techniques. In the series estimation framework, we showed that the proposed method is asymptotically normal and more efficient than the $R$-learner when there is between-study heterogeneity in the treatment assignment mechanisms. We illustrated using cancer data from randomized controlled trials and observational studies that the multi-study $R$-learner performs favorably in the presence of between-study heterogeneity.
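
As background for the single-study $R$-learner that this abstract extends, the sketch below shows the residual-on-residual idea in its simplest form: a constant treatment effect, one covariate, and a randomized treatment with known propensity 0.5. It is not the multi-study method, and it omits the cross-fitting normally used for the nuisance estimates. The simulated data, the linear outcome model, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_tau = 5000, 2.0
x = rng.normal(size=n)                       # one covariate
w = rng.binomial(1, 0.5, size=n)             # randomized treatment, propensity e(x) = 0.5
y = 1.5 * x + true_tau * w + rng.normal(size=n)

# Nuisance 1: m(x) = E[Y | X = x], fitted here by ordinary least squares.
design = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
m_hat = design @ beta

# Nuisance 2: the propensity score, known by design in this toy trial.
e_hat = np.full(n, 0.5)

# R-learner with a constant effect: regress outcome residuals on treatment residuals.
res_y, res_w = y - m_hat, w - e_hat
tau_hat = float(res_y @ res_w / (res_w @ res_w))
print(f"estimated effect: {tau_hat:.3f}")
```

With a covariate-dependent effect, the last step would become a regression of `res_y` on `res_w` times a flexible function of `x`, which is where ML methods enter.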

Citations: 0
Bayesian scalar-on-tensor regression using the Tucker decomposition for sparse spatial modeling.
IF 2; Zone 3 (Mathematics); Q3 Mathematical & Computational Biology; Pub Date: 2024-12-31; DOI: 10.1093/biostatistics/kxaf029
Daniel A Spencer, Rene Gutierrez, Rajarshi Guhaniyogi, Russell T Shinohara, Raquel Prado

Modeling with multidimensional arrays, or tensors, often presents a problem due to high dimensionality. In addition, these structures typically exhibit inherent sparsity, requiring the use of regularization methods to properly characterize an association between a tensor covariate and a scalar response. We propose a Bayesian method to efficiently model a scalar response with a tensor covariate using the Tucker tensor decomposition in order to retain the spatial relationship within a tensor coefficient, while reducing the number of parameters varying within the model and applying regularization methods. Simulated data are analyzed to compare the model to recently proposed methods. A neuroimaging analysis using data from the Alzheimer's Disease Neuroimaging Initiative shows improved inferential performance compared with other tensor regression methods.

Keywords: Bayesian analysis; tensor decomposition; image analysis; spatial statistics; statistical modeling.
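
The parameter saving from a Tucker-structured tensor coefficient can be made concrete. The sketch below rebuilds a 3-way coefficient from a small core and three factor matrices, forms the linear predictor as an inner product with a tensor covariate, and counts parameters. It is only the deterministic part of such a model: no priors or sampling are shown, and the dimensions, ranks, and function name `tucker_to_tensor` are hypothetical.

```python
import numpy as np

def tucker_to_tensor(core, factors):
    """Rebuild a 3-way tensor from a Tucker core and one factor matrix per mode."""
    u1, u2, u3 = factors
    return np.einsum("abc,ia,jb,kc->ijk", core, u1, u2, u3)

rng = np.random.default_rng(0)
dims, ranks = (10, 12, 8), (2, 3, 2)         # hypothetical tensor size and Tucker ranks
core = rng.normal(size=ranks)
factors = [rng.normal(size=(d, r)) for d, r in zip(dims, ranks)]
coef = tucker_to_tensor(core, factors)       # tensor coefficient, shape (10, 12, 8)

x = rng.normal(size=dims)                    # one tensor covariate
linear_predictor = float(np.sum(x * coef))   # inner product of covariate and coefficient

# Free entries in an unstructured coefficient versus the Tucker form.
n_full = int(np.prod(dims))
n_tucker = int(np.prod(ranks)) + sum(d * r for d, r in zip(dims, ranks))
print(n_full, n_tucker)
```

Here the unstructured coefficient has 960 entries while the Tucker form has 84 (12 in the core plus 72 in the factor matrices), before any identifiability constraints.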

Citations: 0
Regression and alignment for functional data and network topology.
IF 2; Zone 3 (Mathematics); Q3 Mathematical & Computational Biology; Pub Date: 2024-12-31; DOI: 10.1093/biostatistics/kxae026
Danni Tu, Julia Wrobel, Theodore D Satterthwaite, Jeff Goldsmith, Ruben C Gur, Raquel E Gur, Jan Gertheiss, Dani S Bassett, Russell T Shinohara

In the brain, functional connections form a network whose topological organization can be described by graph-theoretic network diagnostics. These include characterizations of the community structure, such as modularity and participation coefficient, which have been shown to change over the course of childhood and adolescence. To investigate if such changes in the functional network are associated with changes in cognitive performance during development, network studies often rely on an arbitrary choice of preprocessing parameters, in particular the proportional threshold of network edges. Because the choice of parameter can impact the value of the network diagnostic, and therefore downstream conclusions, we propose to circumvent that choice by conceptualizing the network diagnostic as a function of the parameter. As opposed to a single value, a network diagnostic curve describes the connectome topology at multiple scales, from the sparsest group of the strongest edges to the entire edge set. To relate these curves to executive function and other covariates, we use scalar-on-function regression, which is more flexible than previous functional data-based models used in network neuroscience. We then consider how systematic differences between networks can manifest in misalignment of diagnostic curves, and consequently propose a supervised curve alignment method that incorporates auxiliary information from other variables. Our algorithm performs both functional regression and alignment via an iterative, penalized, and nonlinear likelihood optimization. The illustrated method has the potential to improve the interpretability and generalizability of neuroscience studies where the goal is to study heterogeneity among a mixture of function- and scalar-valued measures.
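
The idea of a network diagnostic curve, a diagnostic evaluated over a grid of proportional thresholds rather than at one arbitrary threshold, can be sketched as follows. The abstract names modularity and participation coefficient; this example substitutes global transitivity because it needs only NumPy. The random connectivity matrix, the threshold grid, and the function names are assumptions made for illustration.

```python
import numpy as np

def proportional_threshold(weights, proportion):
    """Keep the strongest `proportion` of edges of a symmetric weight matrix (binary output)."""
    n = weights.shape[0]
    iu = np.triu_indices(n, k=1)
    edge_weights = weights[iu]
    n_keep = max(1, int(round(proportion * edge_weights.size)))
    keep = np.argsort(edge_weights)[::-1][:n_keep]   # indices of the strongest edges
    adj = np.zeros((n, n), dtype=int)
    adj[iu[0][keep], iu[1][keep]] = 1
    return adj + adj.T

def transitivity(adj):
    """Global clustering coefficient: closed triplets divided by all connected triplets."""
    deg = adj.sum(axis=1)
    triplets = np.sum(deg * (deg - 1))
    closed = np.trace(adj @ adj @ adj)
    return closed / triplets if triplets > 0 else 0.0

rng = np.random.default_rng(0)
raw = rng.random((8, 8))
weights = (raw + raw.T) / 2.0                        # hypothetical symmetric connectivity matrix
proportions = np.linspace(0.1, 1.0, 10)
curve = [transitivity(proportional_threshold(weights, p)) for p in proportions]
```

Each subject would contribute one such curve, which then acts as the functional predictor in a scalar-on-function regression.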

Citations: 0
Scalable randomized kernel methods for multiview data integration and prediction with application to Coronavirus disease.
IF 1.8; Zone 3 (Mathematics); Q3 Mathematical & Computational Biology; Pub Date: 2024-12-31; DOI: 10.1093/biostatistics/kxaf001
Sandra E Safo, Han Lu

There is still more to learn about the pathobiology of coronavirus disease (COVID-19) despite 4 years of the pandemic. A multiomics approach offers a comprehensive view of the disease and has the potential to yield deeper insight into the pathogenesis of the disease. Previous multiomics integrative analysis and prediction studies for COVID-19 severity and status have assumed simple relationships (i.e., linear relationships) between omics data and between omics and COVID-19 outcomes. However, these linear methods do not account for the inherent underlying nonlinear structure associated with these different types of data. The motivation behind this work is to model nonlinear relationships in multiomics and COVID-19 outcomes, and to determine key multidimensional molecules associated with the disease. Toward this goal, we develop scalable randomized kernel methods for jointly associating data from multiple sources or views and simultaneously predicting an outcome or classifying a unit into one of 2 or more classes. We also determine variables or groups of variables that best contribute to the relationships among the views. We use the idea that random Fourier bases can approximate shift-invariant kernel functions to construct nonlinear mappings of each view, and we use these mappings and the outcome variable to learn view-independent low-dimensional representations. We demonstrate the effectiveness of the proposed methods through extensive simulations. When the proposed methods were applied to gene expression, metabolomics, proteomics, and lipidomics data pertaining to COVID-19, we identified several molecular signatures for COVID-19 status and severity. Our results agree with previous findings and suggest potential avenues for future research. Our algorithms are implemented in PyTorch with an R interface and are available at: https://github.com/lasandrall/RandMVLearn.
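The abstract's central building block, approximating a shift-invariant kernel with random Fourier bases, can be illustrated with a minimal sketch. This is not the RandMVLearn implementation: the Gaussian kernel, the helper names (`make_rff_map`, `gaussian_kernel`, `dot`), the bandwidth `gamma`, and the example points are all assumptions chosen for illustration, and only the Python standard library is used.

```python
import math
import random


def make_rff_map(dim, n_features, gamma, seed=0):
    """Return a feature map phi such that dot(phi(x), phi(y)) approximates
    the Gaussian kernel exp(-gamma * ||x - y||^2). Hypothetical helper."""
    rng = random.Random(seed)
    # Frequencies are drawn from the kernel's spectral density, N(0, 2*gamma*I).
    scale = math.sqrt(2.0 * gamma)
    freqs = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(n_features)]
    # Random phases, uniform on [0, 2*pi].
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    norm = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [
            norm * math.cos(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
            for w, b in zip(freqs, phases)
        ]

    return feature_map


def gaussian_kernel(x, y, gamma):
    """Exact Gaussian (RBF) kernel, used here only as a reference value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


gamma = 0.5  # assumed bandwidth parameter
phi = make_rff_map(dim=3, n_features=4000, gamma=gamma, seed=0)
x = [0.2, -0.1, 0.4]
y = [0.0, 0.3, -0.2]
approx = dot(phi(x), phi(y))        # inner product of random features
exact = gaussian_kernel(x, y, gamma)  # value the inner product approximates
```

Because the kernel is replaced by an explicit finite-dimensional map, downstream models can work with the mapped features directly instead of an n-by-n kernel matrix, which is the usual source of scalability in this family of methods.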

Citations: 0
Mediation with External Summary Statistic Information.
IF 2, Zone 3 (Mathematics), Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2024-12-31, DOI: 10.1093/biostatistics/kxaf020
Jonathan Boss, Wei Hao, Amber Cathey, Barrett M Welch, Kelly K Ferguson, John D Meeker, Xiang Zhou, Jian Kang, Bhramar Mukherjee

Environmental health studies are increasingly measuring endogenous omics data ($\boldsymbol{M}$) to study intermediary biological pathways by which an exogenous exposure ($\boldsymbol{A}$) affects a health outcome ($\boldsymbol{Y}$), given confounders ($\boldsymbol{C}$). Mediation analysis is frequently performed to understand such mechanisms. If intermediary pathways are of interest, then there is likely literature establishing statistical and biological significance of the total effect, defined as the effect of $\boldsymbol{A}$ on $\boldsymbol{Y}$ given $\boldsymbol{C}$. For mediation models with continuous outcomes and mediators, we show that leveraging external summary-level information on the total effect can improve estimation efficiency of the direct and indirect effects. Moreover, the efficiency gain depends on the asymptotic partial $R^{2}$ between the outcome ($\boldsymbol{Y}\mid\boldsymbol{M},\boldsymbol{A},\boldsymbol{C}$) and total effect ($\boldsymbol{Y}\mid\boldsymbol{A},\boldsymbol{C}$) models, with smaller (larger) values benefiting direct (indirect) effect estimation. We propose a robust data-adaptive estimation procedure, Mediation with External Summary Statistic Information, to improve estimation efficiency in settings with congenial external information, while simultaneously protecting against bias in settings with incongenial external information. In congenial simulation scenarios, we observe relative efficiency gains for mediation effect estimation of up to 40%. We illustrate our methodology using data from the Puerto Rico Testsite for Exploring Contamination Threats, where Cytochrome p450 metabolites are hypothesized to mediate the effect of phthalate exposure on gestational age at delivery. External summary information on the total effect comes from a recently published pooled analysis of 16 studies. The proposed framework blends mediation analysis with emerging data integration techniques.
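To see why information on the total effect is informative about the direct and indirect effects, it may help to look at the simplest linear case. The sketch below is not the authors' estimator and uses no external summary statistics. It assumes a single continuous mediator, no confounders, simulated data with made-up coefficients (0.8, 0.5, 0.6), and ordinary least squares, and it shows that the fitted total effect equals the fitted direct effect plus the product-of-coefficients indirect effect.

```python
import random


def cov(u, v):
    """Sample covariance (divisor n); cov(u, u) is the sample variance."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / len(u)


# Simulated data under assumed linear models (all coefficients are illustrative).
rng = random.Random(1)
n = 2000
A = [rng.gauss(0, 1) for _ in range(n)]                                # exposure
M = [0.8 * a + rng.gauss(0, 1) for a in A]                             # mediator model
Y = [0.5 * a + 0.6 * m + rng.gauss(0, 1) for a, m in zip(A, M)]        # outcome model

s_aa, s_mm, s_am = cov(A, A), cov(M, M), cov(A, M)
s_ay, s_my = cov(A, Y), cov(M, Y)

# Total effect model: least-squares slope of Y on A.
total = s_ay / s_aa

# Mediator model: least-squares slope of M on A.
alpha = s_am / s_aa

# Outcome model: least-squares slopes of Y on (A, M), two-predictor closed form.
det = s_aa * s_mm - s_am ** 2
direct = (s_mm * s_ay - s_am * s_my) / det    # coefficient on A
theta_m = (s_aa * s_my - s_am * s_ay) / det   # coefficient on M

# Product-of-coefficients indirect effect.
indirect = alpha * theta_m
```

In this setting `total` and `direct + indirect` agree up to floating-point error, so a precise external estimate of the total effect constrains the sum of the two mediation effects. The abstract's procedure additionally handles confounders, multiple mediators, and external information that may be incongenial, none of which this sketch attempts.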

Citations: 0
Semiparametric efficient estimation of small genetic effects in large-scale population cohorts.
IF 2, Zone 3 (Mathematics), Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2024-12-31, DOI: 10.1093/biostatistics/kxaf030
Olivier Labayle, Breeshey Roskams-Hieter, Joshua Slaughter, Kelsey Tetley-Campbell, Mark J van der Laan, Chris P Ponting, Sjoerd V Beentjes, Ava Khamseh

Population genetics seeks to quantify DNA variant associations with traits or diseases, as well as interactions among variants and with environmental factors. Computing millions of estimates in large cohorts in which small effect sizes and tight confidence intervals are expected necessitates minimizing model-misspecification bias to increase power and control false discoveries. We present TarGene, a unified statistical workflow for the semi-parametric efficient and doubly robust estimation of genetic effects including $k$-point interactions among categorical variables in the presence of confounding and weak population dependence. $k$-point interactions, or Average Interaction Effects (AIEs), are a direct generalization of the usual average treatment effect (ATE). We estimate genetic effects with cross-validated and/or weighted versions of Targeted Minimum Loss-based Estimators (TMLE) and One-Step Estimators (OSE). The effect of dependence among data units on variance estimates is corrected by using sieve plateau variance estimators based on genetic relatedness across the units. We present extensive realistic simulations to demonstrate power, coverage, and control of type I error. Our motivating application is the targeted estimation of genetic effects on traits, including two-point and higher-order gene-gene and gene-environment interactions, in large-scale genomic databases such as UK Biobank and All of Us. All cross-validated and/or weighted TMLE and OSE for the AIE $k$-point interaction, as well as ATEs, conditional ATEs and functions thereof, are implemented in the general purpose Julia package TMLE.jl. For high-throughput applications in population genomics, we provide the open-source Nextflow pipeline and software TarGene which integrates seamlessly with modern high-performance and cloud computing platforms.
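As a rough illustration of how a two-point interaction generalizes the ATE, the sketch below computes the contrast E[Y(1,1)] - E[Y(1,0)] - E[Y(0,1)] + E[Y(0,0)] for two binary variables using plain cell means. It is an assumption-laden toy, not TarGene or TMLE.jl: the two variables are simulated as independent and unconfounded, the outcome coefficients are invented, the helper names are hypothetical, and a cell-mean plug-in stands in for the TMLE and OSE estimators named in the abstract.

```python
import random


def cell_mean(ys, t1s, t2s, a, b):
    """Mean outcome among units with T1 == a and T2 == b."""
    vals = [y for y, u, v in zip(ys, t1s, t2s) if u == a and v == b]
    return sum(vals) / len(vals)


def two_point_interaction(ys, t1s, t2s):
    """Plug-in contrast m(1,1) - m(1,0) - m(0,1) + m(0,0) of cell means.
    Only interpretable causally under assumptions such as no confounding."""

    def m(a, b):
        return cell_mean(ys, t1s, t2s, a, b)

    return m(1, 1) - m(1, 0) - m(0, 1) + m(0, 0)


# Simulated data: two independent binary variables and an outcome whose
# interaction coefficient is 0.4 (all values are illustrative assumptions).
rng = random.Random(2)
n = 20000
T1 = [rng.randint(0, 1) for _ in range(n)]
T2 = [rng.randint(0, 1) for _ in range(n)]
Y = [1.0 + 0.3 * a + 0.2 * b + 0.4 * a * b + rng.gauss(0, 1) for a, b in zip(T1, T2)]

aie_hat = two_point_interaction(Y, T1, T2)
```

With a single binary variable the same construction reduces to a difference of two means, which is the ATE in this unconfounded toy setting. The main effects (0.3 and 0.2) cancel in the four-term contrast, leaving only the interaction term.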

Citations: 0