Pub Date: 2024-12-31 DOI: 10.1093/biostatistics/kxaf012
Kinnary Shah, Boyi Guo, Stephanie C Hicks
An important task in the analysis of spatially resolved transcriptomics (SRT) data is to identify spatially variable genes (SVGs), or genes that vary in a 2D space. Current approaches rank SVGs based on either $ P $-values or an effect size, such as the proportion of spatial variance. However, previous work in the analysis of RNA-sequencing data identified a technical bias with log-transformation, violating the "mean-variance relationship" of gene counts, where highly expressed genes are more likely to have a higher variance in counts but lower variance after log-transformation. Here, we demonstrate the mean-variance relationship in SRT data. Furthermore, we propose spoon, a statistical framework using empirical Bayes techniques to remove this bias, leading to more accurate prioritization of SVGs. We demonstrate the performance of spoon in both simulated and real SRT data. A software implementation of our method is available at https://bioconductor.org/packages/spoon.
Title: Addressing the mean-variance relationship in spatially resolved transcriptomics data with spoon. Journal: Biostatistics, volume 26, issue 1. PMC PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12166475/pdf/
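The mean-variance relationship described in this abstract can be illustrated with a small calculation. The sketch below is not the spoon method; it assumes, purely for illustration, that counts follow a Poisson distribution and that the transformation is log(1 + count). The function name `poisson_moments` and the three expression levels are hypothetical.

```python
import math

def poisson_moments(mu, transform):
    """Exact mean and variance of transform(X) for X ~ Poisson(mu)."""
    k_max = int(mu + 12 * math.sqrt(mu) + 30)  # truncate a negligible upper tail
    mean = 0.0
    second = 0.0
    for k in range(k_max + 1):
        # Poisson pmf computed on the log scale for numerical stability.
        pmf = math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))
        value = transform(k)
        mean += pmf * value
        second += pmf * value * value
    return mean, second - mean * mean

means = [5.0, 50.0, 500.0]  # hypothetical low, medium, and high expression levels
raw_var = [poisson_moments(mu, float)[1] for mu in means]
log_var = [poisson_moments(mu, math.log1p)[1] for mu in means]

for mu, rv, lv in zip(means, raw_var, log_var):
    print(f"mean={mu:6.1f}  var(counts)={rv:8.3f}  var(log1p counts)={lv:.4f}")
```

Under these assumptions the variance of the raw counts grows with the mean, while the variance of the log-transformed counts shrinks, which is the pattern the abstract describes for highly expressed genes.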
Pub Date: 2024-12-31 DOI: 10.1093/biostatistics/kxaf010
Yizhen Xu, Scott Zeger, Zheyu Wang
The preclinical stage of many neurodegenerative diseases can span decades before symptoms become apparent. Understanding the sequence of preclinical biomarker changes provides a critical opportunity for early diagnosis and effective intervention prior to significant loss of patients' brain functions. The main challenge to early detection lies in the absence of direct observation of the disease state and the considerable variability in both biomarkers and disease dynamics among individuals. Recent research hypothesized the existence of subgroups with distinct biomarker patterns due to co-morbidities and degrees of brain resilience. Our ability to diagnose early and intervene during the preclinical stage of neurodegenerative diseases will be enhanced by further insights into heterogeneity in the biomarker-disease relationship. In this article, we focus on Alzheimer's disease (AD) and attempt to identify the systematic patterns within the heterogeneous AD biomarker-disease cascade. Specifically, we quantify the disease progression using a dynamic latent variable whose mixture distribution represents patient subgroups. Model estimation uses Hamiltonian Monte Carlo with the number of clusters determined by the Bayesian Information Criterion. We report simulation studies that investigate the performance of the proposed model in finite sample settings that are similar to our motivating application. We apply the proposed model to the Biomarkers of Cognitive Decline Among Normal Individuals data, a longitudinal study that was conducted over 2 decades among individuals who were initially cognitively normal. Our application yields evidence consistent with the hypothetical model of biomarker dynamics presented in Jack Jr et al. In addition, our analysis identified 2 subgroups with distinct disease-onset patterns. Finally, we develop a dynamic prediction approach to improve the precision of prognoses.
Title: Probabilistic clustering using shared latent variable model for assessing Alzheimer's disease biomarkers. Journal: Biostatistics, volume 26, issue 1. PMC PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12054513/pdf/
Pub Date: 2024-12-31 DOI: 10.1093/biostatistics/kxaf024
Phillip B Nicol, Jeffrey W Miller
Dimensionality reduction is a critical step in the analysis of single-cell RNA-seq (scRNA-seq) data. The standard approach is to apply a transformation to the count matrix followed by principal components analysis (PCA). However, this approach can induce spurious heterogeneity and mask true biological variability. An alternative approach is to directly model the counts, but existing methods tend to be computationally intractable on large datasets and do not quantify uncertainty in the low-dimensional representation. To address these problems, we develop scGBM, a novel method for model-based dimensionality reduction of scRNA-seq data using a Poisson bilinear model. We introduce a fast estimation algorithm to fit the model using iteratively reweighted singular value decompositions, enabling the method to scale to datasets with millions of cells. Furthermore, scGBM quantifies the uncertainty in each cell's latent position and leverages these uncertainties to assess the confidence associated with a given cell clustering. On real and simulated single-cell data, we find that scGBM produces low-dimensional embeddings that better capture relevant biological information while removing unwanted variation.
Title: Model-based dimensionality reduction for single-cell RNA-seq using generalized bilinear models. Journal: Biostatistics, volume 26, issue 1. PMC PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12342792/pdf/
Pub Date: 2024-12-31 DOI: 10.1093/biostatistics/kxaf039
Tom Chen, Fan Li, Rui Wang
Estimating parameters corresponding to mean outcomes and their intricate association structures in cluster randomized trials (CRTs) can pose significant methodological challenges. This paper introduces a novel framework that leverages network concepts to represent complex dependency structures and estimate these parameters using generalized estimating equations (GEE). We focus on modeling complex correlation structures by partitioning observations into potentially overlapping groups of interrelated data, where observations are assumed locally exchangeable within each group. This network GEE framework is inherently flexible, and we demonstrate its application to multiple exchangeable structures (simple, nested, block), moving average structures, and exponential decay structures. Furthermore, to address computational challenges arising in GEEs with large cluster sizes, we present the networkGEE R package, enabling the fitting of models beyond the capabilities of existing statistical software. The proposed methods are evaluated through extensive simulation studies. To illustrate their practical application, we analyze data from the Washington State Expedited Partners Therapy trial, a stepped-wedge CRT designed to assess the impact of a public health intervention aimed at reducing sexually transmitted infections through free patient-delivered partner therapy.
Title: Network generalized estimating equations for complexly correlated data with applications to cluster randomized trials. Journal: Biostatistics, volume 26, issue 1.
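The correlation structures named in this abstract can be made concrete with a short sketch. The code below does not use the networkGEE package or its API; it only builds, for one cluster, the working correlation matrices commonly called nested exchangeable and exponential decay in the cluster randomized trial literature. The function names and parameter values are hypothetical.

```python
def nested_exchangeable(periods, alpha0, alpha1):
    """Correlation alpha0 within a period and alpha1 between periods.

    periods[i] is the period in which observation i was measured.
    Setting alpha0 == alpha1 gives the simple exchangeable structure.
    """
    n = len(periods)
    return [[1.0 if i == j
             else alpha0 if periods[i] == periods[j]
             else alpha1
             for j in range(n)] for i in range(n)]

def exponential_decay(periods, rho, r):
    """Correlation rho * r**|s - t| for observations in periods s and t."""
    n = len(periods)
    return [[1.0 if i == j
             else rho * r ** abs(periods[i] - periods[j])
             for j in range(n)] for i in range(n)]

# Hypothetical cluster: two observations in period 1, one each in periods 2 and 3.
periods = [1, 1, 2, 3]
nested = nested_exchangeable(periods, alpha0=0.1, alpha1=0.05)
decay = exponential_decay(periods, rho=0.1, r=0.5)

for row in decay:
    print(["%.3f" % value for value in row])
```

Both matrices have size equal to the cluster size squared, which is one way to see why large clusters create the computational burden the abstract mentions.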
Pub Date: 2024-12-31 DOI: 10.1093/biostatistics/kxaf040
Cathy Shyr, Boyu Ren, Prasad Patil, Giovanni Parmigiani
Heterogeneous treatment effect (HTE) refers to the nonrandom, explainable variation in treatment effects for individuals in a population. HTE estimation is central to precision medicine, where accurate effect estimates can inform personalized treatment decisions. In practice, patients can present with covariate profiles that overlap with multiple studies, raising the challenge of optimally informing treatment decisions in a multi-study setting. We proposed a flexible statistical machine learning (ML) framework, the multi-study $ R $-learner, that leverages multiple studies to estimate the HTE. Existing multi-study approaches often assume that study-specific (i) conditional average treatment effect (CATE), (ii) expected potential outcome under no treatment given covariates, and (iii) treatment assignment mechanism are identical across studies, but these assumptions may not hold in practice due to differences in study populations, protocols, or designs. To this end, we developed our framework to directly account for these three types of between-study heterogeneity. It builds upon recent advances in cross-study learning and uses a data-adaptive objective function to combine cross-study estimates of nuisance functions with study-specific CATEs via membership probabilities, which enable information to be borrowed across studies. The multi-study $ R $-learner extends the $ R $-learner to the multi-study setting and is flexible in its ability to incorporate ML techniques. In the series estimation framework, we showed that the proposed method is asymptotically normal and more efficient than the $ R $-learner when there is between-study heterogeneity in the treatment assignment mechanisms. We illustrated using cancer data from randomized controlled trials and observational studies that the multi-study $ R $-learner performs favorably in the presence of between-study heterogeneity.
Title: Multi-study R-learner for estimating heterogeneous treatment effects across studies using statistical machine learning. Journal: Biostatistics, volume 26, issue 1. PMC PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12713001/pdf/
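For readers unfamiliar with the single-study R-learner that this work extends, the sketch below shows its residual-on-residual idea in the simplest possible case. It is not the multi-study method: it assumes one study, a constant treatment effect, and oracle (known) nuisance functions in place of the cross-fitted machine learning estimates used in practice. All names and simulated values are hypothetical.

```python
import random

def r_learner_constant_effect(y, w, m_hat, e_hat):
    """Minimize sum(((y - m_hat) - (w - e_hat) * tau) ** 2) over a constant tau."""
    numerator = sum((wi - ei) * (yi - mi)
                    for yi, wi, mi, ei in zip(y, w, m_hat, e_hat))
    denominator = sum((wi - ei) ** 2 for wi, ei in zip(w, e_hat))
    return numerator / denominator

rng = random.Random(0)
true_tau = 1.5  # hypothetical constant treatment effect
y, w, m_hat, e_hat = [], [], [], []
for _ in range(4000):
    x = rng.random()
    e = 0.3 + 0.4 * x                      # propensity score P(W = 1 | X = x)
    treated = 1.0 if rng.random() < e else 0.0
    baseline = 2.0 * x                     # expected outcome under no treatment
    outcome = baseline + true_tau * treated + rng.gauss(0.0, 0.5)
    y.append(outcome)
    w.append(treated)
    m_hat.append(baseline + true_tau * e)  # oracle m(x) = E[Y | X = x]
    e_hat.append(e)

tau_hat = r_learner_constant_effect(y, w, m_hat, e_hat)
print(f"true tau = {true_tau}, R-learner estimate = {tau_hat:.3f}")
```

The two nuisance functions here, m(x) and e(x), correspond to the quantities that the abstract says may differ between studies, which is what motivates the multi-study extension.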
Pub Date: 2024-12-31 DOI: 10.1093/biostatistics/kxaf029
Daniel A Spencer, Rene Gutierrez, Rajarshi Guhaniyogi, Russell T Shinohara, Raquel Prado
Modeling with multidimensional arrays, or tensors, often presents a problem due to high dimensionality. In addition, these structures typically exhibit inherent sparsity, requiring the use of regularization methods to properly characterize an association between a tensor covariate and a scalar response. We propose a Bayesian method to efficiently model a scalar response with a tensor covariate using the Tucker tensor decomposition in order to retain the spatial relationship within a tensor coefficient, while reducing the number of parameters varying within the model and applying regularization methods. Simulated data are analyzed to compare the model to recently proposed methods. A neuroimaging analysis using data from the Alzheimer's Disease Neuroimaging Initiative shows improved inferential performance compared with other tensor regression methods. Keywords: Bayesian analysis; tensor decomposition; image analysis; spatial statistics; statistical modeling.
Title: Bayesian scalar-on-tensor regression using the Tucker decomposition for sparse spatial modeling. Journal: Biostatistics, volume 26, issue 1.
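The parameter reduction that a Tucker decomposition offers can be shown with simple arithmetic. The sketch below counts free entries only (it ignores identifiability constraints and says nothing about the Bayesian priors used in the paper); the image size and ranks are hypothetical.

```python
from math import prod

def full_tensor_parameters(dims):
    """Number of entries in an unstructured tensor coefficient."""
    return prod(dims)

def tucker_parameters(dims, ranks):
    """Core tensor entries plus one dims[k] x ranks[k] factor matrix per mode."""
    return prod(ranks) + sum(d * r for d, r in zip(dims, ranks))

dims = (64, 64, 64)  # hypothetical image size
ranks = (3, 3, 3)    # hypothetical Tucker ranks
print(full_tensor_parameters(dims))    # 262144
print(tucker_parameters(dims, ranks))  # 603
```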
Pub Date: 2024-12-31 DOI: 10.1093/biostatistics/kxae026
Danni Tu, Julia Wrobel, Theodore D Satterthwaite, Jeff Goldsmith, Ruben C Gur, Raquel E Gur, Jan Gertheiss, Dani S Bassett, Russell T Shinohara
In the brain, functional connections form a network whose topological organization can be described by graph-theoretic network diagnostics. These include characterizations of the community structure, such as modularity and participation coefficient, which have been shown to change over the course of childhood and adolescence. To investigate if such changes in the functional network are associated with changes in cognitive performance during development, network studies often rely on an arbitrary choice of preprocessing parameters, in particular the proportional threshold of network edges. Because the choice of parameter can impact the value of the network diagnostic, and therefore downstream conclusions, we propose to circumvent that choice by conceptualizing the network diagnostic as a function of the parameter. As opposed to a single value, a network diagnostic curve describes the connectome topology at multiple scales, from the sparsest group of the strongest edges to the entire edge set. To relate these curves to executive function and other covariates, we use scalar-on-function regression, which is more flexible than previous functional data-based models used in network neuroscience. We then consider how systematic differences between networks can manifest in misalignment of diagnostic curves, and consequently propose a supervised curve alignment method that incorporates auxiliary information from other variables. Our algorithm performs both functional regression and alignment via an iterative, penalized, and nonlinear likelihood optimization. The illustrated method has the potential to improve the interpretability and generalizability of neuroscience studies where the goal is to study heterogeneity among a mixture of function- and scalar-valued measures.
Title: Regression and alignment for functional data and network topology. Journal: Biostatistics. PMC PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11822954/pdf/
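The idea of treating a network diagnostic as a function of the proportional threshold can be sketched in a few lines. The code below is only an illustration: it uses the number of connected components as a stand-in for diagnostics such as modularity or participation coefficient, and the 5-node weighted network is hypothetical.

```python
def count_components(n_nodes, edges):
    """Number of connected components, via union-find."""
    parent = list(range(n_nodes))

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(node) for node in range(n_nodes)})

def diagnostic_curve(n_nodes, weighted_edges, diagnostic):
    """Evaluate `diagnostic` on the k strongest edges, for every k."""
    ranked = sorted(weighted_edges, key=lambda edge: edge[2], reverse=True)
    curve = []
    for k in range(1, len(ranked) + 1):
        kept = [(a, b) for a, b, _ in ranked[:k]]
        curve.append((k / len(ranked), diagnostic(n_nodes, kept)))
    return curve

# Hypothetical 5-node connectome: (node, node, connection strength).
edges = [(0, 1, 0.9), (0, 2, 0.8), (1, 2, 0.7), (3, 4, 0.6), (2, 3, 0.5),
         (0, 3, 0.4), (1, 3, 0.3), (0, 4, 0.2), (1, 4, 0.1), (2, 4, 0.05)]
curve = diagnostic_curve(5, edges, count_components)
for proportion, value in curve:
    print(f"threshold={proportion:.1f}  components={value}")
```

Each subject would contribute one such curve, and it is these curves, rather than a single thresholded value, that would enter a scalar-on-function regression.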
Pub Date: 2024-12-31 DOI: 10.1093/biostatistics/kxaf001
Sandra E Safo, Han Lu
There is still more to learn about the pathobiology of coronavirus disease (COVID-19) despite 4 years of the pandemic. A multiomics approach offers a comprehensive view of the disease and has the potential to yield deeper insight into the pathogenesis of the disease. Previous multiomics integrative analysis and prediction studies for COVID-19 severity and status have assumed simple relationships (i.e., linear relationships) between omics data and between omics and COVID-19 outcomes. However, these linear methods do not account for the inherent underlying nonlinear structure associated with these different types of data. The motivation behind this work is to model nonlinear relationships in multiomics and COVID-19 outcomes, and to determine key multidimensional molecules associated with the disease. Toward this goal, we develop scalable randomized kernel methods for jointly associating data from multiple sources or views and simultaneously predicting an outcome or classifying a unit into one of 2 or more classes. We also determine variables or groups of variables that best contribute to the relationships among the views. We use the idea that random Fourier bases can approximate shift-invariant kernel functions to construct nonlinear mappings of each view and we use these mappings and the outcome variable to learn view-independent low-dimensional representations. We demonstrate the effectiveness of the proposed methods through extensive simulations. When the proposed methods were applied to gene expression, metabolomics, proteomics, and lipidomics data pertaining to COVID-19, we identified several molecular signatures for COVID-19 status and severity. Our results agree with previous findings and suggest potential avenues for future research. Our algorithms are implemented in PyTorch and interfaced in R and available at: https://github.com/lasandrall/RandMVLearn.
{"title":"Scalable randomized kernel methods for multiview data integration and prediction with application to Coronavirus disease.","authors":"Sandra E Safo, Han Lu","doi":"10.1093/biostatistics/kxaf001","DOIUrl":"10.1093/biostatistics/kxaf001","url":null,"PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":"26 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11839864/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143460884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Mathematics","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
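The random Fourier idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not the RandMVLearn implementation: it only shows, for a Gaussian (RBF) kernel chosen here as an assumed example of a shift-invariant kernel, that inner products of random cosine features approximate kernel values. The names `rff_map` and `rbf_kernel`, the bandwidth `gamma`, and the feature count `D` are all hypothetical choices for illustration.

```python
import math
import random


def rff_map(x, W, b):
    """Map one point x to D random Fourier features (illustrative helper)."""
    D = len(W)
    scale = math.sqrt(2.0 / D)
    return [scale * math.cos(sum(wj * xj for wj, xj in zip(w, x)) + bi)
            for w, bi in zip(W, b)]


def rbf_kernel(x, y, gamma):
    """Exact Gaussian kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - c) ** 2 for a, c in zip(x, y)))


rng = random.Random(0)
d, D, gamma = 3, 2000, 0.5  # input dimension, number of features, bandwidth (assumed values)

# Frequencies drawn from the kernel's spectral density N(0, 2*gamma*I);
# phases drawn uniformly on [0, 2*pi).
W = [[rng.gauss(0.0, math.sqrt(2.0 * gamma)) for _ in range(d)] for _ in range(D)]
b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(D)]

points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(10)]
Z = [rff_map(x, W, b) for x in points]

# Compare the feature inner products with the exact kernel values.
max_err = max(
    abs(sum(zi * zj for zi, zj in zip(Z[i], Z[j]))
        - rbf_kernel(points[i], points[j], gamma))
    for i in range(10) for j in range(10)
)
print(f"max abs approximation error: {max_err:.3f}")
```

The approximation error shrinks roughly like one over the square root of `D`, which is why a finite-dimensional linear method applied to the mapped features can stand in for a kernel method at lower cost.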
Pub Date: 2024-12-31 DOI: 10.1093/biostatistics/kxaf020
Jonathan Boss, Wei Hao, Amber Cathey, Barrett M Welch, Kelly K Ferguson, John D Meeker, Xiang Zhou, Jian Kang, Bhramar Mukherjee
Environmental health studies are increasingly measuring endogenous omics data ($ \boldsymbol{M} $) to study intermediary biological pathways by which an exogenous exposure ($ \boldsymbol{A} $) affects a health outcome ($ \boldsymbol{Y} $), given confounders ($ \boldsymbol{C} $). Mediation analysis is frequently performed to understand such mechanisms. If intermediary pathways are of interest, then there is likely literature establishing statistical and biological significance of the total effect, defined as the effect of $ \boldsymbol{A} $ on $ \boldsymbol{Y} $ given $ \boldsymbol{C} $. For mediation models with continuous outcomes and mediators, we show that leveraging external summary-level information on the total effect can improve estimation efficiency of the direct and indirect effects. Moreover, the efficiency gain depends on the asymptotic partial $ R^{2} $ between the outcome ($ \boldsymbol{Y}\mid\boldsymbol{M},\boldsymbol{A},\boldsymbol{C} $) and total effect ($ \boldsymbol{Y}\mid\boldsymbol{A},\boldsymbol{C} $) models, with smaller (larger) values benefiting direct (indirect) effect estimation. We propose a robust data-adaptive estimation procedure, Mediation with External Summary Statistic Information, to improve estimation efficiency in settings with congenial external information, while simultaneously protecting against bias in settings with incongenial external information. In congenial simulation scenarios, we observe relative efficiency gains for mediation effect estimation of up to 40%. We illustrate our methodology using data from the Puerto Rico Testsite for Exploring Contamination Threats, where cytochrome P450 metabolites are hypothesized to mediate the effect of phthalate exposure on gestational age at delivery. External summary information on the total effect comes from a recently published pooled analysis of 16 studies. The proposed framework blends mediation analysis with emerging data integration techniques.
{"title":"Mediation with External Summary Statistic Information.","authors":"Jonathan Boss, Wei Hao, Amber Cathey, Barrett M Welch, Kelly K Ferguson, John D Meeker, Xiang Zhou, Jian Kang, Bhramar Mukherjee","doi":"10.1093/biostatistics/kxaf020","DOIUrl":"10.1093/biostatistics/kxaf020","url":null,"PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":"26 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12302958/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144735537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Mathematics","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
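For the mediation entry above, the relationship between the total, direct, and indirect effects can be sketched for the simplest case: one continuous mediator, one confounder, and linear models fitted by ordinary least squares. This is only the classical product-of-coefficients decomposition, not the authors' estimator, and it uses no external summary information. The simulated coefficients and the helper `ols` are assumptions made for illustration.

```python
import random


def ols(X, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    G = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda k: abs(G[k][c]))
        G[c], G[piv] = G[piv], G[c]
        for k in range(p):
            if k != c:
                f = G[k][c] / G[c][c]
                G[k] = [u - f * v for u, v in zip(G[k], G[c])]
    return [G[i][p] / G[i][i] for i in range(p)]


rng = random.Random(1)
n = 500
A = [rng.gauss(0, 1) for _ in range(n)]  # exposure
C = [rng.gauss(0, 1) for _ in range(n)]  # confounder
M = [0.5 * a + 0.3 * c + rng.gauss(0, 1) for a, c in zip(A, C)]  # mediator
Y = [0.2 * a + 0.8 * m + 0.4 * c + rng.gauss(0, 1)
     for a, m, c in zip(A, M, C)]  # outcome

# Mediator model M | A, C: coefficient of A
alpha_A = ols([[1, a, c] for a, c in zip(A, C)], M)[1]
# Outcome model Y | M, A, C: direct effect of A and coefficient of M
outcome = ols([[1, a, m, c] for a, m, c in zip(A, M, C)], Y)
direct, theta_M = outcome[1], outcome[2]
# Total effect model Y | A, C: coefficient of A
total = ols([[1, a, c] for a, c in zip(A, C)], Y)[1]

indirect = alpha_A * theta_M
print(f"total={total:.3f} direct={direct:.3f} indirect={indirect:.3f}")
```

With least-squares fits that share the same confounder set, the estimated total effect equals the direct effect plus the indirect effect up to floating-point error. That link is what makes information about the total effect relevant to the other two quantities.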
Pub Date: 2024-12-31 DOI: 10.1093/biostatistics/kxaf030
Olivier Labayle, Breeshey Roskams-Hieter, Joshua Slaughter, Kelsey Tetley-Campbell, Mark J van der Laan, Chris P Ponting, Sjoerd V Beentjes, Ava Khamseh
Population genetics seeks to quantify DNA variant associations with traits or diseases, as well as interactions among variants and with environmental factors. Computing millions of estimates in large cohorts, in which small effect sizes and tight confidence intervals are expected, necessitates minimizing model-misspecification bias to increase power and control false discoveries. We present TarGene, a unified statistical workflow for the semi-parametric efficient and doubly robust estimation of genetic effects, including $ k $-point interactions among categorical variables, in the presence of confounding and weak population dependence. $ k $-point interactions, or Average Interaction Effects (AIEs), are a direct generalization of the usual average treatment effect (ATE). We estimate genetic effects with cross-validated and/or weighted versions of Targeted Minimum Loss-based Estimators (TMLE) and One-Step Estimators (OSE). The effect of dependence among data units on variance estimates is corrected by using sieve plateau variance estimators based on genetic relatedness across the units. We present extensive realistic simulations to demonstrate power, coverage, and control of type I error. Our motivating application is the targeted estimation of genetic effects on traits, including two-point and higher-order gene-gene and gene-environment interactions, in large-scale genomic databases such as UK Biobank and All of Us. All cross-validated and/or weighted TMLE and OSE for the AIE $ k $-point interaction, as well as ATEs, conditional ATEs and functions thereof, are implemented in the general-purpose Julia package TMLE.jl. For high-throughput applications in population genomics, we provide the open-source Nextflow pipeline and software TarGene, which integrates seamlessly with modern high-performance and cloud computing platforms.
{"title":"Semiparametric efficient estimation of small genetic effects in large-scale population cohorts.","authors":"Olivier Labayle, Breeshey Roskams-Hieter, Joshua Slaughter, Kelsey Tetley-Campbell, Mark J van der Laan, Chris P Ponting, Sjoerd V Beentjes, Ava Khamseh","doi":"10.1093/biostatistics/kxaf030","DOIUrl":"10.1093/biostatistics/kxaf030","url":null,"PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":"26 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12479317/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145193867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Mathematics","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
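To make concrete the statement in the abstract above that $ k $-point interactions generalize the ATE, the following writes out one common form of the two estimands for binary variables. The notation is assumed here for illustration ($ Y $ for the trait, $ T $, $ T_1 $, $ T_2 $ for binary variants or exposures, $ W $ for confounders); the paper's own definition, which covers general categorical variables and arbitrary $ k $, may differ in detail.

```latex
\mathrm{ATE}_{0 \to 1}
  = \mathbb{E}_W\big[\, \mathbb{E}[Y \mid T = 1, W] - \mathbb{E}[Y \mid T = 0, W] \,\big]

\mathrm{AIE}_{(0,0) \to (1,1)}
  = \mathbb{E}_W\big[\, \mathbb{E}[Y \mid T_1 = 1, T_2 = 1, W]
      - \mathbb{E}[Y \mid T_1 = 1, T_2 = 0, W]
      - \mathbb{E}[Y \mid T_1 = 0, T_2 = 1, W]
      + \mathbb{E}[Y \mid T_1 = 0, T_2 = 0, W] \,\big]
```

Under this form, the two-point interaction is the difference between the effect of changing $ T_1 $ when $ T_2 = 1 $ and the same effect when $ T_2 = 0 $. It is zero when the two variables act additively on the trait.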