Jennifer F Bobb, Stephen J Mooney, Maricela Cruz, Anne Vernez Moudon, Adam Drewnowski, David Arterburn, Andrea J Cook
Distributed lag models (DLMs) estimate the health effects of exposure over multiple time lags prior to the outcome and are widely used in time series studies. Applying DLMs to retrospective cohort studies is challenging due to inconsistent lengths of exposure history across participants, which is common when using electronic health record databases. A standard approach is to define subcohorts of individuals with some minimum exposure history, but this limits power and may amplify selection bias. We propose alternative full-cohort methods that use all available data while simultaneously enabling examination of the longest time lag estimable in the cohort. Through simulation studies, we find that restricting to a subcohort can lead to biased estimates of exposure effects due to confounding by correlated exposures at more distant lags. By contrast, full-cohort methods that incorporate multiple imputation of complete exposure histories can avoid this bias to efficiently estimate lagged and cumulative effects. Applying full-cohort DLMs to a study examining the association between residential density (a proxy for walkability) over 12 years and body weight, we find evidence of an immediate effect in the prior 1-2 years. We also observed an association at the maximal lag considered (12 years prior), which we posit reflects an earlier (≥ 12 years) or incrementally increasing prior effect over time. DLMs can be efficiently incorporated within retrospective cohort studies to identify critical windows of exposure.
"Distributed lag models for retrospective cohort data with application to a study of built environment and body weight." Biometrics 81(1), 2025. DOI: 10.1093/biomtc/ujae166. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11760659/pdf/
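To make the lag structure concrete: a distributed lag model regresses the outcome on exposure at lags 0 through L simultaneously, so correlated exposure histories can confound estimates when distant lags are omitted. Below is a minimal Python sketch of an unconstrained DLM fit by ordinary least squares on simulated autocorrelated exposure histories; all names and parameter values are hypothetical, and this is not the authors' imputation-based estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n subjects, each with L+1 yearly exposure lags.
n, L = 500, 4
true_theta = np.array([0.5, 0.3, 0.1, 0.0, 0.0])  # effect concentrated at recent lags

# Simulate AR(1)-correlated exposure histories (lags are correlated,
# which is the source of confounding when distant lags are dropped).
X = rng.normal(size=(n, L + 1))
for j in range(1, L + 1):
    X[:, j] = 0.7 * X[:, j - 1] + np.sqrt(1 - 0.7**2) * X[:, j]

y = X @ true_theta + rng.normal(scale=0.5, size=n)

# Unconstrained distributed lag fit: one coefficient per lag.
theta_hat, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), X]), y, rcond=None)
lag_effects = theta_hat[1:]
cumulative_effect = lag_effects.sum()
```

With shorter exposure histories, the columns for distant lags would be missing for some subjects, which is the gap the full-cohort multiple-imputation approach is designed to fill.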
Motivated by the need for computationally tractable spatial methods in neuroimaging studies, we develop a distributed and integrated framework for estimation and inference of Gaussian process model parameters with ultra-high-dimensional likelihoods. We propose a shift in viewpoint from whole to local data perspectives that is rooted in distributed model building and integrated estimation and inference. The framework's backbone is a computationally and statistically efficient integration procedure that simultaneously incorporates dependence within and between spatial resolutions in a recursively partitioned spatial domain. Statistical and computational properties of our distributed approach are investigated theoretically and in simulations. The proposed approach is used to extract new insights into autism spectrum disorder from the autism brain imaging data exchange.
Emily C Hector, Brian J Reich, Ani Eloyan. "Distributed model building and recursive integration for big spatial data modeling." Biometrics 81(1), 2025. DOI: 10.1093/biomtc/ujae159.
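The core tractability idea, evaluating likelihoods on pieces of a partitioned spatial domain, can be sketched generically. Below is a block-independent composite log-likelihood for a zero-mean Gaussian process: each block's likelihood is exact and cross-block dependence is simply dropped. This is a deliberate simplification for illustration; the paper's recursive integration procedure additionally accounts for dependence within and between resolutions. The squared-exponential kernel and all parameter values are assumptions of this sketch.

```python
import numpy as np

def block_gp_loglik(coords, y, blocks, var=1.0, length=1.0, nugget=1e-6):
    """Block-independent composite log-likelihood for a zero-mean Gaussian
    process with a squared-exponential kernel. `blocks` is a list of index
    arrays partitioning the observations; cross-block covariance is ignored."""
    ll = 0.0
    for idx in blocks:
        d = np.linalg.norm(coords[idx, None, :] - coords[None, idx, :], axis=-1)
        K = var * np.exp(-0.5 * (d / length) ** 2) + nugget * np.eye(len(idx))
        _, logdet = np.linalg.slogdet(K)
        yb = y[idx]
        ll -= 0.5 * (logdet + yb @ np.linalg.solve(K, yb) + len(idx) * np.log(2 * np.pi))
    return ll

# With a single observation, the composite likelihood is the exact
# N(0, var + nugget) log-density.
one = block_gp_loglik(np.zeros((1, 2)), np.array([2.0]), [np.array([0])])
```

Partitioning replaces one O(n³) Cholesky factorization with many small ones, which is what makes ultra-high-dimensional likelihoods tractable.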
Camila Olarte Parra, Rhian M Daniel, David Wright, Jonathan W Bartlett
The ICH E9 addendum on estimands in clinical trials provides a framework for precisely defining the treatment effect that is to be estimated, but says little about estimation methods. Here, we report analyses of a clinical trial in type 2 diabetes, targeting the effects of randomized treatment and handling rescue treatment and discontinuation of randomized treatment using the so-called hypothetical strategy. We show how the resulting estimands can be estimated using mixed models for repeated measures, multiple imputation, inverse probability of treatment weighting, the G-formula, and G-estimation. We describe their assumptions and practical details of their implementation using packages in R. We report the results of these analyses, broadly finding similar estimates and standard errors across the estimators. We discuss various considerations relevant when choosing an estimation approach, including computational time, how to handle missing data, whether to include post intercurrent event data in the analysis, whether and how to adjust for additional time-varying confounders, and whether and how to model different types of intercurrent event data separately.
"Estimating hypothetical estimands with causal inference and missing data estimators in a diabetes trial case study." Biometrics 81(1), 2025. DOI: 10.1093/biomtc/ujae167.
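Among the estimators compared, the G-formula (standardization) is the simplest to sketch: fit an outcome model, then contrast the average predictions made for the whole sample under each treatment level. Below is a hypothetical single-time-point version in Python; the trial analysis itself involves repeated measures and intercurrent events, and all names and data-generating values here are illustrative.

```python
import numpy as np

# Hypothetical trial data: baseline covariate x, randomized treatment a,
# continuous outcome y, with a true treatment effect of 1.5.
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
a = rng.integers(0, 2, size=n).astype(float)
y = 1.0 + 0.5 * x + 1.5 * a + rng.normal(size=n)

# G-formula (standardization): fit an outcome model, then average the
# predictions made for every participant under a=1 and under a=0.
Z = np.column_stack([np.ones(n), x, a])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
pred1 = np.column_stack([np.ones(n), x, np.ones(n)]) @ beta
pred0 = np.column_stack([np.ones(n), x, np.zeros(n)]) @ beta
effect = pred1.mean() - pred0.mean()
```

For a linear outcome model the standardized contrast coincides with the treatment coefficient; the two diverge under non-identity links, which is why the g-computation step matters there.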
We introduce a novel meta-analysis framework to combine dependent tests under a general setting, and utilize it to synthesize various microbiome association tests that are calculated from the same dataset. Our development builds upon the classical meta-analysis methods of aggregating P-values and also a more recent general method of combining confidence distributions, but makes generalizations to handle dependent tests. The proposed framework ensures rigorous statistical guarantees, and we provide a comprehensive study and compare it with various existing dependent combination methods. Notably, we demonstrate that the widely used Cauchy combination method for dependent tests, referred to as the vanilla Cauchy combination in this article, can be viewed as a special case within our framework. Moreover, the proposed framework provides a way to address the problem when the distributional assumptions underlying the vanilla Cauchy combination are violated. Our numerical results demonstrate that ignoring the dependence among the to-be-combined components can lead to severe size distortion. Compared with existing P-value combination methods, including the vanilla Cauchy combination, the proposed framework is flexible, handles the dependence accurately, and uses the available information efficiently to construct tests with accurate size and enhanced power. The development is applied to microbiome association studies, where we aggregate information from multiple existing tests using the same dataset. The combined tests harness the strengths of each individual test across a wide range of alternative spaces, enabling more efficient and meaningful discoveries of vital microbiome associations.
Xiufan Yu, Linjun Zhang, Arun Srinivasan, Min-Ge Xie, Lingzhou Xue. "A unified combination framework for dependent tests with applications to microbiome association studies." Biometrics 81(1), 2025. DOI: 10.1093/biomtc/ujaf001. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11783248/pdf/
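For reference, the vanilla Cauchy combination discussed above has a simple closed form: transform each p-value to a standard Cauchy variate, average the variates (optionally with weights), and map the average back to a p-value. A minimal sketch, with the function name chosen here for illustration:

```python
import numpy as np

def cauchy_combination(pvals, weights=None):
    """Vanilla Cauchy combination of possibly dependent p-values.
    Each p-value is mapped to tan((0.5 - p) * pi), a standard Cauchy
    variate; a convex combination of Cauchy variates is again Cauchy."""
    p = np.asarray(pvals, dtype=float)
    w = np.full(p.shape, 1.0 / p.size) if weights is None else np.asarray(weights, dtype=float)
    t = np.sum(w * np.tan((0.5 - p) * np.pi))
    return 0.5 - np.arctan(t) / np.pi  # upper-tail probability of a standard Cauchy
```

Its appeal is that the size is approximately valid under broad dependence when the constituent p-values are small; the framework above targets settings where the distributional assumptions behind this shortcut fail.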
The abundance of different cell types can vary substantially among patients with different phenotypes, and even among patients who share the same phenotype. Recent scientific advancements provide mounting evidence that other clinical variables, such as age, gender, and lifestyle habits, can also influence the abundance of certain cell types. However, current methods for integrating single-cell-level omics data with clinical variables are inadequate. In this study, we propose a regularized Bayesian Dirichlet-multinomial regression framework to investigate the relationship between single-cell RNA sequencing data and patient-level clinical data. Additionally, the model employs a novel hierarchical tree structure to identify such relationships at different cell-type levels. Our model successfully uncovers significant associations between specific cell types and clinical variables across three distinct diseases: pulmonary fibrosis, COVID-19, and non-small cell lung cancer. This integrative analysis provides biological insights and could potentially inform clinical interventions for various diseases.
Yanghong Guo, Lei Yu, Lei Guo, Lin Xu, Qiwei Li. "A regularized Bayesian Dirichlet-multinomial regression model for integrating single-cell-level omics and patient-level clinical study data." Biometrics 81(1), 2025. DOI: 10.1093/biomtc/ujaf005. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11783250/pdf/
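The building block of such a model is the Dirichlet-multinomial likelihood, which allows cell-type counts to be overdispersed relative to a plain multinomial. Below is the standard log-likelihood for a single observation, written with only the Python standard library; in the regression framework the concentration parameters would be linked to the clinical covariates, a step not shown here.

```python
from math import lgamma

def dm_loglik(counts, alpha):
    """Log-likelihood of one Dirichlet-multinomial observation.

    counts: integer cell-type counts for one subject.
    alpha:  positive concentration parameters, one per cell type.
    """
    n, a0 = sum(counts), sum(alpha)
    # Multinomial coefficient plus the ratio of Dirichlet normalizers.
    ll = lgamma(n + 1) + lgamma(a0) - lgamma(n + a0)
    for x, a in zip(counts, alpha):
        ll += lgamma(x + a) - lgamma(a) - lgamma(x + 1)
    return ll
```

With two categories and alpha = (1, 1) this reduces to the uniform beta-binomial, a convenient check: every split of n counts is equally likely.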
Michael R Schwob, Mevin B Hooten, Vagheesh Narasimhan
Mechanistic statistical models are commonly used to study the flow of biological processes. For example, in landscape genetics, the aim is to infer spatial mechanisms that govern gene flow in populations. Existing statistical approaches in landscape genetics do not account for temporal dependence in the data and may be computationally prohibitive. We infer mechanisms with a Bayesian hierarchical dyadic model that scales well with large data sets and that accounts for spatial and temporal dependence. We construct a fully connected network comprising spatio-temporal data for the dyadic model and use normalized composite likelihoods to account for the dependence structure in space and time. We develop a dyadic model to account for physical mechanisms commonly found in physical-statistical models and apply our methods to ancient human DNA data to infer the mechanisms that affected human movement in Bronze Age Europe.
"Composite dyadic models for spatio-temporal data." Biometrics 80(4), 2024. DOI: 10.1093/biomtc/ujae107.
A stepped wedge design is a unidirectional crossover design in which clusters are randomized to distinct treatment sequences. While model-based analysis of stepped wedge designs is standard practice for evaluating treatment effects while accounting for clustering and adjusting for covariates, the properties of these analyses under misspecification have not been systematically explored. In this article, we focus on model-based methods, including linear mixed models and generalized estimating equations with an independence, simple exchangeable, or nested exchangeable working correlation structure. We study when a potentially misspecified working model can offer consistent estimation of the marginal treatment effect estimands, which are defined nonparametrically with potential outcomes and may be functions of calendar time and/or exposure time. We prove a central result that consistency for nonparametric estimands usually requires a correctly specified treatment effect structure, but generally not the remaining aspects of the working model (functional form of covariates, random effects, and error distribution), and valid inference is obtained via the sandwich variance estimator. Furthermore, an additional g-computation step is required to achieve model-robust inference under non-identity link functions or for ratio estimands. The theoretical results are illustrated via several simulation experiments and re-analysis of a completed stepped wedge cluster randomized trial.
Bingkai Wang, Xueqi Wang, Fan Li. "How to achieve model-robust inference in stepped wedge trials with model-based methods?" Biometrics 80(4), 2024. DOI: 10.1093/biomtc/ujae123. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11536888/pdf/
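The sandwich variance estimator referenced above can be sketched in its simplest instance: ordinary least squares with a working-independence correlation, where cluster-level score contributions form the "meat" between two "bread" matrices. The stepped-wedge-flavored data below are entirely hypothetical, and this is a generic cluster-robust sketch, not the paper's full procedure.

```python
import numpy as np

def sandwich_ols(X, y, cluster):
    """OLS point estimates with a cluster-robust (sandwich) variance:
    bread = (X'X)^{-1}, meat = sum over clusters of outer products of
    the cluster-level scores X_g' r_g."""
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ (X.T @ y)
    resid = y - X @ beta
    p = X.shape[1]
    meat = np.zeros((p, p))
    for g in np.unique(cluster):
        idx = cluster == g
        score = X[idx].T @ resid[idx]
        meat += np.outer(score, score)
    return beta, bread @ meat @ bread

# Hypothetical clustered data: cluster-level treatment, random intercepts.
rng = np.random.default_rng(2)
G, m = 50, 20
cluster = np.repeat(np.arange(G), m)
treat = np.repeat(rng.integers(0, 2, size=G), m).astype(float)
y = 1.0 + 2.0 * treat + np.repeat(rng.normal(size=G), m) + rng.normal(size=G * m)
X = np.column_stack([np.ones(G * m), treat])
beta, V = sandwich_ols(X, y, cluster)
```

The point estimate is the working-independence fit; only the variance changes, which is exactly the sense in which inference remains valid under a misspecified correlation structure.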
The identification of surrogate markers is motivated by their potential to support earlier decisions about a treatment effect. However, few methods have been developed to actually use a surrogate marker to test for a treatment effect in a future study. Most existing methods consider combining surrogate marker and primary outcome information to test for a treatment effect, rely on fully parametric methods where strict parametric assumptions are made about the relationship between the surrogate and the outcome, and/or assume the surrogate marker is measured at only a single time point. Recent work has proposed a nonparametric test for a treatment effect using only surrogate marker information measured at a single time point by borrowing information learned from a prior study where both the surrogate and primary outcome were measured. In this paper, we utilize this nonparametric test and propose group sequential procedures that allow for early stopping of treatment effect testing in a setting where the surrogate marker is measured repeatedly over time. We derive the properties of the correlated surrogate-based nonparametric test statistics at multiple time points and compute stopping boundaries that allow for early stopping for a significant treatment effect, or for futility. We examine the performance of our proposed test using a simulation study and illustrate the method using data from two distinct AIDS clinical trials.
Layla Parast, Jay Bartroff. "Group sequential testing of a treatment effect using a surrogate marker." Biometrics 80(4), 2024. DOI: 10.1093/biomtc/ujae108. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11459368/pdf/
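Stopping boundaries of the kind derived in the paper must account for the correlation between test statistics at successive looks. As a generic illustration (not the surrogate-based statistics themselves), a constant Pocock-type boundary for equally spaced looks can be computed by Monte Carlo under the canonical independent-increments joint distribution; the function name and simulation settings are this sketch's assumptions.

```python
import numpy as np

def pocock_boundary(K, alpha=0.05, n_sim=200_000, seed=3):
    """Monte Carlo constant (Pocock-type) two-sided boundary for K equally
    spaced looks: find c with P(max_k |Z_k| > c) = alpha, where the Z_k
    follow the canonical independent-increments joint distribution."""
    rng = np.random.default_rng(seed)
    increments = rng.normal(size=(n_sim, K))
    # Z_k = S_k / sqrt(k): cumulative information standardized at each look.
    z = np.cumsum(increments, axis=1) / np.sqrt(np.arange(1, K + 1))
    return np.quantile(np.abs(z).max(axis=1), 1 - alpha)

c1 = pocock_boundary(1)  # no interim look: the fixed-sample critical value
c2 = pocock_boundary(2)  # one interim look: a larger per-look threshold
```

With K = 1 this recovers the usual critical value of about 1.96; spreading the 5% type I error over two looks raises the per-look threshold to roughly 2.18, matching Pocock's tabulated constant.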
Aaron J Molstad, Yanwei Cai, Alexander P Reiner, Charles Kooperberg, Wei Sun, Li Hsu
Ancestry-specific proteome-wide association studies (PWAS) based on genetically predicted protein expression can reveal complex disease etiology specific to certain ancestral groups. These studies require ancestry-specific models for protein expression as a function of SNP genotypes. In order to improve protein expression prediction in ancestral populations historically underrepresented in genomic studies, we propose a new penalized maximum likelihood estimator for fitting ancestry-specific joint protein quantitative trait loci models. Our estimator borrows information across ancestral groups, while simultaneously allowing for heterogeneous error variances and regression coefficients. We propose an alternative parameterization of our model that makes the objective function convex and the penalty scale invariant. To improve computational efficiency, we propose an approximate version of our method and study its theoretical properties. Our method provides a substantial improvement in protein expression prediction accuracy in individuals of African ancestry, and in a downstream PWAS analysis, leads to the discovery of multiple associations between protein expression and blood lipid traits in the African ancestry population.
"Heterogeneity-aware integrative regression for ancestry-specific association studies." Biometrics 80(4), 2024. DOI: 10.1093/biomtc/ujae109. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11492996/pdf/
This paper presents a robust alternative to the maximum likelihood estimator (MLE) for the polytomous logistic regression model, known as the family of minimum Rényi pseudodistance (RP) estimators. The proposed minimum RP estimators are parametrized by a tuning parameter α ≥ 0 and include the MLE as the special case α = 0. These estimators, along with a family of RP-based Wald-type tests, are shown to exhibit superior performance in the presence of misclassification errors. The paper includes an extensive simulation study and a real data example to illustrate the robustness of the proposed statistics.
Elena Castilla. "A new robust approach for the polytomous logistic regression model based on Rényi's pseudodistances." Biometrics 80(4), 2024. DOI: 10.1093/biomtc/ujae125.