Annika Jaitner, Krasimira Tsaneva-Atanasova, Rachel M. Freathy, Jack Bowden
Mendelian randomization (MR) is a framework to estimate the causal effect of a modifiable health exposure, drug target or pharmaceutical intervention on a downstream outcome by using genetic variants as instrumental variables. A crucial assumption allowing estimation of the average causal effect in MR, termed homogeneity, is that the causal effect does not vary across levels of any instrument used in the analysis. In contrast, the science of pharmacogenetics seeks to actively uncover and exploit genetically driven effect heterogeneity for the purposes of precision medicine. In this study, we consider a recently proposed method for performing pharmacogenetic analysis on observational data—the Triangulation WIthin a STudy (TWIST) framework—and explore how it can be combined with traditional MR approaches to properly characterise average causal effects and genetically driven effect heterogeneity. We propose two new methods which not only estimate the genetically driven effect heterogeneity but also enable the estimation of a causal effect in the genetic group with and without the risk allele separately. Both methods utilise homogeneity-respecting and homogeneity-violating genetic variants and rely on a different set of assumptions. Using data from the ALSPAC study, we apply our new methods to estimate the causal effect of smoking before and during pregnancy on offspring birth weight in mothers whose genetics mean they find it (relatively) easier or harder to quit smoking.
Exploring and Accounting for Genetically Driven Effect Heterogeneity in Mendelian Randomization. Genetic Epidemiology 49(1). Published 2024-09-22. DOI: 10.1002/gepi.22587. PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/gepi.22587
Xiaoran Liang, Ninon Mounier, Nicolas Apfel, Sara Khalid, Timothy M. Frayling, Jack Bowden
Mendelian randomization (MR) is an epidemiological approach that utilizes genetic variants as instrumental variables to estimate the causal effect of an exposure on a health outcome. This paper investigates an MR scenario in which genetic variants aggregate into clusters that identify heterogeneous causal effects. Such variant clusters are likely to emerge if they affect the exposure and outcome via distinct biological pathways. In the multi-outcome MR framework, where a shared exposure causally impacts several disease outcomes simultaneously, these variant clusters can provide insights into the common disease-causing mechanisms underpinning the co-occurrence of multiple long-term conditions, a phenomenon known as multimorbidity. To identify such variant clusters, we adapt the general method of agglomerative hierarchical clustering to the multi-sample summary-data MR setup, enabling cluster detection based on variant-specific ratio estimates. In particular, we tailor the method for multi-outcome MR to aid in elucidating the causal pathways through which a common risk factor contributes to multiple morbidities. We show in simulations that our “MR-AHC” method detects clusters with high accuracy, outperforming existing methods. We apply the method to investigate the causal effects of high body fat percentage on type 2 diabetes and osteoarthritis, uncovering interconnected cellular processes underlying this multimorbid disease pair.
Using clustering of genetic variants in Mendelian randomization to interrogate the causal pathways underlying multimorbidity from a common risk factor. Genetic Epidemiology 49(1). Published 2024-08-13. DOI: 10.1002/gepi.22582. PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/gepi.22582
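To make the idea of clustering variant-specific ratio estimates concrete, here is a minimal Python sketch. It is not the MR-AHC algorithm from the paper: it uses a generic one-dimensional agglomerative procedure that merges the two clusters with the closest means, and the stopping parameter `max_gap`, the function names and the summary statistics are all made up for illustration.

```python
from typing import List, Sequence


def ratio_estimates(beta_exposure: Sequence[float],
                    beta_outcome: Sequence[float]) -> List[float]:
    """Variant-specific ratio estimates: outcome association / exposure association."""
    return [by / bx for bx, by in zip(beta_exposure, beta_outcome)]


def cluster_ratios(ratios: Sequence[float], max_gap: float) -> List[List[int]]:
    """Toy agglomerative clustering of ratio estimates in one dimension.

    Every variant starts in its own cluster. The two clusters with the closest
    means are merged repeatedly until the smallest distance between cluster
    means exceeds `max_gap` (a hypothetical tuning parameter, not from the paper).
    Returns clusters as sorted lists of variant indices.
    """
    clusters = [[i] for i in range(len(ratios))]

    def mean(cluster: List[int]) -> float:
        return sum(ratios[i] for i in cluster) / len(cluster)

    while len(clusters) > 1:
        best = None  # (distance, index of first cluster, index of second cluster)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = abs(mean(clusters[a]) - mean(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > max_gap:
            break  # remaining clusters are too far apart to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Made-up summary statistics for six variants (illustration only).
beta_x = [0.10, 0.20, 0.10, 0.20, 0.10, 0.25]    # variant-exposure associations
beta_y = [0.05, 0.10, 0.052, 0.30, 0.15, 0.375]  # variant-outcome associations
ratios = ratio_estimates(beta_x, beta_y)
clusters = cluster_ratios(ratios, max_gap=0.2)
print(clusters)
```

In this toy input, three variants have ratio estimates near 0.5 and three near 1.5, so the procedure separates them into two groups. A real analysis would also need to account for the uncertainty of each ratio estimate, which this sketch ignores.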
Zoe E. Reed, Robyn E. Wootton, Jasmine N. Khouja, Tom G. Richardson, Eleanor Sanderson, George Davey Smith, Marcus R. Munafò
Genetic variants used as instruments for exposures in Mendelian randomisation (MR) analyses may have horizontal pleiotropic effects (i.e., influence outcomes via pathways other than through the exposure), which can undermine the validity of results. We examined the extent of this using smoking behaviours as an example. We first ran a phenome-wide association study in UK Biobank, using a smoking initiation genetic instrument. From the most strongly associated phenotypes, we selected those we considered could either plausibly or not plausibly be caused by smoking. We examined associations between genetic instruments for smoking initiation, smoking heaviness and lifetime smoking and these phenotypes in UK Biobank and the Avon Longitudinal Study of Parents and Children (ALSPAC). We conducted negative control analyses among never smokers, including children. We found evidence that smoking-related genetic instruments were associated with phenotypes not plausibly caused by smoking in UK Biobank and (to a lesser extent) ALSPAC. We observed associations with phenotypes among never smokers. Our results demonstrate that smoking-related genetic risk scores are associated with unexpected phenotypes that are less plausibly downstream of smoking. This may reflect horizontal pleiotropy in these genetic risk scores, and we would encourage researchers to exercise caution when using these and genetic risk scores for other complex behavioural exposures. We outline approaches that could be taken to consider this and overcome issues caused by potential horizontal pleiotropy. For example, in genetically informed causal inference analyses (e.g., MR), it is important to consider negative control outcomes and triangulation approaches to avoid arriving at incorrect conclusions.
Exploring pleiotropy in Mendelian randomisation analyses: What are genetic variants associated with ‘cigarette smoking initiation’ really capturing? Genetic Epidemiology 49(1). Published 2024-08-04. DOI: 10.1002/gepi.22583. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7616876/pdf/
Chin Yang Shapland, Apostolos Gkatzionis, Gibran Hemani, Kate Tilling
Observational studies are rarely representative of their target population because there are known and unknown factors that affect an individual's choice to participate (the selection mechanism). Selection can cause bias in a given analysis if the outcome is related to selection (conditional on the other variables in the model). Detecting and adjusting for selection bias in practice typically requires access to data on nonselected individuals. Here, we propose methods to detect selection bias in genetic studies by comparing correlations among genetic variants in the selected sample to those expected under no selection. We examine the use of four hypothesis tests to identify induced associations between genetic variants in the selected sample. We evaluate these approaches in Monte Carlo simulations. Finally, we use these approaches in an applied example using data from the UK Biobank (UKBB). The proposed tests suggested an association between alcohol consumption and selection into UKBB. Hence, UKBB analyses with alcohol consumption as the exposure or outcome may be biased by this selection.
Use of genetic correlations to examine selection bias. Genetic Epidemiology 49(1). Published 2024-07-30. DOI: 10.1002/gepi.22584. PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/gepi.22584
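The mechanism behind this abstract, that selection on a trait can induce correlation between otherwise independent genetic variants, can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions and does not implement the four hypothesis tests the authors examine: two independent variants with a hypothetical allele frequency of 0.3 both raise a trait, and only individuals whose trait exceeds a hypothetical threshold are "selected" into the study.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def genotype(rng, maf):
    """Allele count (0, 1 or 2) for one variant with allele frequency `maf`."""
    return (rng.random() < maf) + (rng.random() < maf)


def simulate_selection(n=50000, maf=0.3, threshold=2.0, seed=1):
    """Simulate two independent variants and selection on a trait they both affect.

    All parameter values are hypothetical. Returns the variant-variant
    correlation in the full population, the correlation among selected
    individuals, and the number selected.
    """
    rng = random.Random(seed)
    g1 = [genotype(rng, maf) for _ in range(n)]
    g2 = [genotype(rng, maf) for _ in range(n)]
    # The trait depends on both variants plus independent noise.
    trait = [a + b + rng.gauss(0.0, 1.0) for a, b in zip(g1, g2)]
    # Only individuals with a trait value above the threshold take part.
    keep = [i for i, t in enumerate(trait) if t > threshold]
    r_full = pearson(g1, g2)
    r_selected = pearson([g1[i] for i in keep], [g2[i] for i in keep])
    return r_full, r_selected, len(keep)


r_full, r_selected, n_selected = simulate_selection()
print(f"full population: r = {r_full:.3f}")
print(f"selected sample (n = {n_selected}): r = {r_selected:.3f}")
```

In the full simulated population the two variants are essentially uncorrelated, whereas among selected individuals they become negatively correlated. That departure from the correlation expected under no selection is the kind of signal the proposed tests look for.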
Georg Hahn, Dmitry Prokopenko, Julian Hecker, Sharon M. Lutz, Kristina Mullin, Rudolph E. Tanzi, Stacia DeSantis, Christoph Lange
The prediction of the susceptibility of an individual to a certain disease is an important and timely research area. An established technique is to estimate the risk of an individual with the help of an integrated risk model, that is, a polygenic risk score with added epidemiological covariates. However, integrated risk models do not capture any time dependence, and may provide only a point estimate of the relative risk with respect to a reference population. The aim of this work is twofold. First, we explore and advocate the idea of predicting the time-dependent hazard and survival (defined as disease-free time) of an individual for the onset of a disease. This provides a practitioner with a much more differentiated view of absolute survival as a function of time. Second, to compute the time-dependent risk of an individual, we use published methodology to fit Cox's proportional hazards model to data from a genetic SNP study of time to Alzheimer's disease (AD) onset, using the lasso to incorporate further epidemiological variables such as sex, APOE (apolipoprotein E, a genetic risk factor for AD) status, 10 leading principal components, and selected genomic loci. We apply the lasso for Cox's proportional hazards model to a data set of 6792 individuals (4102 AD cases and 2690 controls) and 87 covariates. We demonstrate that fitting a lasso model for Cox's proportional hazards allows one to obtain more accurate survival curves than with state-of-the-art (likelihood-based) methods. Moreover, the methodology allows one to obtain personalized survival curves for a patient, thus giving a much more differentiated view of the expected progression of a disease than the view offered by integrated risk models. The runtime to compute personalized survival curves is under a minute for the entire data set, thus enabling the method to handle datasets with 60,000–100,000 subjects in less than 1 h.
Polygenic hazard score models for the prediction of Alzheimer's free survival using the lasso for Cox's proportional hazards model. Genetic Epidemiology 49(1). Published 2024-07-09. DOI: 10.1002/gepi.22581.
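As a rough illustration of what a personalized survival curve means under a Cox proportional hazards model, the sketch below uses the standard relationship S(t | x) = S0(t)^exp(x·β), where S0 is the baseline survival function. The baseline values, coefficient values and covariate names are invented for illustration and are not taken from the paper. In a lasso fit, some coefficients are shrunk to exactly zero, which the example mimics with a zero coefficient.

```python
import math
from typing import Dict, List


def personalized_survival(baseline_survival: List[float],
                          coefficients: Dict[str, float],
                          covariates: Dict[str, float]) -> List[float]:
    """Survival curve for one individual under a Cox proportional hazards model.

    Uses S(t | x) = S0(t) ** exp(x . beta). `baseline_survival` holds S0 at a
    grid of time points; covariates missing from `covariates` are treated as 0.
    """
    linear_predictor = sum(coef * covariates.get(name, 0.0)
                           for name, coef in coefficients.items())
    hazard_ratio = math.exp(linear_predictor)
    return [s0 ** hazard_ratio for s0 in baseline_survival]


# Hypothetical baseline disease-free survival at four time points.
baseline = [1.0, 0.95, 0.85, 0.70]
# Hypothetical fitted coefficients; "pc1" was shrunk to zero, as the lasso can do.
beta = {"apoe_e4_carrier": 0.7, "female": 0.1, "pc1": 0.0}

reference = personalized_survival(baseline, beta, {})
carrier = personalized_survival(baseline, beta, {"apoe_e4_carrier": 1.0, "pc1": 2.5})

for s_ref, s_car in zip(reference, carrier):
    print(f"reference: {s_ref:.3f}   carrier: {s_car:.3f}")
```

An individual with a higher linear predictor gets a curve that lies below the reference curve at every time point after baseline, which is the time-resolved view that a single relative-risk estimate does not give.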
Christina Nieuwoudt, Fabiha Binte Farooq, Angela Brooks-Wilson, Alexandre Bureau, Jinko Graham
Family-based sequencing studies are increasingly used to find rare genetic variants of high risk for disease traits with familial clustering. In some studies, families with multiple disease subtypes are collected and the exomes of affected relatives are sequenced for shared rare variants (RVs). Since different families can harbor different causal variants and each family harbors many RVs, tests to detect causal variants can have low power in this study design. Our goal is rather to prioritize shared variants for further investigation by, for example, pathway analyses or functional studies. The transmission-disequilibrium test prioritizes variants based on departures from Mendelian transmission in parent–child trios. Extending this idea to families, we propose methods to prioritize RVs shared in affected relatives with two disease subtypes, with one subtype more heritable than the other. Global approaches condition on a variant being observed in the study and assume a known probability of carrying a causal variant. In contrast, local approaches condition on a variant being observed in specific families to eliminate the carrier probability. Our simulation results indicate that global approaches are robust to misspecification of the carrier probability and prioritize more effectively than local approaches even when the carrier probability is misspecified.
Statistics to prioritize rare variants in family-based sequencing studies with disease subtypes. Genetic Epidemiology 48(7), pp. 324-343. Published 2024-06-28. DOI: 10.1002/gepi.22579. PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/gepi.22579
Brian D. Chen, Chanhwa Lee, Amanda L. Tapia, Alexander P. Reiner, Hua Tang, Charles Kooperberg, JoAnn E. Manson, Yun Li, Laura M. Raffield
In most Proteome-Wide Association Studies (PWAS), variants near the protein-coding gene (±1 Mb), also known as cis single nucleotide polymorphisms (SNPs), are used to predict protein levels, which are then tested for association with phenotypes. However, proteins can be regulated through variants outside of the cis region. An intermediate GWAS step to identify protein quantitative trait loci (pQTL) allows for the inclusion of trans SNPs outside the cis region in protein-level prediction models. Here, we assess the prediction of 540 proteins in 1002 individuals from the Women's Health Initiative (WHI), split equally into a GWAS set, an elastic net training set, and a testing set. We compared the testing r² between measured and predicted protein levels using this proposed approach to the testing r² using only cis SNPs. The two methods usually resulted in similar testing r², but some proteins showed a significant increase in testing r² with our method. For example, for cartilage acidic protein 1, the testing r² increased from 0.101 to 0.351. We also demonstrate reproducible findings for predicted protein association with lipid and blood cell traits in WHI participants without proteomics data and in UK Biobank utilizing our PWAS weights.
Proteome-wide association study using cis and trans variants and applied to blood cell and lipid-related traits in the Women's Health Initiative study. Genetic Epidemiology 48(7), pp. 310-323. Published 2024-06-28. DOI: 10.1002/gepi.22578.
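The evaluation design described in this abstract, an equal three-way split followed by a testing r² between measured and predicted protein levels, can be sketched in a few lines of Python. This is a hedged illustration only: it assumes the testing r² is the squared Pearson correlation, the function names and seed are hypothetical, and the GWAS and elastic net steps themselves are not implemented.

```python
import math
import random
from typing import List, Sequence, Tuple


def three_way_split(n: int, seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Randomly split indices 0..n-1 into three sets of (near-)equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    k = n // 3
    return idx[:k], idx[k:2 * k], idx[2 * k:]


def r_squared(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Squared Pearson correlation between measured and predicted values."""
    n = len(measured)
    mm, mp = sum(measured) / n, sum(predicted) / n
    smp = sum((a - mm) * (b - mp) for a, b in zip(measured, predicted))
    smm = sum((a - mm) ** 2 for a in measured)
    spp = sum((b - mp) ** 2 for b in predicted)
    return (smp / math.sqrt(smm * spp)) ** 2


# 1002 individuals split into a GWAS set, a training set and a testing set.
gwas_set, train_set, test_set = three_way_split(1002)
print(len(gwas_set), len(train_set), len(test_set))

# Made-up measured and predicted protein levels for five testing-set individuals.
measured = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [2.0, 1.0, 4.0, 3.0, 5.0]
print(round(r_squared(measured, predicted), 2))
```

Keeping the three sets disjoint matters because SNPs are chosen in the GWAS set and weights are fitted in the training set, so only the held-out testing set gives an r² that is not inflated by those earlier steps.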
Lai Jiang, Jiayi Shen, Burcu F. Darst, Christopher A. Haiman, Nicholas Mancuso, David V. Conti
Instrumental variable (IV) analysis has been widely applied in epidemiology to infer causal relationships using observational data. Genetic variants can also be viewed as valid IVs in Mendelian randomization and transcriptome-wide association studies. However, most multivariate IV approaches cannot scale to high-throughput experimental data. Here, we leverage the flexibility of our previous work, a hierarchical model that jointly analyzes marginal summary statistics (hJAM), to a scalable framework (SHA-JAM) that can be applied to a large number of intermediates and a large number of correlated genetic variants—situations often encountered in modern experiments leveraging omic technologies. SHA-JAM aims to estimate the conditional effect for high-dimensional risk factors on an outcome by incorporating estimates from association analyses of single-nucleotide polymorphism (SNP)-intermediate or SNP-gene expression as prior information in a hierarchical model. Results from extensive simulation studies demonstrate that SHA-JAM yields a higher area under the receiver operating characteristics curve (AUC), a lower mean-squared error of the estimates, and a much faster computation speed, compared to an existing approach for similar analyses. In two applied examples for prostate cancer, we investigated metabolite and transcriptome associations, respectively, using summary statistics from a GWAS for prostate cancer with more than 140,000 men and high dimensional publicly available summary data for metabolites and transcriptomes.
Hierarchical joint analysis of marginal summary statistics—Part II: High-dimensional instrumental analysis of omics data. Genetic Epidemiology 48(7), pp. 291-309. Published 2024-06-17. DOI: 10.1002/gepi.22577. PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/gepi.22577
Genome-wide association studies (GWAS) have been helpful in identifying genetic variants predicting cancer risk and providing new insights into cancer biology. Increasing use of genetically informed care, as well as genetically informed prevention and treatment strategies, has also drawn attention to some of the inherent limitations of cancer genetic data. Specifically, genetic endowment is lifelong. However, those recruited into cancer studies tend to be middle-aged or older people, meaning the exposure most likely starts before recruitment, as opposed to exposure and recruitment aligning, as in a trial or a target trial. Studies in survivors can be biased as a result of depletion of the susceptibles, here specifically due to genetic vulnerability and the cancer of interest or a competing risk. In addition, including prevalent cases in a case-control study will make the genetics of survival with cancer look harmful (Neyman bias). Here, we describe ways of designing GWAS to maximize explanatory power and predictive utility by reducing selection bias from recruiting only survivors and Neyman bias from including prevalent cases, alongside other techniques, such as selection diagrams, age stratification, and Mendelian randomization, that facilitate GWAS interpretability and utility.
Catherine Mary Schooling, Mary Beth Terry. Interpreting disease genome-wide association studies and polygenetic risk scores given eligibility and study design considerations. Genetic Epidemiology 48(8):468-472, published 2024-05-26. doi:10.1002/gepi.22567
Somatic changes like copy number aberrations (CNAs) and epigenetic alterations like methylation have pivotal effects on disease outcomes and prognosis in cancer by regulating the gene expression that drives critical biological processes. To identify potential biomarkers and molecular targets and understand how they impact disease outcomes, it is important to identify key groups of CNAs, the associated methylation, and the gene expressions they impact, through a joint integrative analysis. Here, we propose a novel analysis pipeline, the joint sparse canonical correlation analysis (jsCCA), an extension of sCCA, to effectively identify an ensemble of CNAs, methylation sites and gene (expression) components in the context of disease endpoints, especially tumor characteristics. Our approach detects potentially orthogonal gene components that are highly correlated with sets of methylation sites, which in turn are correlated with sets of CNA sites. It then identifies the genes within these components that are associated with the outcome. Further, we aggregate the effect of each gene expression set on tumor stage by constructing “gene component scores” and test their interaction with traditional risk factors. Analyzing clinical and genomic data on 515 renal clear cell carcinoma (ccRCC) patients from TCGA-KIRC, we found eight gene components to be associated with methylation sites, regulated by groups of proximally located CNA sites. Association analysis with tumor stage at diagnosis identified a novel association of expression of the ASAH1 gene, trans-regulated by methylation of several genes including SIX5 and by CNAs in the 10q25 region including TCF7L2. Further analysis to quantify the overall effect of gene sets on tumor stage revealed that two of the eight gene components have significant interaction with smoking in relation to tumor stage. These gene components represent distinct biological functions including immune function, inflammatory responses, and hypoxia-regulated pathways. Our findings suggest that jsCCA analysis can identify interpretable and important genes, regulatory structures, and clinically consequential pathways. Such methods are warranted for comprehensive analysis of multimodal data, especially in cancer genomics.
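To make the sCCA building block concrete, here is a minimal sketch of a single sparse canonical component between two data blocks, obtained by alternating soft-thresholding on the cross-correlation matrix. It is not the jsCCA pipeline: it handles only two blocks, extracts one component, and uses fixed thresholds (`lam_u`, `lam_v`) rather than tuned penalties. The function names and the simulated "methylation" and "expression" matrices are hypothetical.

```python
import numpy as np

def soft_threshold(a, lam):
    """Shrink each entry toward zero by lam; entries smaller than lam become 0."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_cca_rank1(X, Y, lam_u, lam_v, n_iter=100):
    """One pair of sparse canonical weight vectors (u for X, v for Y)."""
    X = (X - X.mean(0)) / X.std(0)          # standardize columns
    Y = (Y - Y.mean(0)) / Y.std(0)
    C = X.T @ Y / X.shape[0]                # cross-correlation matrix (p, q)
    v = np.linalg.svd(C, full_matrices=False)[2][0]   # dense starting point
    u = np.zeros(C.shape[0])
    for _ in range(n_iter):
        u = soft_threshold(C @ v, lam_u)
        if not u.any():                     # penalty too strong: nothing survives
            break
        u /= np.linalg.norm(u)
        v_new = soft_threshold(C.T @ u, lam_v)
        if not v_new.any():
            break
        v = v_new / np.linalg.norm(v_new)
    return u, v

# Toy data: one latent factor drives 3 columns of X and 2 columns of Y.
rng = np.random.default_rng(1)
n, p, q = 300, 20, 15
z = rng.normal(size=n)
X = rng.normal(size=(n, p))                 # stand-in for methylation sites
Y = rng.normal(size=(n, q))                 # stand-in for gene expression
X[:, :3] = z[:, None] + 0.5 * rng.normal(size=(n, 3))
Y[:, :2] = z[:, None] + 0.5 * rng.normal(size=(n, 2))

u, v = sparse_cca_rank1(X, Y, lam_u=0.3, lam_v=0.3)
print("selected X columns:", np.flatnonzero(u))
print("selected Y columns:", np.flatnonzero(v))
```

In the same spirit as the "gene component scores" mentioned in the abstract, a per-sample score for the selected columns could be formed as the standardized `Y` times `v`, though the abstract does not specify how the authors construct theirs.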
Diptavo Dutta, Ananda Sen, Jaya M. Satagopan. Identifying genes associated with disease outcomes using joint sparse canonical correlation analysis—An application in renal clear cell carcinoma. Genetic Epidemiology 48(8):414-432, published 2024-05-15. doi:10.1002/gepi.22566. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/gepi.22566