Charles Spanbauer, Wei Pan, ADNI, The Alzheimer's Disease Neuroimaging Initiative
Using high-dimensional genetic variants such as single nucleotide polymorphisms (SNPs) to predict complex diseases and traits has important applications in basic research and clinical settings. For example, predicting gene expression is a necessary first step to identify (putative) causal genes in transcriptome-wide association studies. Due to weak signals, high dimensionality, and linkage disequilibrium (correlation) among SNPs, building such a prediction model is challenging. However, functional annotations at the SNP level (e.g., epigenomic data across multiple cell or tissue types) are available and could be used to inform predictor importance and aid in outcome prediction. Existing approaches to incorporate annotations have been based mainly on (generalized) linear models. Bayesian additive regression trees (BART), in contrast, is a reliable method to obtain high-quality nonlinear out-of-sample predictions without overfitting. Unfortunately, the default prior from BART may be too inflexible to handle sparse situations where the number of predictors approaches or surpasses the number of observations. Motivated by our real data application, this article proposes an alternative prior based on the logit normal distribution because it provides a framework that is adaptive to sparsity and can model informative functional annotations. It also provides a framework to incorporate prior information about the between-SNP correlations. Computational details for carrying out inference are presented along with the results from a simulation study and a genome-wide prediction analysis of the Alzheimer's Disease Neuroimaging Initiative data.
Sparse prediction informed by genetic annotations using the logit normal prior for Bayesian regression tree ensembles. Genetic Epidemiology, vol. 47, no. 1, pp. 26-44. Pub Date: 2022-11-09. DOI: 10.1002/gepi.22505
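The abstract above does not give the parameterization of the logit normal prior, so the following is only a minimal sketch under stated assumptions. It assumes the prior is placed on the probabilities with which predictors are chosen for tree splits, and it uses one common construction of a logit-normal vector on the simplex: Gaussian draws passed through a softmax. The annotation shift of 3.0, the function names, and the toy dimensions are all hypothetical.

```python
import math
import random


def logit_normal_probs(means, sd, rng):
    """Draw one probability vector by pushing Gaussian draws through a softmax."""
    z = [rng.gauss(m, sd) for m in means]
    z_max = max(z)  # subtract the max for numerical stability
    w = [math.exp(v - z_max) for v in z]
    total = sum(w)
    return [v / total for v in w]


def average_probs(means, sd, n_draws=2000, seed=1):
    """Monte Carlo average of the drawn probability vectors."""
    rng = random.Random(seed)
    acc = [0.0] * len(means)
    for _ in range(n_draws):
        for j, p in enumerate(logit_normal_probs(means, sd, rng)):
            acc[j] += p
    return [a / n_draws for a in acc]


# Hypothetical setup: 10 SNPs, the first two carry a functional annotation
# that shifts their prior mean upward by an assumed amount (3.0).
annotation = [1, 1] + [0] * 8
means = [3.0 * a for a in annotation]
avg = average_probs(means, sd=1.0)
```

Under these assumptions the annotated SNPs receive a larger share of the prior probability on average, which is the sense in which annotations can "inform predictor importance"; a larger `sd` concentrates each draw on fewer predictors, which is one way such a prior can favor sparsity.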
Apostolos Gkatzionis, Stephen Burgess, Paul J. Newcombe
Mendelian randomization (MR) is the use of genetic variants to assess the existence of a causal relationship between a risk factor and an outcome of interest. Here, we focus on two-sample summary-data MR analyses with many correlated variants from a single gene region, particularly on cis-MR studies which use protein expression as a risk factor. Such studies must rely on a small, curated set of variants from the studied region; using all variants in the region requires inverting an ill-conditioned genetic correlation matrix and results in numerically unstable causal effect estimates. We review methods for variable selection and estimation in cis-MR with summary-level data, ranging from stepwise pruning and conditional analysis to principal components analysis, factor analysis, and Bayesian variable selection. In a simulation study, we show that the various methods have comparable performance in analyses with large sample sizes and strong genetic instruments. However, when weak instrument bias is suspected, factor analysis and Bayesian variable selection produce more reliable inferences than simple pruning approaches, which are often used in practice. We conclude by examining two case studies, assessing the effects of low-density lipoprotein-cholesterol and serum testosterone on coronary heart disease risk using variants in the HMGCR and SHBG gene regions, respectively.
Statistical methods for cis-Mendelian randomization with two-sample summary-level data. Genetic Epidemiology, vol. 47, no. 1, pp. 3-25. Pub Date: 2022-10-23. DOI: 10.1002/gepi.22506
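The numerical instability described in the abstract above can be illustrated with two variants. The sketch below assumes the standard correlation-aware inverse-variance weighted (IVW) estimator from the MR literature, theta = (bx' W by) / (bx' W bx) with W the inverse of Omega and Omega[j][k] = se_y[j] * se_y[k] * rho[j][k]; it is not claimed to be any specific method reviewed in the paper. All numeric inputs are hypothetical.

```python
def condition_number_2x2(r):
    """Condition number of the correlation matrix [[1, r], [r, 1]].

    Its eigenvalues are 1 + |r| and 1 - |r|, so the ratio explodes as |r| -> 1.
    """
    return (1 + abs(r)) / (1 - abs(r))


def ivw_two_correlated(bx, by, se_y, r):
    """Correlation-aware IVW causal estimate for two variants."""
    a = se_y[0] ** 2
    d = se_y[1] ** 2
    b = r * se_y[0] * se_y[1]
    det = a * d - b * b  # shrinks toward 0 as |r| -> 1
    # Explicit inverse of the 2x2 matrix [[a, b], [b, d]]
    w = [[d / det, -b / det], [-b / det, a / det]]
    num = sum(bx[j] * w[j][k] * by[k] for j in range(2) for k in range(2))
    den = sum(bx[j] * w[j][k] * bx[k] for j in range(2) for k in range(2))
    return num / den


# Hypothetical summary statistics: outcome effects are exactly 0.5 times
# the risk-factor effects, so the estimator should recover 0.5.
bx = [0.20, 0.18]
by = [0.5 * b for b in bx]
theta = ivw_two_correlated(bx, by, se_y=[0.02, 0.02], r=0.9)
conds = [condition_number_2x2(r) for r in (0.5, 0.9, 0.99)]
```

With noise-free inputs the estimate is stable, but the condition number grows from 3 at r = 0.5 to roughly 199 at r = 0.99, so small errors in the estimated correlations are amplified when highly correlated variants are all retained. This is the motivation for pruning or dimension reduction before estimation.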
John Kidd, Chelsea K. Raulerson, Karen L. Mohlke, Dan-Yu Lin
There is increasing interest in using multiple types of omics features (e.g., DNA sequences, RNA expression, methylation, protein expression, and metabolic profiles) to study how the relationships between phenotypes and genotypes may be mediated by other omics markers. Genotypes and phenotypes are typically available for all subjects in genetic studies, but some omics data are often missing for some subjects because of limitations such as cost and sample quality. In this article, we propose a powerful approach for mediation analysis that accommodates missing data among multiple mediators and allows for various interaction effects. We formulate the relationships among genetic variants, other omics measurements, and phenotypes through linear regression models. We derive the joint likelihood for models with two mediators, accounting for arbitrary patterns of missing values. Utilizing computationally efficient and stable algorithms, we conduct maximum likelihood estimation. Our methods produce unbiased and statistically efficient estimators. We demonstrate the usefulness of our methods through simulation studies and an application to the Metabolic Syndrome in Men study.
Mediation analysis of multiple mediators with incomplete omics data. Genetic Epidemiology, vol. 47, no. 1, pp. 61-77. Pub Date: 2022-09-20. DOI: 10.1002/gepi.22504
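For orientation, a textbook two-mediator linear model is written out below. This is an assumed, simplified form (parallel mediators, no interaction terms, no path between the mediators) and is not taken from the paper, which according to the abstract above also allows for interaction effects. The symbols G (genotype), M_1 and M_2 (omics mediators), and Y (phenotype) are notation chosen here for illustration.

```latex
\begin{align}
M_1 &= \alpha_{10} + \alpha_{11} G + \varepsilon_1, \\
M_2 &= \alpha_{20} + \alpha_{21} G + \varepsilon_2, \\
Y   &= \beta_0 + \beta_G G + \beta_1 M_1 + \beta_2 M_2 + \varepsilon_Y .
\end{align}
% Substituting the mediator models into the outcome model gives
% total effect of G on Y = \beta_G + \alpha_{11}\beta_1 + \alpha_{21}\beta_2,
% i.e. the direct effect plus one indirect effect per mediator.
```

In this simplified setting, a subject with M_1 or M_2 missing still contributes to the likelihood through the equations that involve only observed quantities, which is the intuition behind a joint-likelihood treatment of incomplete mediators.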
Li Hsu, Anna Kooperberg, Alexander P. Reiner, Charles Kooperberg
Populations of non-European ancestry are substantially underrepresented in genome-wide association studies (GWAS). As genetic effects can differ between ancestries due to possibly different causal variants or linkage disequilibrium patterns, a meta-analysis that includes GWAS of all populations yields biased estimation in each of the populations and the bias disproportionately impacts non-European ancestry populations. This is because meta-analysis combines study-specific estimates with inverse variance as the weights, which causes biases towards studies with the largest sample size, typical of the European ancestry population. In this paper, we propose two empirical Bayes (EB) estimators to borrow the strength of information across populations while accounting for between-population heterogeneity. Extensive simulation studies show that the proposed EB estimators are largely unbiased and improve efficiency compared to the population-specific estimator. In contrast, even though the meta-analysis estimator has a much smaller variance, it yields significant bias when the genetic effect is heterogeneous across populations. We apply the proposed EB estimators to a large-scale trans-ancestry GWAS of stroke and demonstrate that the EB estimators reduce the variance of the population-specific estimator substantially, with the effect estimates close to the population-specific estimates.
An empirical Bayes approach to improving population-specific genetic association estimation by leveraging cross-population data. Genetic Epidemiology, vol. 47, no. 1, pp. 45-60. Pub Date: 2022-09-18. DOI: 10.1002/gepi.22501
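The pull of inverse-variance weighting toward the largest study, described in the abstract above, can be seen in a small sketch of the standard fixed-effect meta-analysis formula. The effect sizes and standard errors are hypothetical, and the paper's EB estimators are not reproduced here because the abstract does not define them.

```python
import math


def inverse_variance_meta(betas, ses):
    """Fixed-effect inverse-variance-weighted estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    return beta, math.sqrt(1.0 / total)


# Hypothetical inputs: a large study (small SE) with effect 0.10 and a
# smaller study (larger SE) whose true effect, 0.30, is different.
betas = [0.10, 0.30]
ses = [0.01, 0.05]
beta_meta, se_meta = inverse_variance_meta(betas, ses)
```

Here the pooled estimate is about 0.108, far closer to the large study's 0.10 than to the smaller study's 0.30, and its standard error is smaller than either input. If the two populations truly differ, the pooled value is a precise but biased estimate for the smaller population.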
Pub Date: 2022-09-09. DOI: 10.1101/2022.09.08.22279720
Jerry Z. Zhang, L. W. Heinsberg, Mohanraj Krishnan, N. Hawley, Tanya J. Major, J. Carlson, J. Harré Hindmarsh, H. Watson, Muhammad Qasim, L. Stamp, N. Dalbeth, R. Murphy, Guangyun Sun, Hong Cheng, T. Naseri, M. Reupena, E. Kershaw, R. Deka, S. McGarvey, R. Minster, T. Merriman, D. Weeks
The minor allele of rs373863828, a missense variant in CREB3 Regulatory Factor, is associated with several cardiometabolic phenotypes in Polynesian peoples. To better understand the variant, we tested the association of rs373863828 with a panel of correlated phenotypes (body mass index [BMI], weight, height, HDL cholesterol, triglycerides, and total cholesterol) using multivariate Bayesian association and network analyses in a Samoa cohort (n = 1632), Aotearoa New Zealand cohort (n = 1419), and combined cohort (n = 2976). An expanded set of phenotypes (adding estimated fat and fat‐free mass, abdominal circumference, hip circumference, and abdominal‐hip ratio) was tested in the Samoa cohort (n = 1496). In the Samoa cohort, we observed significant associations (log10 Bayes Factor [BF] ≥ 5.0) between rs373863828 and the overall phenotype panel (8.81), weight (8.30), and BMI (6.42). In the Aotearoa New Zealand cohort, we observed suggestive associations (1.5 < log10BF < 5) between rs373863828 and the overall phenotype panel (4.60), weight (3.27), and BMI (1.80). In the combined cohort, we observed concordant signals with larger log10BFs. In the Samoa‐specific expanded phenotype analyses, we also observed significant associations between rs373863828 and fat mass (5.65), abdominal circumference (5.34), and hip circumference (5.09). Bayesian networks provided evidence for a direct association of rs373863828 with weight and indirect associations with height and BMI.
Multivariate analysis of a missense variant in CREBRF reveals associations with measures of adiposity in people of Polynesian ancestries. Genetic Epidemiology, vol. 47, no. 1, pp. 105-118.
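The evidence thresholds quoted in the abstract above (significant: log10 BF ≥ 5.0; suggestive: 1.5 < log10 BF < 5) can be written as a small classifier. The log10 BF values are the ones reported in the abstract; the function name and the label for values at or below 1.5 are assumptions, since the abstract does not name that category.

```python
def classify_log10_bf(log10_bf):
    """Label a log10 Bayes factor using the thresholds quoted in the abstract."""
    if log10_bf >= 5.0:
        return "significant"
    if 1.5 < log10_bf < 5.0:
        return "suggestive"
    return "below suggestive threshold"  # label assumed, not from the abstract


# log10 BF values as reported in the abstract
reported = {
    "Samoa": {"panel": 8.81, "weight": 8.30, "BMI": 6.42},
    "Aotearoa New Zealand": {"panel": 4.60, "weight": 3.27, "BMI": 1.80},
}
labels = {
    cohort: {trait: classify_log10_bf(v) for trait, v in traits.items()}
    for cohort, traits in reported.items()
}
```

Applied to the reported values, every Samoa entry is labeled significant and every Aotearoa New Zealand entry suggestive, matching the wording of the abstract.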
Payman Nickchi, Charith Karunarathna, Jinko Graham
Linkage analysis maps genetic loci for a heritable trait by identifying genomic regions with excess relatedness among individuals with similar trait values. Analysis may be conducted on related individuals from families, or on samples of unrelated individuals from a population. For allelically heterogeneous traits, population-based linkage analysis can be more powerful than genotypic-association analysis. Here, we focus on linkage analysis in a population sample, but use sequences rather than individuals as our unit of observation. Earlier investigations of sequence-based linkage mapping relied on known sequence relatedness, whereas we infer relatedness from the sequence data. We propose two ways to associate similarity in relatedness of sequences with similarity in their trait values and compare the resulting linkage methods to two genotypic-association methods. We also introduce a procedure to label case sequences as potential carriers or noncarriers of causal variants after an association has been found. This post hoc labeling of case sequences is based on inferred relatedness to other case sequences. Our simulation results indicate that methods based on sequence relatedness improve localization and perform as well as genotypic-association methods for detecting rare causal variants. Sequence-based linkage analysis therefore has potential to fine-map allelically heterogeneous disease traits.
An exploration of linkage fine-mapping on sequences from case-control studies. Genetic Epidemiology, vol. 47, no. 1, pp. 78-94. Pub Date: 2022-09-01. DOI: 10.1002/gepi.22502
Carlo Maj, Christian Staerk, Oleg Borisov, Hannah Klinkhammer, Ming Wai Yeung, Peter Krawitz, Andreas Mayr
Polygenic risk scores quantify the individual genetic predisposition regarding a particular trait. We propose and illustrate the application of existing statistical learning methods to derive sparser models for genome-wide data with a polygenic signal. Our approach is based on three consecutive steps. First, potentially informative loci are identified by a marginal screening approach. Then, fine-mapping is independently applied for blocks of variants in linkage disequilibrium, where informative variants are retrieved by using variable selection methods including boosting with probing and stochastic searches with the Adaptive Subspace method. Finally, joint prediction models with the selected variants are derived using statistical boosting. In contrast to alternative approaches relying on univariate summary statistics from genome-wide association studies, our three-step approach enables the selection and fitting of multivariable regression models on large-scale genotype data. Based on UK Biobank data, we develop prediction models for LDL-cholesterol as a continuous trait. Additionally, we consider a recent scalable algorithm for the Lasso. Results show that statistical learning approaches based on fine-mapping of genetic signals result in a competitive prediction performance compared to classical polygenic risk approaches, while yielding sparser risk models.
Statistical learning for sparser fine-mapped polygenic models: The prediction of LDL-cholesterol. Genetic Epidemiology, vol. 46, no. 8, pp. 589-603. Pub Date: 2022-08-08. DOI: 10.1002/gepi.22495
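Only the first of the three steps in the abstract above, marginal screening, is sketched here. The abstract does not state which screening statistic is used, so the absolute Pearson correlation between each variant and the phenotype is an assumption, as are the function names, variant names, and toy data. The fine-mapping and boosting steps are not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def marginal_screen(genotypes, phenotype, keep):
    """Rank variants one at a time by |correlation| and keep the top `keep`."""
    scores = {name: abs(pearson(g, phenotype)) for name, g in genotypes.items()}
    return sorted(scores, key=scores.get, reverse=True)[:keep]


# Toy data: allele counts (0/1/2) for six individuals; the phenotype
# was made up to track snp_a closely.
genotypes = {
    "snp_a": [0, 1, 2, 0, 1, 2],
    "snp_b": [1, 1, 0, 2, 0, 1],
    "snp_c": [0, 0, 1, 1, 2, 2],
}
phenotype = [1.0, 2.1, 2.9, 1.2, 1.9, 3.1]
selected = marginal_screen(genotypes, phenotype, keep=2)
```

Screening each variant separately is cheap enough for genome-wide data, which is why it is a natural first filter before the multivariable selection steps that follow it.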
James J. Fryett, Andrew P. Morris, Heather J. Cordell
As popularised by PrediXcan (and related methods), transcriptome-wide association studies (TWAS), in which gene expression is imputed from single-nucleotide polymorphism (SNP) genotypes and tested for association with a phenotype, are a widely used approach for investigating the role of gene expression in complex traits. Like gene expression, DNA methylation is an important biological process and, being under genetic regulation, may be imputable from SNP genotypes. Here, we investigate prediction of CpG methylation levels from SNP genotype data to help elucidate relationships between methylation, gene expression and complex traits. We start by examining how well CpG methylation can be predicted from SNP genotypes, comparing three penalised regression approaches and examining whether changing the window size improves prediction accuracy. Although methylation at most CpG sites cannot be accurately predicted from SNP genotypes, for a subset it can be predicted well. We next apply our methylation prediction models (trained using the optimal method and window size) to carry out a methylome-wide association study (MWAS) of primary biliary cholangitis. We intersect the regions identified via MWAS with those identified via TWAS, providing insight into the interplay between CpG methylation, gene expression and disease status. We conclude that MWAS has the potential to improve understanding of biological mechanisms in complex traits.
Investigating the prediction of CpG methylation levels from SNP genotype data to help elucidate relationships between methylation, gene expression and complex traits. Genetic Epidemiology, vol. 46, no. 8, pp. 629-643. Pub Date: 2022-08-05. DOI: 10.1002/gepi.22496
Alexa A. Woodward, Ryan J. Urbanowicz, Adam C. Naj, Jason H. Moore
Genetic heterogeneity describes the occurrence of the same or similar phenotypes through different genetic mechanisms in different individuals. Robustly characterizing and accounting for genetic heterogeneity is crucial to pursuing the goals of precision medicine, for discovering novel disease biomarkers, and for identifying targets for treatments. Failure to account for genetic heterogeneity may lead to missed associations and incorrect inferences. Thus, it is critical to review the impact of genetic heterogeneity on the design and analysis of population-level genetic studies, aspects that are often overlooked in the literature. In this review, we first contextualize our approach to genetic heterogeneity by proposing a high-level categorization of heterogeneity into “feature,” “outcome,” and “associative” heterogeneity, drawing on perspectives from epidemiology and machine learning to illustrate distinctions between them. We highlight the unique nature of genetic heterogeneity as a heterogeneous pattern of association that warrants specific methodological considerations. We then focus on the challenges that preclude effective detection and characterization of genetic heterogeneity across a variety of epidemiological contexts. Finally, we discuss systems heterogeneity as an integrated approach to using genetic and other high-dimensional multi-omic data in complex disease research.
{"title":"Genetic heterogeneity: Challenges, impacts, and methods through an associative lens","authors":"Alexa A. Woodward, Ryan J. Urbanowicz, Adam C. Naj, Jason H. Moore","doi":"10.1002/gepi.22497","DOIUrl":"10.1002/gepi.22497","PeriodicalId":12710,"journal":{"name":"Genetic Epidemiology","volume":"46 8","pages":"555-571"},"PeriodicalIF":2.1,"publicationDate":"2022-08-04","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/cf/6a/GEPI-46-555.PMC9669229.pdf","platform":"Semanticscholar","paperid":"10484080","RegionNum":4,"RegionCategory":"Medicine","PMCID":"OA","Total":0}
Amke Caliebe, Fasil Tekola-Ayele, Burcu F. Darst, Xuexia Wang, Yeunjoo E. Song, Jiang Gui, Ronnie A. Sebro, David J. Balding, Mohamad Saad, Marie-Pierre Dubé, IGES ELSI Committee
The inclusion of ancestrally diverse participants in genetic studies can lead to new discoveries and is important to ensure equitable health care benefit from research advances. Here, members of the Ethical, Legal, Social, Implications (ELSI) committee of the International Genetic Epidemiology Society (IGES) offer perspectives on methods and analysis tools for the conduct of inclusive genetic epidemiology research, with a focus on admixed and ancestrally diverse populations in support of reproducible research practices. We emphasize the importance of distinguishing socially defined population categorizations from genetic ancestry in the design, analysis, reporting, and interpretation of genetic epidemiology research findings. Finally, we discuss the current state of genomic resources used in genetic association studies, functional interpretation, and clinical and public health translation of genomic findings with respect to diverse populations.
{"title":"Including diverse and admixed populations in genetic epidemiology research","authors":"Amke Caliebe, Fasil Tekola-Ayele, Burcu F. Darst, Xuexia Wang, Yeunjoo E. Song, Jiang Gui, Ronnie A. Sebro, David J. Balding, Mohamad Saad, Marie-Pierre Dubé, IGES ELSI Committee","doi":"10.1002/gepi.22492","DOIUrl":"10.1002/gepi.22492","PeriodicalId":12710,"journal":{"name":"Genetic Epidemiology","volume":"46 7","pages":"347-371"},"PeriodicalIF":2.1,"publicationDate":"2022-07-16","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9452464/pdf/","platform":"Semanticscholar","paperid":"9448066","RegionNum":4,"RegionCategory":"Medicine","PMCID":"OA","Total":0}