Type 1 diabetes (T1D) has a large genetic component, and expanded genetic studies of T1D can lead to novel biological and therapeutic discovery and improved risk prediction. In this study, we performed genetic association and fine-mapping analyses in 817,718 European ancestry samples genome-wide and 29,746 samples at the MHC locus, which identified 165 independent risk signals for T1D of which 19 were novel. We used risk variants to train a machine learning model (named T1GRS) to predict T1D, which highly differentiated T1D from non-disease and type 2 diabetes (T2D) in Europeans as well as African Americans at or beyond the level of current standards. We identified extensive non-linear interactions between risk loci in T1GRS, for example between HLA-DQB1*57 and INS, coding and non-coding HLA alleles, and DEXI, INS and other beta cell loci, that provided mechanistic insight and improved risk prediction. T1D individuals formed distinct clusters based on genetic features from T1GRS which had significant differences in age of onset, HbA1c, and renal disease severity. Finally, we provided T1GRS in formats to enhance accessibility of risk prediction to any user and computing environment. Overall, the improved genetic discovery and prediction of T1D will have wide clinical, therapeutic, and research applications.
Title: Genetic association and machine learning improves discovery and prediction of type 1 diabetes. Authors: Carolyn McGrail, Timothy J Sears, Parul Kudtarkar, Hannah Carter, Kyle J Gaulton. Journal: medRxiv - Genetic and Genomic Medicine. Pub Date: 2024-08-02; DOI: 10.1101/2024.07.31.24311310
Pub Date: 2024-08-02; DOI: 10.1101/2024.07.31.24311311
Okan B Ozdemir, Ruining Chen, Ruowang Li
Genome-wide association studies (GWAS) of various heritable human traits and diseases have identified numerous associated single nucleotide polymorphisms (SNPs), most of which have small or modest effects. Polygenic risk scores (PRS) aim to better estimate individuals' genetic predisposition by aggregating the effects of multiple SNPs from GWAS. However, current PRS are designed to capture only simple linear genetic effects across the genome, which limits their ability to fully account for complex polygenic architecture. To address this, we propose DeepEnsembleEncodeNet (DEEN), a new method that ensembles autoencoders and fully connected neural networks (FCNNs) to better identify and model linear and non-linear SNP effects across different genomic regions, improving disease risk prediction. To demonstrate DEEN's performance, we optimized the model across binary and continuous traits from the UK Biobank (UKBB). Model evaluation on the held-out UKBB testing dataset, as well as the independent All of Us (AoU) dataset, showed improved prediction and risk stratification, consistently outperforming other methods.
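For readers unfamiliar with the linear baseline that the abstract contrasts DEEN against, a conventional PRS can be sketched as a weighted sum of effect-allele dosages. This is a minimal illustration only, not the DEEN method: the SNP identifiers, effect sizes, and the function name `polygenic_risk_score` are hypothetical, and treating a missing genotype as dosage 0 is a simplifying assumption.

```python
# Hypothetical per-SNP effect sizes (e.g. log odds ratios from a GWAS).
# These identifiers and values are illustrative, not taken from the paper.
weights = {"rs0001": 0.12, "rs0002": -0.05, "rs0003": 0.30}


def polygenic_risk_score(dosages, weights):
    """Return the weighted sum of effect-allele dosages (0, 1 or 2).

    SNPs missing from `dosages` are treated as dosage 0, which is a
    simplification; real pipelines usually impute or mean-fill instead.
    """
    return sum(beta * dosages.get(snp, 0) for snp, beta in weights.items())


# One hypothetical individual's genotype dosages.
person = {"rs0001": 2, "rs0002": 1, "rs0003": 0}
score = polygenic_risk_score(person, weights)
print(f"PRS = {score:.2f}")
```

Because each SNP contributes independently and additively, a score of this form cannot represent interactions between variants, which is the limitation the abstract describes.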
Title: A Deep Ensemble Encoder Network Method for Improved Polygenic Risk Score Prediction. Journal: medRxiv - Genetic and Genomic Medicine.
Pub Date: 2024-08-01; DOI: 10.1101/2024.07.30.24311226
Junyoung Park, Andrés Peña-Tauber, Lia Talozzi, Michael D. Greicius, Yann Le Guen
Human lifespan is shaped by both genetic and environmental exposures and their interaction. To enable precision health, it is essential to understand how genetic variants contribute to earlier death or prolonged survival. In this study, we tested the association of common genetic variants and the burden of rare non-synonymous variants in a survival analysis, using age-at-death (N = 35,551, median [min, max] = 72.4 [40.9, 85.2]), and last-known-age (N = 358,282, median [min, max] = 71.9 [52.6, 88.7]), in European ancestry participants of the UK Biobank. The associations we identified seemed predominantly driven by cancer, likely due to the age range of the cohort. Common variant analysis highlighted three longevity-associated loci: APOE, ZSCAN23, and MUC5B. We identified six genes whose burden of loss-of-function variants is significantly associated with reduced lifespan: TET2, ATM, BRCA2, CKMT1B, BRCA1 and ASXL1. Additionally, in eight genes, the burden of pathogenic missense variants was associated with reduced lifespan: DNMT3A, SF3B1, CHL1, TET2, PTEN, SOX21, TP53 and SRSF2. Most of these genes have previously been linked to oncogenic pathways, and some are known to harbor somatic variants that predispose to clonal hematopoiesis. A direction-agnostic (SKAT-O) approach additionally identified significant associations with C1orf52, TERT, IDH2, and RLIM, highlighting a link between telomerase function and longevity as well as identifying additional oncogenic genes. Our results emphasize the importance of understanding genetic factors driving the most prevalent causes of mortality at a population level, highlighting the potential of early genetic testing to identify germline and somatic variants increasing one's susceptibility to cancer and/or early death.
Title: Genetic associations with human longevity are enriched for oncogenic genes. Journal: medRxiv - Genetic and Genomic Medicine.
Pub Date: 2024-08-01; DOI: 10.1101/2024.07.30.24311241
Marcela A Johnson, Liping Hou, Bevan Emma Huang, Assieh Saadatpour, Abolfazl Doostparast Torshizi
Identifying genetic variants associated with lung cancer (LC) risk and their impact on plasma protein levels is crucial for understanding LC predisposition. The discovery of risk biomarkers can enhance early LC screening protocols and improve prognostic interventions. In this study, we performed a genome-wide association analysis using the UK Biobank and FinnGen. We identified genetic variants associated with LC and protein levels leveraging the UK Biobank Pharma Proteomics Project. The dysregulated proteins were then analyzed in pre-symptomatic LC cases compared to healthy controls, followed by training machine learning models to predict future LC diagnosis. We achieved median AUCs ranging from 0.79 to 0.88 (0-4 years before diagnosis/YBD), 0.73 to 0.83 (5-9YBD), and 0.78 to 0.84 (0-9YBD) based on 5-fold cross-validation. Conducting survival analysis using the 5-9YBD cohort, we identified eight proteins, including CALCB, PLAUR/uPAR, and CD74, whose higher levels were associated with worse overall survival. We also identified potential plasma biomarkers, including previously reported candidates such as CEACAM5, CXCL17, GDF15, and WFDC2, which have shown associations with future LC diagnosis. These proteins are enriched in various pathways, including cytokine signaling, interleukin regulation, neutrophil degranulation, and lung fibrosis. In conclusion, this study generates novel insights into our understanding of the genome-proteome dynamics in LC. Furthermore, our findings present a promising panel of non-invasive plasma biomarkers that hold potential to support early LC screening initiatives and enhance future diagnostic interventions.
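As a reminder of what the reported AUC values measure, the sketch below computes an AUC as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control. The labels, scores, and the function name `roc_auc` are hypothetical and chosen only for illustration; in a 5-fold cross-validation such as the one described, a value of this kind would be computed on each held-out fold and then summarized, for example by the median.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of cases (label 1) and controls (label 0).

    Each case/control pair counts 1 if the case scores higher, 0.5 on a tie
    and 0 otherwise; the AUC is the mean over all such pairs.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


# Hypothetical held-out fold: 3 future cases, 4 controls.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise form is quadratic in sample size, so it suits small illustrations; rank-based implementations give the same value more efficiently on large cohorts.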
Title: Machine learning-based proteogenomic data modeling identifies circulating plasma biomarkers for early detection of lung cancer. Journal: medRxiv - Genetic and Genomic Medicine.
Pub Date: 2024-08-01; DOI: 10.1101/2024.07.31.24311250
Vincent-Raphael Bourque, Zoe Schmilovich, Guillaume Huguet, Jade England, Adeniran Okewole, Cecile Poulain, Thomas Renne, Martineau Jean-Louis, Zohra Saci, Xinhe Zhang, Thomas Rolland, Aurelie Labbe, Jacob Vorstman, Guy Rouleau, Simon Baron-Cohen, Laurent Mottron, Richard A.I. Bethlehem, Varun Warrier, Sebastien Jacquemont
Although the first signs of autism are often observed as early as 18-36 months of age, there is a broad uncertainty regarding future development, and clinicians lack predictive tools to identify those who will later be diagnosed with co-occurring intellectual disability (ID). Here, we developed predictive models of ID in autistic children (n=5,633 from three cohorts), integrating different classes of genetic variants alongside developmental milestones. The integrated model yielded an AUC ROC=0.65, with this predictive performance cross-validated and generalised across cohorts. Positive predictive values reached up to 55%, accurately identifying 10% of ID cases. The ability to stratify the probabilities of ID using genetic variants was up to twofold greater in individuals with delayed milestones compared to those with typical development. These findings underscore the potential of models in neurodevelopmental medicine that integrate genomics and clinical observations to predict outcomes and target interventions.
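The pairing of a 55% positive predictive value with 10% of ID cases identified describes two different ratios over the same set of predictions. The sketch below uses invented counts, picked only so the arithmetic reproduces those two percentages; they are not data from the study, and the function names are illustrative.

```python
def positive_predictive_value(tp, fp):
    """Fraction of positive predictions that are true cases."""
    return tp / (tp + fp)


def sensitivity(tp, fn):
    """Fraction of all true cases that the model flags as positive."""
    return tp / (tp + fn)


# Hypothetical counts: 100 children flagged at a strict threshold,
# 55 of whom have ID, out of 550 children with ID in total.
tp, fp, fn = 55, 45, 495

ppv = positive_predictive_value(tp, fp)
sens = sensitivity(tp, fn)
print(f"PPV = {ppv:.2f}, sensitivity = {sens:.2f}")
```

Under these assumed counts, a strict threshold yields few flagged children, most of whom are true cases, while the majority of cases remain unflagged.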
Title: Integrating genomic variants and developmental milestones to predict cognitive and adaptive outcomes in autistic children. Journal: medRxiv - Genetic and Genomic Medicine.
Pub Date: 2024-07-31; DOI: 10.1101/2024.07.29.24311189
Lindsay A Guare, Jagyashila Das, Lannawill Caruth, Shefali Setia Verma
Women's health conditions are influenced by both genetic and environmental factors. Understanding these factors individually and their interactions is crucial for implementing preventative, personalized medicine. However, since genetics and environmental exposures, particularly social determinants of health (SDoH), are correlated with race and ancestry, risk models without careful consideration of these measures can exacerbate health disparities. We focused on seven women's health disorders in the All of Us Research Program: breast cancer, cervical cancer, endometriosis, ovarian cancer, preeclampsia, uterine cancer, and uterine fibroids. We computed polygenic risk scores (PRSs) from publicly available weights and tested the effect of the PRSs on their respective phenotypes as well as any effects of genetic risk on age at diagnosis. We next tested the effects of environmental risk factors (BMI, lifestyle measures, and SDoH) on age at diagnosis. Finally, we examined the impact of environmental exposures in modulating genetic risk by stratified logistic regressions for different tertiles of the environment variables, comparing the effect size of the PRS. Of the twelve sets of weights for the seven conditions, nine were significantly and positively associated with their respective phenotypes. None of the PRSs was associated with age at diagnosis in the time-to-event analyses. The highest environmental risk group tended to be diagnosed earlier than the low and medium-risk groups. For example, cases of breast cancer, ovarian cancer, uterine cancer, and uterine fibroids in the highest BMI tertile were diagnosed significantly earlier than those in the low and medium BMI groups. PRS regression coefficients were often the largest in the highest environmental risk groups, showing increased susceptibility to genetic risk.
This study's strengths include the diversity of the All of Us study cohort, the consideration of SDoH themes, and the examination of key risk factors and their interrelationships. These elements collectively underscore the importance of integrating genetic and environmental data to develop more precise risk models, enhance personalized medicine, and ultimately reduce health disparities.
Title: Social Determinants of Health and Lifestyle Risk Factors Modulate Genetic Susceptibility for Women's Health Outcomes. Journal: medRxiv - Genetic and Genomic Medicine.
Pub Date: 2024-07-31; DOI: 10.1101/2024.07.29.24311183
Courtney Jean Smith, Satu Strausz, FinnGen, Jeffrey P Spence, Hanna M Ollila, Jonathan K Pritchard
The human leukocyte antigen (HLA) region plays an important role in human health through involvement in immune cell recognition and maturation. While genetic variation in the HLA region is associated with many diseases, the pleiotropic patterns of these associations have not been systematically investigated. Here, we developed a haplotype approach to investigate disease associations phenome-wide for 412,181 Finnish individuals and 2,459 traits. Across the 1,035 diseases with a GWAS association, we found a 17-fold average per-SNP enrichment of hits in the HLA region. Altogether, we identified 7,649 HLA associations across 647 traits, including 1,750 associations uncovered by haplotype analysis. We find some haplotypes show trade-offs between diseases, while others consistently increase risk across traits, indicating a complex pleiotropic landscape involving a range of diseases. This study highlights the extensive impact of HLA variation on disease risk, and underscores the importance of classical and non-classical genes, as well as non-coding variation.
Title: Haplotype Analysis Reveals Pleiotropic Disease Associations in the HLA Region. Journal: medRxiv - Genetic and Genomic Medicine.
Pub Date: 2024-07-27; DOI: 10.1101/2024.07.26.24310191
Jun Qiao, Kaixin Yao, Yujuan Yuan, Xichen Yang, Le Zhou, Yinqi Long, Miaoran Chen, Wenjia Xie, Yixuan Yang, Yangpo Cao, Siim Pauklin, Jinguo Xu, Yining Yang, Yuliang Feng
Cardiovascular diseases (CVDs) are the leading cause of death worldwide, with chronic kidney disease (CKD) identified as a significant risk factor. CKD is primarily monitored through the estimated glomerular filtration rate (eGFR), calculated using the CKD-EPI equation. Although epidemiological and clinical studies have consistently demonstrated strong associations between eGFR and CVDs, the genetic underpinnings of this relationship remain elusive. Recent genome-wide association studies (GWAS) have highlighted the polygenic nature of these conditions and identified several risk loci correlating with their cross-phenotypes. Nonetheless, the extent and pattern of their pleiotropic effects have yet to be fully elucidated. We analyzed the most comprehensive GWAS summary statistics, involving around 7.5 million individuals, to investigate the shared genetic architectures and the underlying mechanisms between eGFR and CVDs, focusing on single nucleotide polymorphisms (SNPs), genes, biological pathways, and proteins exhibiting pleiotropic effects. Our study identified 508 distinct genomic locations associated with pleiotropic effects across multiple traits, involving 379 unique genes, notably L3MBTL3 (6q23.1), MMP24 (20q11.22), and ABO (9q34.2). Additionally, pathways such as stem cell population maintenance and the glutathione metabolism pathway were pivotal in mediating the relationships between these traits. From the perspective of vertical pleiotropy, our findings suggest a causal relationship between eGFR and conditions such as atrial fibrillation and venous thromboembolism. These insights significantly enhance our understanding of the genetic links between eGFR and CVDs, potentially guiding the development of novel therapeutic strategies and improving the clinical management of these conditions.
Title: Disentangling shared genetic etiologies for kidney function and cardiovascular diseases. Journal: medRxiv - Genetic and Genomic Medicine.
Pub Date: 2024-07-26 DOI: 10.1101/2024.07.25.24311005
Jasper Hof, Doug Speed
Mixed-model association analysis (MMAA) is the preferred tool for performing a genome-wide association study, because it enables robust control of type 1 error and increased statistical power to detect trait-associated loci. However, existing MMAA tools often suffer from long runtimes and high memory requirements. We present LDAK-KVIK, a novel MMAA tool for analyzing quantitative and binary phenotypes. Using simulated phenotypes, we show that LDAK-KVIK produces well-calibrated test statistics for both homogeneous and heterogeneous datasets. LDAK-KVIK is computationally efficient, requiring less than 20 CPU hours and 8 GB of memory to analyze genome-wide data for 350k individuals. These demands are similar to those of REGENIE, one of the most efficient existing MMAA tools, and up to 30 times lower than those of BOLT-LMM, currently the most powerful MMAA tool. When applied to real phenotypes, LDAK-KVIK has the highest power of all tools considered. For example, across 40 quantitative phenotypes from the UK Biobank (average sample size 349k), LDAK-KVIK finds 16% more significant loci than classical linear regression, whereas BOLT-LMM and REGENIE find 15% and 11% more, respectively. LDAK-KVIK can also perform gene-based tests; across the 40 quantitative UK Biobank phenotypes, LDAK-KVIK finds 18% more significant genes than the leading existing tool.
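The classical linear regression baseline against which the tools above are compared can be sketched per SNP as follows. This is not LDAK-KVIK or any mixed-model method: it is ordinary least squares with an intercept and no covariates, written under the assumption that genotypes are coded as allele counts and that no SNP is monomorphic. The function name and the simulated data are hypothetical.

```python
import numpy as np


def linear_regression_assoc(genotypes, phenotype):
    """Single-SNP linear regression for each column of `genotypes`.

    genotypes: (n_individuals, n_snps) array of allele counts (assumed non-constant).
    phenotype: (n_individuals,) quantitative trait.
    Returns per-SNP slope, standard error, and squared t-statistic, which is
    approximately chi-square with 1 degree of freedom in large samples.
    """
    G = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    n = G.shape[0]

    # Centering both sides is equivalent to fitting an intercept.
    Gc = G - G.mean(axis=0)
    yc = y - y.mean()

    sxx = (Gc ** 2).sum(axis=0)                    # per-SNP sum of squares
    beta = Gc.T @ yc / sxx                         # OLS slope per SNP
    resid_ss = (yc ** 2).sum() - beta ** 2 * sxx   # residual sum of squares
    se = np.sqrt(resid_ss / (n - 2) / sxx)         # standard error of the slope
    stat = (beta / se) ** 2
    return beta, se, stat


if __name__ == "__main__":
    # Simulated data for illustration only: SNP 0 has a true effect.
    rng = np.random.default_rng(1)
    G = rng.binomial(2, 0.3, size=(1000, 4))
    y = 0.4 * G[:, 0] + rng.normal(size=1000)
    beta, se, stat = linear_regression_assoc(G, y)
    print(np.round(beta, 3), np.round(stat, 1))
```

Mixed-model tools improve on this baseline by also modeling the combined contribution of other variants, which reduces residual variance and accounts for relatedness and population structure.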
{"title":"LDAK-KVIK performs fast and powerful mixed-model association analysis of quantitative and binary phenotypes","authors":"Jasper Hof, Doug Speed","doi":"10.1101/2024.07.25.24311005","DOIUrl":"https://doi.org/10.1101/2024.07.25.24311005","journal":{"name":"medRxiv - Genetic and Genomic Medicine"},"publicationDate":"2024-07-26"}
Pub Date: 2024-07-26 DOI: 10.1101/2024.07.25.24310931
Sarah A Abramowitz, Kristin Boulier, Karl Keat, Katherine Cardone, Manu Shivakumar, John M. DePaolo, Renae M. Judy, Penn Medicine BioBank, Dokyoon Kim, Daniel J Rader, Marylyn D Ritchie, Benjamin F Voight, Bogdan Pasaniuc, Michael Levin, Scott M. Damrauer
Importance: Polygenic risk scores (PRSs) for coronary artery disease (CAD) are a growing clinical and commercial reality. Whether existing scores provide similar individual-level assessments of disease liability is a critical consideration for clinical implementation that remains uncharacterized.
Objective: Characterize the reliability of CAD PRSs that perform equivalently at the population level at predicting individual-level risk.
Design: Cross-sectional study.
Setting: All of Us Research Program (AOU), Penn Medicine Biobank (PMBB), and UCLA ATLAS Precision Health Biobank.
Participants: Volunteers of diverse genetic backgrounds enrolled in AOU, PMBB, and UCLA with available electronic health record and genotyping data.
Exposures: Polygenic risk for CAD from previously published PRSs and new PRSs developed separately from the testing cohorts.
Main Outcomes and Measures: Sets of CAD PRSs that perform population prediction equivalently were identified by comparing calibration and discrimination (Brier score and AUROC) of generalized linear models of prevalent CAD using Bayesian analysis of variance. Among equivalently performing scores, individual-level agreement between risk estimates was tested with intraclass correlation (ICC) and Light's Kappa, measures of inter-rater reliability.
Results: 50 PRSs were calculated for 171,095 AOU participants. When included in a model of prevalent CAD, 48 scores had practically equivalent Brier scores and AUROCs (region of practical equivalence = 0.02). Across these scores, 84% of participants had at least one score in both the top and bottom risk quintile. Continuous agreement of individual risk predictions from the 48 scores was poor, with an ICC of 0.351 (95% CI: 0.349, 0.352). Agreement between two statistically equivalent scores was moderate, with an ICC of 0.649 (95% CI: 0.646, 0.652). Light's Kappa, used to evaluate consistency of assignment to high-risk thresholds, did not exceed 0.56 (interpreted as 'fair') across statistically and practically equivalent scores. Repeating the analysis among 41,193 PMBB and 50,748 UCLA participants yielded different sets of statistically and practically equivalent scores which also lacked strong individual agreement.
Conclusions and Relevance: Across three diverse biobanks, CAD PRSs that performed equivalently at the population level produced unreliable individual risk estimates. Approaches to clinical implementation of CAD PRSs must consider the potential for discordant individual risk estimates from otherwise indistinguishable scores.
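The quintile discordance reported in the results (the share of participants with at least one score in the top risk quintile and at least one in the bottom) can be sketched as below. This is a minimal illustration, not the authors' pipeline: it assumes each score's quintile cutoffs are taken from that score's own distribution in the cohort, and the function name is hypothetical.

```python
import numpy as np


def top_bottom_discordance(scores):
    """Fraction of participants placed in both the top and the bottom quintile.

    scores: (n_participants, n_scores) array, one column per polygenic score.
    Assumption: quintile cutoffs are computed separately for each score.
    """
    S = np.asarray(scores, dtype=float)

    # Per-score 20th and 80th percentile cutoffs.
    low_cut = np.quantile(S, 0.2, axis=0)
    high_cut = np.quantile(S, 0.8, axis=0)

    # A participant is flagged if ANY score puts them in that quintile.
    any_bottom = (S <= low_cut).any(axis=1)
    any_top = (S >= high_cut).any(axis=1)

    return float(np.mean(any_bottom & any_top))


if __name__ == "__main__":
    # Simulated, weakly correlated scores for illustration only.
    rng = np.random.default_rng(0)
    shared = rng.normal(size=(5000, 1))
    scores = 0.6 * shared + 0.8 * rng.normal(size=(5000, 10))
    print(top_bottom_discordance(scores))
```

Two perfectly concordant scores give a discordance of zero, while adding more imperfectly correlated scores pushes the fraction up. That pattern is consistent with a high discordance across dozens of scores even when each performs similarly at the population level.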
{"title":"Population Performance and Individual Agreement of Coronary Artery Disease Polygenic Risk Scores","authors":"Sarah A Abramowitz, Kristin Boulier, Karl Keat, Katherine Cardone, Manu Shivakumar, John M. DePaolo, Renae M. Judy, Penn Medicine BioBank, Dokyoon Kim, Daniel J Rader, Marylyn D Ritchie, Benjamin F Voight, Bogdan Pasaniuc, Michael Levin, Scott M. Damrauer","doi":"10.1101/2024.07.25.24310931","DOIUrl":"https://doi.org/10.1101/2024.07.25.24310931","journal":{"name":"medRxiv - Genetic and Genomic Medicine"},"publicationDate":"2024-07-26"}