Mendelian randomization is a powerful method for inferring causal relationships. However, obtaining suitable genetic instrumental variables is often challenging due to gene interaction, linkage, and pleiotropy. We propose Bayesian network-based Mendelian randomization (BNMR), a Bayesian causal learning and inference framework using individual-level data. BNMR employs the random graph forest, an ensemble Bayesian network structural learning process, to prioritize candidate genetic variants and select appropriate instrumental variables, and then obtains a pleiotropy-robust estimate by incorporating a shrinkage prior in the Bayesian framework. Simulations demonstrate BNMR can efficiently reduce the false-positive discoveries in variant selection, and outperforms existing MR methods in terms of accuracy and statistical power in effect estimation. With application to the UK Biobank, BNMR exhibits its capacity in handling modern genomic data, and reveals the causal relationships from hematological traits to blood pressures and psychiatric disorders. Its effectiveness in handling complex genetic structures and modern genomic data highlights the potential to facilitate real-world evidence studies, making it a promising tool for advancing our understanding of causal mechanisms.
Title: Bayesian network-based Mendelian randomization for variant prioritization and phenotypic causal inference. Authors: Jianle Sun, Jie Zhou, Yuqiao Gong, Chongchen Pang, Yanran Ma, Jian Zhao, Zhangsheng Yu, Yue Zhang. Journal: Human Genetics, pages 1081-1094. Pub Date: 2024-10-01. DOI: 10.1007/s00439-024-02640-x
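The instrumental-variable idea behind Mendelian randomization can be illustrated with a minimal sketch. This is not BNMR: there is no Bayesian network, variant selection, or shrinkage prior here, only the classical single-instrument ratio estimate on individual-level data. The toy genotype, confounder, and effect sizes (0.5, 0.3, 2.0) are hypothetical values chosen so that the instrument is exactly uncorrelated with the confounder.

```python
def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def ratio_estimate(g, x, y):
    """Ratio (Wald) estimate of the effect of x on y using instrument g."""
    return covariance(g, y) / covariance(g, x)


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x, ignoring confounding."""
    return covariance(x, y) / covariance(x, x)


# Hypothetical toy data: allele counts and an unobserved confounder that is
# uncorrelated with genotype in this sample.
genotype = [0, 0, 1, 1, 2, 2]
confounder = [1, -1, 1, -1, 1, -1]

# Assumed data-generating model: the true causal effect of exposure on
# outcome is 0.3, and the confounder affects both variables.
exposure = [0.5 * g + u for g, u in zip(genotype, confounder)]
outcome = [0.3 * x + 2.0 * u for x, u in zip(exposure, confounder)]

iv_effect = ratio_estimate(genotype, exposure, outcome)
naive_effect = ols_slope(exposure, outcome)
print(f"ratio estimate: {iv_effect:.3f}, naive regression slope: {naive_effect:.3f}")
```

In this toy setting the ratio estimate recovers the assumed effect while the naive regression slope is inflated by the confounder. With pleiotropic or linked variants the ratio estimate is itself biased, which is the problem the abstract says BNMR is designed to address.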
Pub Date: 2024-10-01; Epub Date: 2024-03-20; DOI: 10.1007/s00439-024-02661-6
Mengling Qi, Haoyang Zhang, Xuehao Xiu, Dan He, David N Cooper, Yuanhao Yang, Huiying Zhao
Aims: Many studies have indicated that the use of diabetes medications can influence the electrocardiogram (ECG), which remains the simplest and fastest tool for assessing cardiac function. However, few studies have explored the role of genetic factors in determining the relationship between the use of diabetes medications and ECG trace characteristics (ETCs). Methods: Genome-wide association studies (GWAS) were performed for 168 ETCs extracted from the 12-lead ECGs of 42,340 Europeans in the UK Biobank. The genetic correlations, causal relationships, and phenotypic relationships of these ETCs with medication usage, as well as with the risk of cardiovascular diseases (CVDs), were estimated by linkage disequilibrium score regression (LDSC), Mendelian randomization (MR), and regression models, respectively. Results: The GWAS identified 124 independent single nucleotide polymorphisms (SNPs) that were study-wise and genome-wide significantly associated with at least one ETC. Regression models and LDSC identified significant phenotypic and genetic correlations of T-wave area in lead aVR (aVR_T-area) with usage of diabetes medications (ATC code: A10 drugs, and metformin), and with the risks of ischemic heart disease (IHD) and coronary atherosclerosis (CA). MR analyses support a putative causal effect of the use of diabetes medications on decreasing aVR_T-area, and on increasing risk of IHD and CA. Conclusion: Patients taking diabetes medications are prone to have decreased aVR_T-area and an increased risk of IHD and CA. The aVR_T-area is therefore a potential ECG marker for pre-clinical prediction of IHD and CA in patients taking diabetes medications.
Title: Genetic evidence for T-wave area from 12-lead electrocardiograms to monitor cardiovascular diseases in patients taking diabetes medications. Journal: Human Genetics, pages 1095-1108.
Pancreatic ductal adenocarcinoma (PDAC) is a malignant tumor with poor prognosis and high mortality. Although many studies have explored potential prognostic markers using traditional RNA sequencing (RNA-Seq) data, they have not achieved good predictive performance. To explore the signaling pathways that may underlie differences in prognosis, we identified differentially expressed genes from one scRNA-seq cohort and four GEO cohorts. Cox and Lasso regression analyses then showed that 12 genes were independent prognostic factors for PDAC. AUC and calibration curve analyses showed that the prognostic model had good discrimination and calibration. The high-risk group had a higher proportion of gene mutations than the low-risk group. Immune infiltration analysis revealed differences in macrophages and monocytes between the two groups. Prognosis-related genes were mainly distributed in fibroblasts, macrophages and type 2 ductal cells. Cell communication analysis showed strong communication between cancer-associated fibroblasts (CAFs) and type 2 ductal cells, with collagen formation as the main interaction pathway.
Title: The crucial prognostic signaling pathways of pancreatic ductal adenocarcinoma were identified by single-cell and bulk RNA sequencing data. Authors: Wenwen Wang, Guo Chen, Wenli Zhang, Xihua Zhang, Manli Huang, Chen Li, Ling Wang, Zifan Lu, Jielai Xia. Journal: Human Genetics, pages 1109-1129. Pub Date: 2024-10-01. DOI: 10.1007/s00439-024-02663-4. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11485037/pdf/
Cisplatin-induced acute kidney injury (CP-AKI) is a common complication in cancer patients. Although ferroptosis is believed to contribute to the progression of CP-AKI, its mechanisms remain incompletely understood. In this study, after initially processing the individual omics datasets, we integrated multi-omics data to construct a ferroptosis network in the kidney, which identified TACSTD2 as the key driver. In vitro and in vivo results showed that TACSTD2 was notably upregulated in cisplatin-treated kidneys and BUMPT cells. Overexpression of TACSTD2 accelerated ferroptosis, while its gene disruption decelerated ferroptosis, likely mediated by its potential downstream targets HMGB1, IRF6, and LCN2. Drug prediction and molecular docking further suggested that drugs targeting TACSTD2, such as parthenolide, progesterone, premarin, estradiol and rosiglitazone, may have therapeutic potential in CP-AKI. Our findings suggest a significant association between ferroptosis and the development of CP-AKI, with TACSTD2 playing a crucial role in modulating ferroptosis, which provides novel perspectives on the pathogenesis and treatment of CP-AKI.
Title: Identification of TACSTD2 as novel therapeutic targets for cisplatin-induced acute kidney injury by multi-omics data integration. Authors: Zebin Deng, Zheng Dong, Yinhuai Wang, Yingbo Dai, Jiachen Liu, Fei Deng. Journal: Human Genetics, pages 1061-1080. Pub Date: 2024-10-01. DOI: 10.1007/s00439-024-02641-w
Pub Date: 2024-10-01; Epub Date: 2024-07-06; DOI: 10.1007/s00439-024-02684-z
Julia Ramírez, Stefan van Duijvenboden, William J Young, Yutang Chen, Tania Usman, Michele Orini, Pier D Lambiase, Andrew Tinker, Christopher G Bell, Andrew P Morris, Patricia B Munroe
An elevated resting heart rate (RHR) is associated with increased cardiovascular mortality. Genome-wide association studies (GWAS) have identified more than 350 loci. Uniquely, in this study we applied genetic fine-mapping leveraging tissue-specific chromatin segmentation and colocalization analyses to identify causal variants and candidate effector genes for RHR. We used RHR GWAS summary statistics from 388,237 individuals of European ancestry from UK Biobank and performed fine mapping using publicly available genomic annotation datasets. High-confidence causal variants (accounting for > 75% posterior probability) were identified, and we collated candidate effector genes using a multi-omics approach that combined evidence from colocalization with molecular quantitative trait loci (QTLs) and long-range chromatin interaction analyses. Finally, we performed druggability analyses to investigate drug repurposing opportunities. The fine mapping pipeline indicated 442 distinct RHR signals. For 90 signals, a single variant was identified as a high-confidence causal variant, of which 22 were annotated as missense. In trait-relevant tissues, 39 signals colocalized with cis-expression QTLs (eQTLs), 3 with cis-protein QTLs (pQTLs), and 75 had promoter interactions via Hi-C. In total, 262 candidate genes were highlighted (79% had promoter interactions, 15% had a colocalized eQTL, 8% had a missense variant and 1% had a colocalized pQTL), and, for the first time, we observed enrichment in nervous system pathways. Druggability analyses highlighted ACHE, CALCRL, MYT1 and TDP1 as potential targets. Our genetic fine-mapping pipeline prioritized 262 candidate genes for RHR that warrant further investigation in functional studies, and we provide potential therapeutic targets to reduce RHR and cardiovascular mortality.
Title: Fine mapping of candidate effector genes for heart rate. Journal: Human Genetics, pages 1207-1221. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11485034/pdf/
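The notion of a high-confidence causal variant in the fine-mapping abstract above can be sketched in a few lines. This is an illustrative sketch only, not the authors' pipeline: the variant names and posterior probabilities are hypothetical, the 0.75 cut-off mirrors the "> 75% posterior probability" wording in the abstract, and the 95% credible-set coverage is an assumption added here for illustration.

```python
def credible_set(posteriors, coverage=0.95):
    """Smallest set of variants whose posterior probabilities sum to at least
    `coverage`, taking variants in decreasing order of probability."""
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    chosen = []
    total = 0.0
    for variant, prob in ranked:
        chosen.append(variant)
        total += prob
        if total >= coverage:
            break
    return chosen


def high_confidence_variant(posteriors, threshold=0.75):
    """Return the top variant if it alone exceeds `threshold`, else None."""
    variant, prob = max(posteriors.items(), key=lambda item: item[1])
    return variant if prob > threshold else None


# Hypothetical posterior probabilities for two association signals.
signal_a = {"var1": 0.82, "var2": 0.10, "var3": 0.05, "var4": 0.03}
signal_b = {"var5": 0.40, "var6": 0.35, "var7": 0.25}

for name, signal in (("signal_a", signal_a), ("signal_b", signal_b)):
    print(name, credible_set(signal), high_confidence_variant(signal))
```

In this toy input, one signal is dominated by a single variant and the other spreads its probability across three variants, so only the first yields a single high-confidence candidate.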
Pub Date: 2024-09-14; DOI: 10.1007/s00439-024-02701-1
Sandeep Acharya, Shu Liao, Wooseok J. Jung, Yu S. Kang, Vaha Akbary Moghaddam, Mary F. Feitosa, Mary K. Wojczynski, Shiow Lin, Jason A. Anema, Karen Schwander, Jeff O. Connell, Michael A. Province, Michael R. Brent
The Long Life Family Study (LLFS) enrolled 4953 participants in 539 pedigrees displaying exceptional longevity. To identify genetic mechanisms that affect cardiovascular risks in the LLFS population, we developed a multi-omics integration pipeline and applied it to 11 traits associated with cardiovascular risks. Using our pipeline, we aggregated gene-level statistics from rare-variant analysis, GWAS, and gene expression-trait association by Correlated Meta-Analysis (CMA). Across all traits, CMA identified 64 significant genes after Bonferroni correction (p ≤ 2.8 × 10⁻⁷), 29 of which replicated in the Framingham Heart Study (FHS) cohort. Notably, 20 of the 29 replicated genes do not have a previously known trait-associated variant in the GWAS Catalog within 50 kb. Thirteen modules in Protein–Protein Interaction (PPI) networks are significantly enriched in genes with low meta-analysis p-values for at least one trait, three of which are replicated in the FHS cohort. The functional annotation of genes in these modules showed a significant over-representation of trait-related biological processes including sterol transport, protein-lipid complex remodeling, and immune response regulation. Among major findings, our results suggest a role of triglyceride-associated and mast-cell functional genes FCER1A, MS4A2, GATA2, HDC, and HRH4 in atherosclerosis risks. Our findings also suggest that lower expression of ATG2A, a gene we found to be associated with BMI, may be both a cause and consequence of obesity. Finally, our results suggest that ENPP3 may play an intermediary role in triglyceride-induced inflammation. Our pipeline is freely available and implemented in the Nextflow workflow language, making it easily runnable on any compute platform (https://nf-co.re/omicsgenetraitassociation).
Title: A methodology for gene level omics-WAS integration identifies genes influencing traits associated with cardiovascular risks: the Long Life Family Study. Journal: Human Genetics.
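The Bonferroni threshold quoted in the LLFS abstract above is simple arithmetic, sketched here. The abstract does not state the family-wise error rate or the number of tests, so the 0.05 level below is an assumption, and the implied test count is only what that assumption would give.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that controls the family-wise
    error rate at `alpha` across `n_tests` tests."""
    return alpha / n_tests


def implied_number_of_tests(alpha, threshold):
    """Invert the Bonferroni rule: how many tests a given threshold implies."""
    return alpha / threshold


# Assumption: family-wise error rate of 0.05 (not stated in the abstract).
assumed_alpha = 0.05
reported_threshold = 2.8e-7

implied = implied_number_of_tests(assumed_alpha, reported_threshold)
print(f"implied number of tests: about {implied:,.0f}")
print(f"threshold for 178,571 tests: {bonferroni_threshold(assumed_alpha, 178571):.2e}")
```

Under that assumption the reported threshold corresponds to roughly 1.8 × 10⁵ gene-trait tests, which is the order of magnitude one would expect from testing many thousands of genes across 11 traits.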
Pub Date: 2024-08-08; DOI: 10.1007/s00439-024-02695-w
Yuanfei Sun, Yang Shen
As emerging variant effect predictors, protein language models (pLMs) learn the evolutionary distribution of functional sequences to capture the fitness landscape. Considering that variant effects are manifested through biological contexts beyond sequence (such as structure), we first assess how much structural context is learned by sequence-only pLMs and how it affects variant effect prediction. We then establish the need to inject protein structural context into pLMs purposely and controllably. We thus introduce a framework of structure-informed pLMs (SI-pLMs) by extending masked sequence denoising to cross-modality denoising for both sequence and structure. Numerical results over deep mutagenesis scanning benchmarks show that our SI-pLMs, even when using smaller models and less data, are robustly top performers against competing methods including other pLMs, which shows that introducing biological context can be more effective at capturing the fitness landscape than simply using larger models or bigger data. Case studies reveal that, compared to sequence-only pLMs, SI-pLMs can be better at capturing the fitness landscape because (a) learned embeddings of low- and high-fitness sequences can be more separable and (b) learned amino-acid distributions of functionally and evolutionarily conserved residues can be of much lower entropy, and thus much more conserved, than those of other residues. Our SI-pLMs are applicable to revising any sequence-only pLM through model architecture and training objectives. They do not require structure data as model inputs for variant effect prediction and only use structures as a context provider and model regularizer during training.
Title: Structure-informed protein language models are robust predictors for variant effects. Journal: Human Genetics.
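The abstract above describes conserved residues as having lower-entropy amino-acid distributions. A minimal sketch of that measure follows, using Shannon entropy over a 20-letter alphabet. The two probability vectors are hypothetical and do not come from any model in the paper.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution.
    Zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical predicted amino-acid distributions at two positions.
# A conserved position concentrates probability on one residue.
conserved_position = [0.81] + [0.01] * 19
# A variable position spreads probability evenly over all 20 residues.
variable_position = [1 / 20] * 20

print(f"conserved: {shannon_entropy(conserved_position):.2f} bits")
print(f"variable:  {shannon_entropy(variable_position):.2f} bits")
```

The uniform distribution attains the maximum of log2(20) bits, so any position with a strongly preferred residue scores lower.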
Pub Date: 2024-08-07; DOI: 10.1007/s00439-024-02680-3
Jing Zhang, Lisa Kinch, Panagiotis Katsonis, Olivier Lichtarge, Milind Jagota, Yun S Song, Yuanfei Sun, Yang Shen, Nurdan Kuru, Onur Dereli, Ogun Adebali, Muttaqi Ahmad Alladin, Debnath Pal, Emidio Capriotti, Maria Paola Turina, Castrense Savojardo, Pier Luigi Martelli, Giulia Babbi, Rita Casadio, Fabrizio Pucci, Marianne Rooman, Gabriel Cia, Matsvei Tsishyn, Alexey Strokach, Zhiqiang Hu, Warren van Loggerenberg, Frederick P Roth, Predrag Radivojac, Steven E Brenner, Qian Cong, Nick V Grishin
This paper presents an evaluation of predictions submitted for the "HMBS" challenge, a component of the sixth round of the Critical Assessment of Genome Interpretation (CAGI6), held in 2021. The challenge required participants to predict the effects of missense variants of the human HMBS gene on yeast growth. The HMBS enzyme, critical for the biosynthesis of heme in eukaryotic cells, is highly conserved among eukaryotes. Despite the application of a variety of algorithms and methods, the performance of predictors was relatively similar, with Kendall's tau correlation coefficients between predictions and experimental scores around 0.3 for a majority of submissions. Notably, the median correlation (≥ 0.34) observed among these predictors, especially the top predictions from different groups, was greater than the correlation observed between their predictions and the actual experimental results. Most predictors were moderately successful in distinguishing between deleterious and benign variants, as evidenced by an area under the receiver operating characteristic (ROC) curve (AUC) of approximately 0.7. Compared with the two most recent rounds of CAGI competitions, more predictors outperformed the baseline predictor, which is based solely on amino acid frequencies. Nevertheless, the overall accuracy of predictions is still far short of the positive control, which is derived from experimental scores, indicating the need for considerable improvements in the field. The most inaccurately predicted variants in this round were associated with the insertion loop, which is absent in many orthologs, suggesting that the predictors still rely heavily on information from multiple sequence alignments.
Title: Assessing predictions on fitness effects of missense variants in HMBS in CAGI6. Journal: Human Genetics.
Pub Date: 2024-08-01, DOI: 10.1007/s00439-024-02691-0
Samskruthi Reddy Padigepati, David A Stafford, Christopher A Tan, Melanie R Silvis, Kirsty Jamieson, Andrew Keyser, Paola Alejandra Correa Nunez, John M Nicoludis, Toby Manders, Laure Fresard, Yuya Kobayashi, Carlos L Araya, Swaroop Aradhya, Britt Johnson, Keith Nykamp, Jason A Reuter
As the adoption and scope of genetic testing continue to expand, interpreting the clinical significance of DNA sequence variants at scale remains a formidable challenge, with a high proportion classified as variants of uncertain significance (VUSs). Genetic testing laboratories have historically relied, in part, on functional data from academic literature to support variant classification. High-throughput functional assays or multiplex assays of variant effect (MAVEs), designed to assess the effects of DNA variants on protein stability and function, represent an important and increasingly available source of evidence for variant classification, but their potential is just beginning to be realized in clinical lab settings. Here, we describe a framework for generating, validating and incorporating data from MAVEs into a semi-quantitative variant classification method applied to clinical genetic testing. Using single-cell gene expression measurements, cellular evidence models were built to assess the effects of DNA variation in 44 genes of clinical interest. This framework was also applied to models for an additional 22 genes with previously published MAVE datasets. In total, modeling data was incorporated from 24 genes into our variant classification method. These data contributed evidence for classifying 4043 observed variants in over 57,000 individuals. Genetic testing laboratories are uniquely positioned to generate, analyze, validate, and incorporate evidence from high-throughput functional data and ultimately enable the use of these data to provide definitive clinical variant classifications for more patients.
{"title":"Scalable approaches for generating, validating and incorporating data from high-throughput functional assays to improve clinical variant classification.","authors":"Samskruthi Reddy Padigepati, David A Stafford, Christopher A Tan, Melanie R Silvis, Kirsty Jamieson, Andrew Keyser, Paola Alejandra Correa Nunez, John M Nicoludis, Toby Manders, Laure Fresard, Yuya Kobayashi, Carlos L Araya, Swaroop Aradhya, Britt Johnson, Keith Nykamp, Jason A Reuter","doi":"10.1007/s00439-024-02691-0","DOIUrl":"10.1007/s00439-024-02691-0","url":null,"abstract":"<p><p>As the adoption and scope of genetic testing continue to expand, interpreting the clinical significance of DNA sequence variants at scale remains a formidable challenge, with a high proportion classified as variants of uncertain significance (VUSs). Genetic testing laboratories have historically relied, in part, on functional data from academic literature to support variant classification. High-throughput functional assays or multiplex assays of variant effect (MAVEs), designed to assess the effects of DNA variants on protein stability and function, represent an important and increasingly available source of evidence for variant classification, but their potential is just beginning to be realized in clinical lab settings. Here, we describe a framework for generating, validating and incorporating data from MAVEs into a semi-quantitative variant classification method applied to clinical genetic testing. Using single-cell gene expression measurements, cellular evidence models were built to assess the effects of DNA variation in 44 genes of clinical interest. This framework was also applied to models for an additional 22 genes with previously published MAVE datasets. In total, modeling data was incorporated from 24 genes into our variant classification method. These data contributed evidence for classifying 4043 observed variants in over 57,000 individuals. 
Genetic testing laboratories are uniquely positioned to generate, analyze, validate, and incorporate evidence from high-throughput functional data and ultimately enable the use of these data to provide definitive clinical variant classifications for more patients.</p>","PeriodicalId":13175,"journal":{"name":"Human Genetics","volume":" ","pages":"995-1004"},"PeriodicalIF":3.8,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11303574/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141859632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-01, Epub Date: 2024-07-19, DOI: 10.1007/s00439-024-02688-9
Elisabeth Bosch, Esther Güse, Philipp Kirchner, Andreas Winterpacht, Mona Walther, Marielle Alders, Jennifer Kerkhof, Arif B Ekici, Heinrich Sticht, Bekim Sadikovic, André Reis, Georgia Vasileiou
ARID1B is the most frequently mutated gene in Coffin-Siris syndrome (CSS). To date, the vast majority of causative variants reported in ARID1B are truncating, leading to nonsense-mediated mRNA decay. In the absence of experimental data, only a few ARID1B amino acid substitutions have been classified as pathogenic, mainly based on clinical data and their de novo occurrence, while most others are currently interpreted as variants of unknown significance. The present study substantiates the pathogenesis of ARID1B non-truncating/NMD-escaping variants located in the SMARCA4-interacting EHD2 and DNA-binding ARID domains. Overexpression assays in cell lines revealed that the majority of EHD2 variants lead to protein misfolding and formation of cytoplasmic aggresomes surrounded by vimentin cage-like structures and co-localizing with the microtubule organisation center. ARID domain variants exhibited not only aggresomes, but also nuclear aggregates, demonstrating robust pathological effects. Protein levels were not compromised, as shown by quantitative western blot analysis. In silico structural analysis predicted the exposure of amylogenic segments in both domains due to the nearby variants, likely causing this aggregation. Genome-wide transcriptome and methylation analysis in affected individuals revealed expression and methylome patterns consistent with those of the pathogenic haploinsufficiency ARID1B alterations in CSS cases. These results further support pathogenicity and indicate two approaches for disambiguation of such variants in everyday practice. The few affected individuals harbouring EHD2 non-truncating variants described to date exhibit mild CSS clinical traits. In summary, this study paves the way for the re-evaluation of previously unclear ARID1B non-truncating variants and opens a new era in CSS genetic diagnosis.
{"title":"The missing link: ARID1B non-truncating variants causing Coffin-Siris syndrome due to protein aggregation.","authors":"Elisabeth Bosch, Esther Güse, Philipp Kirchner, Andreas Winterpacht, Mona Walther, Marielle Alders, Jennifer Kerkhof, Arif B Ekici, Heinrich Sticht, Bekim Sadikovic, André Reis, Georgia Vasileiou","doi":"10.1007/s00439-024-02688-9","DOIUrl":"10.1007/s00439-024-02688-9","url":null,"abstract":"<p><p>ARID1B is the most frequently mutated gene in Coffin-Siris syndrome (CSS). To date, the vast majority of causative variants reported in ARID1B are truncating, leading to nonsense-mediated mRNA decay. In the absence of experimental data, only a few ARID1B amino acid substitutions have been classified as pathogenic, mainly based on clinical data and their de novo occurrence, while most others are currently interpreted as variants of unknown significance. The present study substantiates the pathogenesis of ARID1B non-truncating/NMD-escaping variants located in the SMARCA4-interacting EHD2 and DNA-binding ARID domains. Overexpression assays in cell lines revealed that the majority of EHD2 variants lead to protein misfolding and formation of cytoplasmic aggresomes surrounded by vimentin cage-like structures and co-localizing with the microtubule organisation center. ARID domain variants exhibited not only aggresomes, but also nuclear aggregates, demonstrating robust pathological effects. Protein levels were not compromised, as shown by quantitative western blot analysis. In silico structural analysis predicted the exposure of amylogenic segments in both domains due to the nearby variants, likely causing this aggregation. Genome-wide transcriptome and methylation analysis in affected individuals revealed expression and methylome patterns consistent with those of the pathogenic haploinsufficiency ARID1B alterations in CSS cases. 
These results further support pathogenicity and indicate two approaches for disambiguation of such variants in everyday practice. The few affected individuals harbouring EHD2 non-truncating variants described to date exhibit mild CSS clinical traits. In summary, this study paves the way for the re-evaluation of previously unclear ARID1B non-truncating variants and opens a new era in CSS genetic diagnosis.</p>","PeriodicalId":13175,"journal":{"name":"Human Genetics","volume":" ","pages":"965-978"},"PeriodicalIF":3.8,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11303441/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141723537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}