An augmented transformer model trained on protein family specific variant data leads to improved prediction of variants of uncertain significance.

IF 3.8 | Zone 2, Biology | Q2 Genetics & Heredity | Human Genetics | Pub Date: 2025-01-27 | DOI: 10.1007/s00439-025-02727-z
Dinesh Joshi, Swatantra Pradhan, Rakshanda Sajeed, Rajgopal Srinivasan, Sadhna Rana
Citations: 0

Abstract

Variants of uncertain significance (VUS) are variants that lack sufficient evidence to be confidently associated with a disease, which poses a challenge for interpreting genetic testing results. Here we report an improved method for predicting the effect of VUS in the arylsulfatase A (ARSA) gene as part of the Critical Assessment of Genome Interpretation challenge (CAGI6). Our method uses a transfer learning approach that leverages a pre-trained protein language model to predict the impact of mutations on the activity of the ARSA enzyme, whose deficiency is known to cause a rare genetic disorder, metachromatic leukodystrophy. Our framework combines zero-shot log odds scores and embeddings from ESM, an evolutionary scale model, as features for training a supervised model on variants of genes functionally related to ARSA. The zero-shot log odds score captures generic protein properties that the model learned during pre-training on millions of UniProt sequences, while the ESM embeddings for the proteins in the ARSA family capture features specific to that family. We also tested our approach on another enzyme, N-acetyl-glucosaminidase (NAGLU), which belongs to the same superfamily as ARSA. Our results demonstrate that the performance of our family models (augmented ESM models) is comparable to or better than that of the ESM models. The ARSA model compares favorably with the majority of state-of-the-art predictors on the area under the precision-recall curve (AUPRC) metric, and the NAGLU model outperforms all pathogenicity predictors evaluated in this study on AUPRC. The improved AUPRC is relevant in a diagnostic setting, where variant prioritization generally entails identifying a small number of pathogenic variants among a larger number of benign variants. Our results also indicate that, for genes with sparse or no experimental variant impact data, family variant data can serve as proxy training data for making accurate predictions. Attention analysis of active sites and binding sites in the ARSA and NAGLU proteins sheds light on probable mechanisms of pathogenicity at highly attended positions.
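
The feature construction and the evaluation metric described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 20-letter amino acid ordering, and the toy inputs are all assumptions made for illustration, and the per-position logits and embedding vector are taken as given rather than computed by an actual ESM model. The log odds score is computed as log p(mutant) minus log p(wild type) at the mutated position, and AUPRC is approximated by average precision.

```python
import math

# Hypothetical residue ordering; a real model defines its own vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def zero_shot_log_odds(position_logits, wild_type, mutant):
    """Return log p(mutant) - log p(wild_type) at one sequence position.

    position_logits: 20 language-model logits for that position
    (assumed to follow the AMINO_ACIDS ordering above).
    """
    log_probs = log_softmax(position_logits)
    return (log_probs[AMINO_ACIDS.index(mutant)]
            - log_probs[AMINO_ACIDS.index(wild_type)])


def build_features(log_odds, embedding):
    """Concatenate the scalar log odds score with an embedding vector."""
    return [log_odds] + list(embedding)


def average_precision(labels, scores):
    """Approximate AUPRC as average precision.

    labels: 1 for the positive class (e.g. pathogenic), 0 otherwise.
    scores: higher means more likely positive. Ties are not treated
    specially in this sketch.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


if __name__ == "__main__":
    logits = [0.0] * 20
    logits[AMINO_ACIDS.index("P")] = -3.0  # toy value: proline disfavoured
    score = zero_shot_log_odds(logits, wild_type="A", mutant="P")
    print(build_features(score, [0.12, -0.40, 0.05]))
    print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

A strongly negative log odds score means the language model considers the mutant residue much less likely than the wild type, so a pipeline that treats pathogenic variants as the positive class would typically negate the score, or let a supervised model learn the direction, before ranking. The feature vector would then be passed to whatever supervised learner is chosen; the abstract does not specify which one.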

Source journal: Human Genetics (Biology, Genetics)
CiteScore: 10.80
Self-citation rate: 3.80%
Articles published: 94
Review time: 1 month
About the journal: Human Genetics is a monthly journal publishing original and timely articles on all aspects of human genetics. The Journal particularly welcomes articles in the areas of Behavioral genetics, Bioinformatics, Cancer genetics and genomics, Cytogenetics, Developmental genetics, Disease association studies, Dysmorphology, ELSI (ethical, legal and social issues), Evolutionary genetics, Gene expression, Gene structure and organization, Genetics of complex diseases and epistatic interactions, Genetic epidemiology, Genome biology, Genome structure and organization, Genotype-phenotype relationships, Human Genomics, Immunogenetics and genomics, Linkage analysis and genetic mapping, Methods in Statistical Genetics, Molecular diagnostics, Mutation detection and analysis, Neurogenetics, Physical mapping and Population Genetics. Articles reporting animal models relevant to human biology or disease are also welcome. Preference will be given to those articles which address clinically relevant questions or which provide new insights into human biology. Unless reporting entirely novel and unusual aspects of a topic, clinical case reports, cytogenetic case reports, papers on descriptive population genetics, articles dealing with the frequency of polymorphisms or additional mutations within genes in which numerous lesions have already been described, and papers that report meta-analyses of previously published datasets will normally not be accepted. The Journal typically will not consider for publication manuscripts that report merely the isolation, map position, structure, and tissue expression profile of a gene of unknown function unless the gene is of particular interest or is a candidate gene involved in a human trait or disorder.