Expert-guided protein language models enable accurate and blazingly fast fitness prediction.

Bioinformatics (Oxford, England), IF 5.4. Pub Date: 2024-11-01. DOI: 10.1093/bioinformatics/btae621

Céline Marquet, Julius Schlensok, Marina Abakarova, Burkhard Rost, Elodie Laine



Abstract

Motivation: Exhaustive experimental annotation of the effect of all known protein variants remains daunting and expensive, stressing the need for scalable effect predictions. We introduce VespaG, a blazingly fast missense amino acid variant effect predictor, leveraging protein language model (pLM) embeddings as input to a minimal deep learning model.
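To make the idea of "pLM embeddings as input to a minimal deep learning model" concrete, the following is a toy sketch rather than the VespaG architecture: a one-hidden-layer feed-forward network that maps each per-residue embedding to 20 substitution scores. The layer sizes, the random untrained weights, and the names `predict_landscape`, `dense`, and `random_matrix` are all assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would load learned values instead."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def dense(vector, weights, bias):
    """Affine layer: vector (n,) times weights (n x m) plus bias (m,)."""
    return [
        sum(v * row[j] for v, row in zip(vector, weights)) + bias[j]
        for j in range(len(bias))
    ]


def predict_landscape(embeddings, w1, b1, w2, b2):
    """Map per-residue embeddings (L x d) to substitution scores (L x 20)."""
    landscape = []
    for residue_embedding in embeddings:
        hidden = [max(0.0, h) for h in dense(residue_embedding, w1, b1)]  # ReLU
        landscape.append(dense(hidden, w2, b2))
    return landscape


rng = random.Random(0)
EMBED_DIM, HIDDEN_DIM = 8, 4  # hypothetical toy sizes, far smaller than real pLM embeddings
w1, b1 = random_matrix(EMBED_DIM, HIDDEN_DIM, rng), [0.0] * HIDDEN_DIM
w2, b2 = random_matrix(HIDDEN_DIM, len(AMINO_ACIDS), rng), [0.0] * len(AMINO_ACIDS)

# Stand-in for the pLM output of a 5-residue protein (random numbers, not real embeddings).
toy_embeddings = random_matrix(5, EMBED_DIM, rng)
scores = predict_landscape(toy_embeddings, w1, b1, w2, b2)
```

Each row of `scores` holds one value per amino acid in `AMINO_ACIDS` for one sequence position. Because the expensive step (computing embeddings) happens once per protein and the network on top is tiny, a design of this kind would plausibly account for the speed the authors report.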

Results: To overcome the sparsity of experimental training data, we created a dataset of 39 million single amino acid variants from the human proteome, applying the multiple sequence alignment-based effect predictor GEMME as a pseudo standard-of-truth. This setup increases interpretability compared to the baseline pLM and is easily retrainable with novel or updated pLMs. Assessed against the ProteinGym benchmark (217 multiplex assays of variant effect, MAVE, with 2.5 million variants), VespaG achieved a mean Spearman correlation of 0.48 ± 0.02, matching top-performing methods evaluated on the same data. VespaG has the advantage of being orders of magnitude faster, predicting all mutational landscapes of all proteins in proteomes such as Homo sapiens or Drosophila melanogaster in under 30 min on a consumer laptop (12-core CPU, 16 GB RAM).
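The headline metric, a mean Spearman correlation across assays, can be illustrated with a minimal sketch. It assumes a plain unweighted average over assays, which may differ from the benchmark's own aggregation, and it does not reproduce the reported ± 0.02 error estimate. The function names and the toy data are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1  # mean of 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(predicted, measured):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(predicted), average_ranks(measured))


def mean_spearman(assays):
    """Unweighted mean of per-assay Spearman correlations (an assumption)."""
    return mean(spearman(pred, meas) for pred, meas in assays.values())


# Made-up numbers purely for illustration: (predicted scores, measured fitness).
toy_assays = {
    "assay_a": ([0.1, 0.4, 0.35, 0.8], [1.0, 2.0, 3.0, 4.0]),
    "assay_b": ([0.9, 0.2, 0.5], [3.0, 1.0, 2.0]),
}
overall = mean_spearman(toy_assays)
```

Spearman correlation compares rankings rather than raw values, so a predictor is rewarded for ordering variants correctly even when its scores are on a different scale from the experimental measurements.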

Availability and implementation: VespaG is available freely at https://github.com/jschlensok/vespag. The associated training data and predictions are available at https://doi.org/10.5281/zenodo.11085958.


Supplementary information: Supplementary data are available at Bioinformatics online.
