Fine-tuning Protein Language Models with Deep Mutational Scanning improves Variant Effect Prediction
Aleix Lafita, Ferran Gonzalez, Mahmoud Hossam, Paul Smyth, Jacob Deasy, Ari Allyn-Feuer, Daniel Seaton, Stephen Young
arXiv - QuanBio - Genomics, published 2024-05-10, DOI: arxiv-2405.06729
Abstract
Protein Language Models (PLMs) have emerged as performant and scalable tools
for predicting the functional impact and clinical significance of
protein-coding variants, but they still lag behind experimental accuracy. Here, we
present a novel fine-tuning approach to improve the performance of PLMs with
experimental maps of variant effects from Deep Mutational Scanning (DMS) assays
using a Normalised Log-odds Ratio (NLR) head. We find consistent improvements
on a held-out protein test set and on independent DMS and clinical variant
annotation benchmarks from ProteinGym and ClinVar. These findings demonstrate
that DMS is a promising source of sequence diversity and supervised training
data for improving the performance of PLMs for variant effect prediction.
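
The abstract names a Normalised Log-odds Ratio (NLR) head but does not define it. As a point of orientation only, the sketch below shows the plain log-odds ratio that PLM-based variant scoring commonly builds on: the log-probability of the mutant residue minus that of the wild-type residue at the mutated position. The normalisation step and the fine-tuning procedure are not specified here and are not reproduced; the function names, the residue ordering, and the toy logits are all assumptions made for illustration.

```python
import math

# Assumed residue ordering for the output vocabulary (hypothetical; a real
# PLM tokenizer defines its own ordering and extra special tokens).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def log_odds_ratio(position_logits, wild_type, mutant):
    """Unnormalised log-odds score for one substitution.

    position_logits: model logits over AMINO_ACIDS at the mutated position.
    Returns log p(mutant) - log p(wild_type); negative values mean the
    model finds the mutant residue less likely than the wild type.
    """
    log_probs = log_softmax(position_logits)
    wt_idx = AMINO_ACIDS.index(wild_type)
    mut_idx = AMINO_ACIDS.index(mutant)
    return log_probs[mut_idx] - log_probs[wt_idx]


# Toy example: the model strongly prefers the wild-type residue 'A'.
toy_logits = [0.0] * len(AMINO_ACIDS)
toy_logits[AMINO_ACIDS.index("A")] = 2.0
score = log_odds_ratio(toy_logits, wild_type="A", mutant="C")
print(f"log-odds ratio for A->C: {score:.3f}")
```

In a supervised setting such as the one described above, a score of this kind would be compared against measured DMS effects; how the paper normalises the score and defines its training loss is left to the full text.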