Pseudo-perplexity in One Fell Swoop for Protein Fitness Estimation
Pranav Kantroo, Günter P. Wagner, Benjamin B. Machta
arXiv:2407.07265 (arXiv - QuanBio - Genomics), published 2024-07-09
https://doi.org/arxiv-2407.07265
Citations: 0
Abstract
Protein language models trained on the masked language modeling objective
learn to predict the identity of hidden amino acid residues within a sequence
using the remaining observable sequence as context. They do so by embedding the
residues into a high-dimensional space that encapsulates the relevant
contextual cues. These embedding vectors serve as an informative
context-sensitive representation that not only aids with the defined training
objective, but can also be used for other tasks by downstream models. We
propose a scheme to use the embeddings of an unmasked sequence to estimate the
corresponding masked probability vectors for all the positions in a single
forward pass through the language model. This One Fell Swoop (OFS) approach
allows us to efficiently estimate the pseudo-perplexity of the sequence, a
measure of the model's uncertainty in its predictions, which can also serve as a
fitness estimate. We find that ESM2 OFS pseudo-perplexity performs nearly as
well as the true pseudo-perplexity at fitness estimation; more notably, it
defines a new state of the art on the ProteinGym Indels benchmark. The strong
performance of this fitness measure prompted us to investigate whether it could be
used to detect the elevated stability reported in reconstructed ancestral
sequences. We find that this measure ranks ancestral reconstructions as more
fit than extant sequences. Finally, we show that the computational efficiency
of the technique allows for the use of Monte Carlo methods that can rapidly
explore functional sequence space.
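The two quantities the abstract describes can be sketched in a few lines. Pseudo-perplexity is the exponential of the average negative log-probability a model assigns to the residue actually present at each position; the OFS trick is that one unmasked forward pass yields all per-position probability vectors at once, instead of one masked pass per position. The sketch below is illustrative only: `toy_prob_vectors` is a hypothetical stand-in for the language-model head (the paper uses ESM2), and the Metropolis step shows how a cheap fitness estimate enables the Monte Carlo exploration mentioned at the end.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_prob_vectors(seq):
    # Hypothetical stand-in for the model head: one "forward pass" that
    # returns, for every position, a probability distribution over the 20
    # amino acids based on crude neighbor counts. A real OFS implementation
    # would read these vectors off a single unmasked pass through ESM2.
    vectors = []
    for i in range(len(seq)):
        weights = {aa: 1.0 for aa in AMINO_ACIDS}
        for j in (i - 1, i + 1):           # favor residues matching neighbors,
            if 0 <= j < len(seq):          # a toy proxy for sequence context
                weights[seq[j]] += 1.0
        total = sum(weights.values())
        vectors.append({aa: w / total for aa, w in weights.items()})
    return vectors

def pseudo_perplexity(seq, prob_vectors):
    # exp of the mean negative log-probability of the observed residues;
    # lower values mean the model is less surprised by the sequence.
    nll = -sum(math.log(prob_vectors[i][aa]) for i, aa in enumerate(seq))
    return math.exp(nll / len(seq))

def ofs_pseudo_perplexity(seq):
    # One Fell Swoop: all probability vectors come from a single pass,
    # rather than L masked passes for a length-L sequence.
    return pseudo_perplexity(seq, toy_prob_vectors(seq))

def metropolis_step(seq, temperature=0.5, rng=random):
    # Propose a single-site substitution; accept via the Metropolis
    # criterion on the change in pseudo-perplexity (used as a fitness).
    pos = rng.randrange(len(seq))
    proposal = seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1:]
    delta = ofs_pseudo_perplexity(proposal) - ofs_pseudo_perplexity(seq)
    if delta <= 0 or rng.random() < math.exp(-delta / temperature):
        return proposal
    return seq
```

Because one forward pass replaces L masked passes, each Monte Carlo proposal costs a single model evaluation, which is what makes long sampling chains over sequence space practical.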