Pseudo-perplexity in One Fell Swoop for Protein Fitness Estimation

arXiv - QuanBio - Genomics, Pub Date: 2024-07-09, DOI: arxiv-2407.07265

Pranav Kantroo, Günter P. Wagner, Benjamin B. Machta





Pseudo-perplexity in One Fell Swoop for Protein Fitness Estimation

Protein language models trained on the masked language modeling objective learn to predict the identity of hidden amino acid residues within a sequence using the remaining observable sequence as context. They do so by embedding the residues into a high dimensional space that encapsulates the relevant contextual cues. These embedding vectors serve as an informative context-sensitive representation that not only aids with the defined training objective, but can also be used for other tasks by downstream models. We propose a scheme to use the embeddings of an unmasked sequence to estimate the corresponding masked probability vectors for all the positions in a single forward pass through the language model. This One Fell Swoop (OFS) approach allows us to efficiently estimate the pseudo-perplexity of the sequence, a measure of the model's uncertainty in its predictions, that can also serve as a fitness estimate. We find that ESM2 OFS pseudo-perplexity performs nearly as well as the true pseudo-perplexity at fitness estimation, and more notably it defines a new state of the art on the ProteinGym Indels benchmark. The strong performance of the fitness measure prompted us to investigate if it could be used to detect the elevated stability reported in reconstructed ancestral sequences. We find that this measure ranks ancestral reconstructions as more fit than extant sequences. Finally, we show that the computational efficiency of the technique allows for the use of Monte Carlo methods that can rapidly explore functional sequence space.
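As a rough illustration of the quantity the abstract refers to, the sketch below computes pseudo-perplexity under its usual definition: the exponential of the negative mean log-probability that the model assigns to each true residue when that position is hidden. It assumes the per-position probability vectors are already available; how they are produced (one masked pass per position, or a single-pass estimate such as OFS) is deliberately left out, because the abstract does not specify the estimator. The names `pseudo_perplexity`, `toy_probs`, and `AMINO_ACIDS` are hypothetical and used only for this example.

```python
import math
from typing import List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def pseudo_perplexity(sequence: str, probs: Sequence[Sequence[float]]) -> float:
    """Pseudo-perplexity of `sequence` from per-position probability vectors.

    probs[i][k] is the probability of residue AMINO_ACIDS[k] at position i
    when position i is hidden. Obtaining these vectors from a language model
    is outside the scope of this sketch.
    """
    if len(sequence) != len(probs):
        raise ValueError("need exactly one probability vector per residue")
    log_likelihood = 0.0
    for residue, p in zip(sequence, probs):
        # Log-probability assigned to the residue actually present.
        log_likelihood += math.log(p[AMINO_ACIDS.index(residue)])
    # Exponential of the negative mean pseudo-log-likelihood.
    return math.exp(-log_likelihood / len(sequence))


def toy_probs(sequence: str, p_true: float) -> List[List[float]]:
    """Hypothetical stand-in for model output (not a real model).

    Puts probability `p_true` on the true residue at every position and
    spreads the remainder uniformly over the other 19 residues.
    """
    rest = (1.0 - p_true) / (len(AMINO_ACIDS) - 1)
    return [[p_true if aa == r else rest for aa in AMINO_ACIDS] for r in sequence]


seq = "MKTAYIAK"  # arbitrary example sequence
confident = pseudo_perplexity(seq, toy_probs(seq, 0.5))   # model fairly sure
uniform = pseudo_perplexity(seq, toy_probs(seq, 0.05))    # model has no preference
print(confident, uniform)
```

In this toy setting a model that puts probability 0.5 on every true residue yields a pseudo-perplexity of about 2, while a uniform model over 20 residues yields about 20. Lower values mean the model is less uncertain about the sequence, which is the sense in which the measure can be read as a fitness estimate.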
