Pseudo-perplexity in One Fell Swoop for Protein Fitness Estimation

Pranav Kantroo, Günter P. Wagner, Benjamin B. Machta
{"title":"Pseudo-perplexity in One Fell Swoop for Protein Fitness Estimation","authors":"Pranav Kantroo, Günter P. Wagner, Benjamin B. Machta","doi":"arxiv-2407.07265","DOIUrl":null,"url":null,"abstract":"Protein language models trained on the masked language modeling objective\nlearn to predict the identity of hidden amino acid residues within a sequence\nusing the remaining observable sequence as context. They do so by embedding the\nresidues into a high dimensional space that encapsulates the relevant\ncontextual cues. These embedding vectors serve as an informative\ncontext-sensitive representation that not only aids with the defined training\nobjective, but can also be used for other tasks by downstream models. We\npropose a scheme to use the embeddings of an unmasked sequence to estimate the\ncorresponding masked probability vectors for all the positions in a single\nforward pass through the language model. This One Fell Swoop (OFS) approach\nallows us to efficiently estimate the pseudo-perplexity of the sequence, a\nmeasure of the model's uncertainty in its predictions, that can also serve as a\nfitness estimate. We find that ESM2 OFS pseudo-perplexity performs nearly as\nwell as the true pseudo-perplexity at fitness estimation, and more notably it\ndefines a new state of the art on the ProteinGym Indels benchmark. The strong\nperformance of the fitness measure prompted us to investigate if it could be\nused to detect the elevated stability reported in reconstructed ancestral\nsequences. We find that this measure ranks ancestral reconstructions as more\nfit than extant sequences. Finally, we show that the computational efficiency\nof the technique allows for the use of Monte Carlo methods that can rapidly\nexplore functional sequence space.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"36 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - QuanBio - Genomics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.07265","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Protein language models trained on the masked language modeling objective learn to predict the identity of hidden amino acid residues within a sequence using the remaining observable sequence as context. They do so by embedding the residues into a high-dimensional space that encapsulates the relevant contextual cues. These embedding vectors serve as an informative context-sensitive representation that not only aids with the defined training objective, but can also be used for other tasks by downstream models. We propose a scheme to use the embeddings of an unmasked sequence to estimate the corresponding masked probability vectors for all the positions in a single forward pass through the language model. This One Fell Swoop (OFS) approach allows us to efficiently estimate the pseudo-perplexity of the sequence, a measure of the model's uncertainty in its predictions, which can also serve as a fitness estimate. We find that ESM2 OFS pseudo-perplexity performs nearly as well as the true pseudo-perplexity at fitness estimation, and more notably it defines a new state of the art on the ProteinGym Indels benchmark. The strong performance of the fitness measure prompted us to investigate whether it could be used to detect the elevated stability reported in reconstructed ancestral sequences. We find that this measure ranks ancestral reconstructions as more fit than extant sequences. Finally, we show that the computational efficiency of the technique allows for the use of Monte Carlo methods that can rapidly explore functional sequence space.
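Concretely, the OFS estimate replaces the per-position masking loop of standard pseudo-perplexity with a single unmasked forward pass, reading the model's output distribution at every position directly. The sketch below illustrates the idea with the Hugging Face `transformers` ESM2 checkpoint; the model name and both helper functions are illustrative assumptions rather than the authors' released implementation, and the paper's exact OFS scheme (which works from the unmasked embeddings) may differ in detail.

```python
# Minimal sketch: OFS vs. masked pseudo-perplexity with an ESM2 checkpoint.
# Assumptions: Hugging Face `transformers` ESM2 weights; helper names are ours.
import math
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

MODEL_NAME = "facebook/esm2_t33_650M_UR50D"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = EsmForMaskedLM.from_pretrained(MODEL_NAME).eval()


@torch.no_grad()
def ofs_pseudo_perplexity(seq: str) -> float:
    """One forward pass over the unmasked sequence; the per-position output
    distributions stand in for the masked predictions."""
    enc = tokenizer(seq, return_tensors="pt")
    logits = model(**enc).logits[0]                  # (L + 2, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    ids = enc["input_ids"][0]
    # Log-probability of the true residue at each position, excluding BOS/EOS.
    token_lp = log_probs[torch.arange(len(ids)), ids][1:-1]
    return float(torch.exp(-token_lp.mean()))


@torch.no_grad()
def masked_pseudo_perplexity(seq: str) -> float:
    """Reference computation: mask each position in turn (L forward passes)."""
    enc = tokenizer(seq, return_tensors="pt")
    ids = enc["input_ids"][0]
    total_lp, n = 0.0, 0
    for i in range(1, len(ids) - 1):                 # skip BOS/EOS
        masked_ids = enc["input_ids"].clone()
        masked_ids[0, i] = tokenizer.mask_token_id
        logits = model(input_ids=masked_ids,
                       attention_mask=enc["attention_mask"]).logits[0]
        total_lp += torch.log_softmax(logits[i], dim=-1)[ids[i]].item()
        n += 1
    return math.exp(-total_lp / n)
```

The contrast also makes the cost difference explicit: the masked reference needs one forward pass per residue, while OFS needs one pass per sequence, which is what makes Monte Carlo exploration of sequence space practical.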