Masked inverse folding with sequence transfer for protein representation learning.

IF 2.6 | Zone 4 (Biology) | JCR Q3 (Biochemistry & Molecular Biology) | Protein Engineering Design & Selection | Pub Date: 2023-01-21 | DOI: 10.1093/protein/gzad015
Kevin K Yang, Niccolò Zanichelli, Hugh Yeh

Abstract

Self-supervised pretraining on protein sequences has led to state-of-the-art performance on protein function and fitness prediction. However, sequence-only methods ignore the rich information contained in experimental and predicted protein structures. Meanwhile, inverse folding methods reconstruct a protein's amino-acid sequence given its structure, but do not take advantage of sequences that do not have known structures. In this study, we train a masked inverse folding protein masked language model parameterized as a structured graph neural network. During pretraining, this model learns to reconstruct corrupted sequences conditioned on the backbone structure. We then show that using the outputs from a pretrained sequence-only protein masked language model as input to the inverse folding model further improves pretraining perplexity. We evaluate both of these models on downstream protein engineering tasks and analyze the effect of using information from experimental or predicted structures on performance.
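
As a rough illustration of the pretraining objective described above, the following sketch corrupts a sequence by masking residues and scores a predictor by its perplexity on the masked positions. It is not the authors' implementation: the mask token, the 15% mask rate, the example sequence, and the `uniform_model` stand-in (which ignores the backbone entirely) are all assumptions made for illustration. The model in the paper is a structured graph neural network conditioned on backbone structure, which is not reproduced here.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "#"  # hypothetical mask token, chosen for illustration


def corrupt(sequence, mask_rate=0.15, seed=0):
    """Mask a random subset of residues (mask_rate is an assumed value).

    Returns the corrupted sequence and the sorted masked positions.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(sequence)))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    chars = list(sequence)
    for i in positions:
        chars[i] = MASK
    return "".join(chars), positions


def uniform_model(corrupted, backbone=None):
    """Stand-in predictor: a uniform distribution at every masked position.

    A real masked inverse folding model would use `backbone` here.
    """
    uniform = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    return {i: uniform for i, c in enumerate(corrupted) if c == MASK}


def perplexity(true_sequence, positions, predicted_probs):
    """exp(mean negative log-likelihood) of the true residues at masked positions."""
    nll = [-math.log(predicted_probs[i][true_sequence[i]]) for i in positions]
    return math.exp(sum(nll) / len(nll))


if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary example sequence
    corrupted, positions = corrupt(seq)
    probs = uniform_model(corrupted)
    # Close to 20 for a uniform guess over 20 amino acids
    print(corrupted, perplexity(seq, positions, probs))
```

A uniform guess gives a perplexity of about 20, the worst-case baseline for a 20-letter alphabet. A model that uses structural context, or the outputs of a pretrained sequence-only model as the abstract describes, would be expected to score lower.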

Source journal: Protein Engineering Design & Selection (Biology: Biochemistry & Molecular Biology)
CiteScore: 3.30
Self-citation rate: 4.20%
Articles published: 14
Review time: 6-12 weeks
About the journal: Protein Engineering, Design and Selection (PEDS) publishes high-quality research papers and review articles relevant to the engineering, design and selection of proteins for use in biotechnology and therapy, and for understanding the fundamental link between protein sequence, structure, dynamics, function, and evolution.