Masked inverse folding with sequence transfer for protein representation learning.

Protein Engineering Design & Selection (IF 3.4; Zone 4, Biology; Q3, Biochemistry & Molecular Biology). Pub Date: 2023-01-21. DOI: 10.1093/protein/gzad015

Kevin K Yang, Niccolò Zanichelli, Hugh Yeh

Abstract

Self-supervised pretraining on protein sequences has led to state-of-the-art performance on protein function and fitness prediction. However, sequence-only methods ignore the rich information contained in experimental and predicted protein structures. Meanwhile, inverse folding methods reconstruct a protein's amino-acid sequence given its structure, but do not take advantage of sequences that do not have known structures. In this study, we train a masked inverse folding protein masked language model parameterized as a structured graph neural network. During pretraining, this model learns to reconstruct corrupted sequences conditioned on the backbone structure. We then show that using the outputs from a pretrained sequence-only protein masked language model as input to the inverse folding model further improves pretraining perplexity. We evaluate both of these models on downstream protein engineering tasks and analyze the effect of using information from experimental or predicted structures on performance.
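To make the pretraining objective concrete, the sketch below shows one way to corrupt a sequence by masking and to score a model by perplexity over the masked positions. It is an illustration only, not the authors' implementation: the mask symbol, the 15% mask rate, the example sequence, and the `uniform_model` placeholder (which stands in for the structure-conditioned graph neural network and ignores the backbone) are all assumptions that do not come from the abstract.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "#"  # hypothetical mask symbol, chosen for this sketch


def corrupt(sequence, mask_rate=0.15, rng=None):
    """Mask a random subset of positions (mask_rate is an assumed value).

    Returns the corrupted sequence and the sorted list of masked indices.
    """
    rng = rng or random.Random(0)
    n_mask = max(1, round(mask_rate * len(sequence)))
    masked = sorted(rng.sample(range(len(sequence)), n_mask))
    chars = list(sequence)
    for i in masked:
        chars[i] = MASK
    return "".join(chars), masked


def uniform_model(corrupted, backbone=None):
    """Placeholder predictor: uniform distribution at every position.

    A real masked inverse folding model would condition on `backbone`.
    """
    p = 1.0 / len(AMINO_ACIDS)
    return [{aa: p for aa in AMINO_ACIDS} for _ in corrupted]


def perplexity(probs, sequence, masked):
    """exp of the mean negative log-likelihood of the true residues
    at the masked positions."""
    nll = -sum(math.log(probs[i][sequence[i]]) for i in masked) / len(masked)
    return math.exp(nll)


if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary example sequence
    corrupted, masked = corrupt(seq)
    probs = uniform_model(corrupted)
    print(corrupted, perplexity(probs, seq, masked))
```

A uniform predictor gives a perplexity equal to the alphabet size (20), which is the baseline any trained model should beat; lower perplexity means the model assigns more probability to the true residues.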
