The Power of Universal Contextualized Protein Embeddings in Cross-species Protein Function Prediction.

Evolutionary Bioinformatics (IF 1.7; Zone 4, Biology; JCR Q4, Evolutionary Biology). Pub Date: 2021-12-03. eCollection Date: 2021-01-01. DOI: 10.1177/11769343211062608

Irene van den Bent, Stavros Makrodimitris, Marcel Reinders

Volume 17, article 11769343211062608. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8647222/pdf/

Citations: 0

Abstract



Computationally annotating proteins with a molecular function is a difficult problem that is made even harder due to the limited amount of available labeled protein training data. Unsupervised protein embeddings partly circumvent this limitation by learning a universal protein representation from many unlabeled sequences. Such embeddings incorporate contextual information of amino acids, thereby modeling the underlying principles of protein sequences insensitive to the context of species. We used an existing pre-trained protein embedding method and subjected its molecular function prediction performance to detailed characterization, first to advance the understanding of protein language models, and second to determine areas of improvement. Then, we applied the model in a transfer learning task by training a function predictor based on the embeddings of annotated protein sequences of one training species and making predictions on the proteins of several test species with varying evolutionary distance. We show that this approach successfully generalizes knowledge about protein function from one eukaryotic species to various other species, outperforming both an alignment-based and a supervised-learning-based baseline. This implies that such a method could be effective for molecular function prediction in inadequately annotated species from understudied taxonomic kingdoms.
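The cross-species transfer setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: random synthetic vectors stand in for the pre-trained protein embeddings, a plain multi-label logistic regression stands in for the function predictor, and every name here (`make_species`, `train_function_predictor`, `predict_scores`, the dimensions, and the `shift` parameter that mimics a between-species difference) is a hypothetical choice made only for illustration.

```python
import numpy as np


def sigmoid(z):
    # Clip logits so np.exp cannot overflow.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


def make_species(rng, n_proteins, term_directions, shift):
    """Synthetic stand-in for one species: (embeddings, function labels).

    Each function term adds a fixed direction to the embedding; `shift`
    is an assumed species-specific offset, not something from the paper.
    """
    n_terms, dim = term_directions.shape
    labels = (rng.random((n_proteins, n_terms)) < 0.3).astype(float)
    noise = 0.5 * rng.standard_normal((n_proteins, dim))
    embeddings = labels @ term_directions + shift + noise
    return embeddings, labels


def train_function_predictor(X, Y, lr=0.1, epochs=300, l2=1e-3):
    """Fit one logistic-regression classifier per function term."""
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    b = np.zeros(Y.shape[1])
    for _ in range(epochs):
        grad = (sigmoid(X @ W + b) - Y) / n  # gradient of mean log-loss
        W -= lr * (X.T @ grad + l2 * W)
        b -= lr * grad.sum(axis=0)
    return W, b


def predict_scores(X, W, b):
    """Per-protein, per-term probabilities in [0, 1]."""
    return sigmoid(X @ W + b)


rng = np.random.default_rng(0)
dim, n_terms = 32, 5
term_directions = rng.standard_normal((n_terms, dim))

# Train on the annotated "training species" only.
X_train, Y_train = make_species(rng, 400, term_directions, shift=0.0)
W, b = train_function_predictor(X_train, Y_train)

# Predict on a different "test species" with slightly shifted embeddings.
X_test, Y_test = make_species(rng, 200, term_directions, shift=0.1)
scores = predict_scores(X_test, W, b)
accuracy = float(((scores > 0.5) == (Y_test > 0.5)).mean())
print(f"per-label accuracy on the held-out species: {accuracy:.3f}")
```

The point of the sketch is the data split: the predictor never sees labels from the test species, so any accuracy it reaches there comes from the shared embedding space. In a real setting the synthetic vectors would be replaced by embeddings from a pre-trained protein language model and the labels by curated function annotations.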


Source journal

Evolutionary Bioinformatics (Biology: Evolutionary Biology)

CiteScore: 4.20

Self-citation rate: 0.00%

Articles published: not given

Review time: 12 months

About the journal: Evolutionary Bioinformatics is an open access, peer-reviewed international journal focusing on evolutionary bioinformatics. The journal aims to support understanding of organismal form and function through the use of molecular, genetic, genomic and proteomic data, giving due consideration to its evolutionary context.