ProGen2: Exploring the boundaries of protein language models
Erik Nijkamp, Jeffrey A. Ruffolo, Eli N. Weinstein, Nikhil Naik, Ali Madani
Cell Systems, 2023-11-15 (published online 2023-10-30). DOI: 10.1016/j.cels.2023.10.002
Citations: 0
Abstract
Attention-based models trained on protein sequences have demonstrated considerable success on classification and generation tasks relevant to artificial-intelligence-driven protein design. However, we still lack a sufficient understanding of how very large models and datasets contribute to effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. As large model sizes and raw numbers of protein sequences become more widely accessible, our results suggest that growing emphasis should be placed on the data distribution provided to a protein sequence model. Our models and code are open-sourced for widespread adoption in protein engineering. A record of this paper's Transparent Peer Review process is included in the supplemental information.
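To make the zero-shot fitness prediction and generation claims concrete, the sketch below scores candidate variants by their log-likelihood under an autoregressive protein language model and samples a novel sequence. It is a minimal sketch, assuming a HuggingFace-compatible port of a ProGen2 checkpoint is available; the model identifier, the example sequences, and the `log_likelihood` helper are illustrative assumptions, not the paper's official interface.

```python
# Minimal sketch: zero-shot fitness scoring and sequence generation with a
# causal protein language model, as described in the abstract.
# ASSUMPTION: a HuggingFace-compatible ProGen2 port; the checkpoint name
# below is illustrative, not an official identifier from the paper.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "hugohrban/progen2-small"  # assumed community port of ProGen2

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

def log_likelihood(sequence: str) -> float:
    """Total log-probability of a sequence under the autoregressive model."""
    ids = tokenizer(sequence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids, HF causal LMs shift internally and return the
        # mean cross-entropy over the (len - 1) predicted positions.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)  # mean NLL -> total log-prob

# Rank two hypothetical variants: higher log-likelihood is used as a proxy
# for higher fitness, with no task-specific fine-tuning.
wild_type = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy sequence for illustration
variant   = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA"
for name, seq in [("wild_type", wild_type), ("variant", variant)]:
    print(name, log_likelihood(seq))

# Sampling a novel sequence (the generation use case from the abstract).
prompt = tokenizer("M", return_tensors="pt").input_ids
out = model.generate(prompt, max_length=64, do_sample=True,
                     top_p=0.95, temperature=0.8)
print(tokenizer.decode(out[0]))
```

Ranking variants by model likelihood in this way requires no labels or fine-tuning, which is what the abstract means by predicting protein fitness zero-shot; the released checkpoints and exact tokenization are documented in the open-sourced code.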