Comparative Analysis of Deep Generative Model for Industrial Enzyme Design

IF 2.9 | Zone 3, Biology | Q3, Biochemical Research Methods | Current Bioinformatics | Pub Date: 2024-04-16 | DOI: 10.2174/0115748936303223240404043202

Beibei Zhang, Qiaozhen Meng, Chengwei Ai, Guihua Duan, Ercheng Wang, Fei Guo


Citations: 0

Abstract

Although enzymes are efficient catalysts, natural enzymes lack stability in industrial environments and in some cases cannot catalyze the required reactions. This creates an urgent need to design new enzymes de novo. Computational design is a powerful tool, allowing rapid and efficient exploration of sequence space and facilitating the design of novel enzymes tailored to specific conditions and requirements. Designing industrial enzymes de novo with computational methods is therefore beneficial. Currently, the only tool designed explicitly for enzyme generation performs unsatisfactorily. We therefore selected several general protein sequence design tools and systematically evaluated their effectiveness when applied to specific industrial enzymes. We surveyed the literature on protein generation and grouped the computational methods used for sequence generation into three categories: structure-conditional sequence generation, sequence generation without structural constraints, and co-generation of sequence and structure. To evaluate the ability of six computational tools to generate enzyme sequences, we first constructed a luciferase dataset named Luc_64. We then assessed the quality of the enzyme sequences these methods generated on this dataset, using criteria that include amino acid distribution and EC number validation. We also assessed sequences generated by structure-based methods on existing public datasets, using sequence recovery rate for the sequence perspective and root-mean-square deviation (RMSD) for the structure perspective.
On the functionality dataset Luc_64, ABACUS-R and ProteinMPNN stood out for producing sequences with amino acid distributions and functionalities closely matching those of naturally occurring luciferase enzymes, suggesting their effectiveness in preserving essential enzymatic characteristics. Across both benchmark datasets, ABACUS-R and ProteinMPNN also exhibited the highest sequence recovery rates, indicating their superior ability to generate sequences that closely match those of the original enzymes. Our study provides a reference for researchers selecting enzyme sequence design tools, highlighting the strengths and limitations of each tool in generating accurate and functional enzyme sequences. ProteinMPNN and ABACUS-R emerged as the most effective tools in our evaluation, performing well on both sequence recovery and RMSD and preserving the functional integrity of enzymes through accurate amino acid distributions. In addition, our industrial enzyme benchmark provides a fair evaluation of how well general protein design tools transfer to specific industrial enzymes.
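
The abstract names sequence recovery rate, RMSD, and amino acid distribution as evaluation measures but does not define them on this page. The following is a minimal sketch under commonly used definitions, which may differ from the authors' exact procedure. All function names are hypothetical. The RMSD function assumes the two coordinate sets are already superposed; in practice structures are usually aligned first.

```python
import math
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_recovery(native: str, designed: str) -> float:
    """Fraction of positions where the designed residue equals the native one.

    Assumes fixed-backbone design, so both sequences have the same length.
    """
    if len(native) != len(designed):
        raise ValueError("sequences must have equal length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)


def aa_distribution(sequences):
    """Relative frequency of each standard amino acid over a set of sequences."""
    counts = Counter(
        res for seq in sequences for res in seq if res in AMINO_ACIDS
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def rmsd(coords_a, coords_b) -> float:
    """RMSD between two equally sized lists of (x, y, z) points.

    Assumption: the coordinates are already superposed (no alignment is done).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in size")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    # Toy inputs for illustration only; they are not taken from Luc_64.
    print(sequence_recovery("MKTAYIAK", "MKSAYLAK"))
    print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)]))
    print(aa_distribution(["AAC"])["A"])
```

Comparing the distribution from `aa_distribution` for generated sequences against the one for natural sequences gives a simple view of how closely a tool reproduces natural residue usage.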


Source journal

Current Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 6.60

Self-citation rate: 2.50%

Articles published:

Review time: >12 weeks

Journal description: Current Bioinformatics aims to publish the latest and most outstanding developments in bioinformatics. Each issue contains timely in-depth reviews and mini-reviews, research papers, and guest-edited thematic issues written by leaders in the field, covering a wide range of work on the integration of biology with computer and information science. The journal focuses on advances in computational molecular and structural biology, encompassing areas such as computing in biomedicine and genomics, computational proteomics and systems biology, and metabolic pathway engineering. Developments in these fields have direct implications for key issues in health care, medicine, genetic disorders, the development of agricultural products, renewable energy, environmental protection, and related areas.