Descriptor-augmented machine learning for enzyme-chemical interaction predictions

Synthetic and Systems Biotechnology (IF 4.4; Zone 2, Biology; JCR Q1, BIOTECHNOLOGY & APPLIED MICROBIOLOGY) Pub Date: 2024-02-28 DOI: 10.1016/j.synbio.2024.02.006

Yilei Han, Haoye Zhang, Zheni Zeng, Zhiyuan Liu, Diannan Lu, Zheng Liu


Citations: 0

Abstract

Descriptors play a pivotal role in enzyme design for the greener synthesis of biochemicals, as they can characterize enzymes and chemicals from physicochemical and evolutionary perspectives. This study examined the effects of various descriptors on the performance of a Random Forest model used to predict enzyme-chemical relationships. We curated activity data for seven specific enzyme families from the literature and developed a pipeline for evaluating machine learning model performance using 10-fold cross-validation. The influence of protein and chemical descriptors was assessed in three scenarios: predicting the activity of unknown relations between known enzymes and known chemicals (new relationship evaluation), predicting the activity of novel enzymes on known chemicals (new enzyme evaluation), and predicting the activity of new chemicals on known enzymes (new chemical evaluation). The results showed that protein descriptors significantly enhanced the classification performance of the model in the new enzyme evaluation in three of the seven datasets, those with the greatest number of enzymes, whereas chemical descriptors appeared to have no effect. A variety of sequence-based and structure-based protein descriptors were constructed, among which the esm-2 descriptor achieved the best results. Using enzyme families as labels showed that the descriptors could cluster proteins well, which could explain their contribution to the machine learning model. Conversely, in the new chemical evaluation, chemical descriptors brought significant improvement in four of the seven datasets, while protein descriptors appeared to have no effect. We attempted to evaluate the generalization ability of the model by correlating the statistics of the datasets with the performance of the models. The results showed that datasets with higher sequence similarity were more likely to achieve better results in the new enzyme evaluation, and datasets with more enzymes were more likely to benefit from the protein descriptor strategy. This work provides guidance for the development of machine learning models for specific enzyme families.
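
The three evaluation scenarios described in the abstract differ mainly in how the data are split into folds. The following is a minimal sketch of that idea in Python, not the authors' pipeline: the function name `k_fold_splits`, the `(enzyme_id, chemical_id)` pair layout, and the round-robin fold assignment are all assumptions made for illustration. In the "enzyme" and "chemical" scenarios, whole enzymes or whole chemicals are held out, so the test fold contains only entities never seen in training.

```python
import random
from collections import defaultdict


def k_fold_splits(pairs, k=10, scenario="relationship", seed=0):
    """Yield (train_idx, test_idx) index lists for one evaluation scenario.

    pairs    : list of (enzyme_id, chemical_id) tuples, one per measurement
               (hypothetical layout, assumed for this sketch).
    scenario : "relationship" -> hold out individual enzyme-chemical pairs
               "enzyme"       -> hold out whole enzymes (new enzyme evaluation)
               "chemical"     -> hold out whole chemicals (new chemical evaluation)
    """
    if scenario == "relationship":
        key = lambda i: i            # every pair is its own group
    elif scenario == "enzyme":
        key = lambda i: pairs[i][0]  # group pairs by enzyme
    elif scenario == "chemical":
        key = lambda i: pairs[i][1]  # group pairs by chemical
    else:
        raise ValueError(f"unknown scenario: {scenario!r}")

    # Collect the pair indices belonging to each group.
    members = defaultdict(list)
    for i in range(len(pairs)):
        members[key(i)].append(i)

    group_ids = list(members)
    if k > len(group_ids):
        raise ValueError("k exceeds the number of groups to split")

    # Shuffle the groups reproducibly, then deal them out to folds round-robin.
    random.Random(seed).shuffle(group_ids)
    for fold in range(k):
        test = sorted(i for g in group_ids[fold::k] for i in members[g])
        held_out = set(test)
        train = [i for i in range(len(pairs)) if i not in held_out]
        yield train, test


if __name__ == "__main__":
    # Toy data: 4 enzymes x 3 chemicals (placeholder identifiers).
    toy = [(e, c) for e in ["E1", "E2", "E3", "E4"] for c in ["C1", "C2", "C3"]]
    for train, test in k_fold_splits(toy, k=4, scenario="enzyme"):
        print(sorted({toy[i][0] for i in test}), len(train), len(test))
```

A classifier such as the Random Forest mentioned in the abstract would then be fitted on the `train` indices and scored on the `test` indices of each fold. Libraries such as scikit-learn provide comparable grouped splitting (for example `GroupKFold`); the standard-library version here is only meant to make the distinction between the scenarios concrete.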


Source journal

Synthetic and Systems Biotechnology (BIOTECHNOLOGY & APPLIED MICROBIOLOGY)

CiteScore: 6.90

Self-citation rate: 12.50%

Review time: 67 days

About the journal: Synthetic and Systems Biotechnology aims to promote the communication of original research in synthetic and systems biology, with strong emphasis on applications towards biotechnology. It is a quarterly peer-reviewed journal led by Editor-in-Chief Lixin Zhang. The journal publishes high-quality research focusing on integrative approaches that enable the understanding and design of biological systems, and research that develops the application of systems and synthetic biology to natural systems. It publishes Articles, Short notes, Methods, Mini Reviews, Commentary and Conference reviews.