Use of tree-based machine learning methods to screen affinitive peptides based on docking data.

IF 2.8 · Zone 4 (Medicine) · Q3 CHEMISTRY, MEDICINAL · Molecular Informatics · Pub Date: 2023-12-01 · Epub Date: 2023-11-09 · DOI: 10.1002/minf.202300143
Hua Feng, Fangyu Wang, Ning Li, Qian Xu, Guanming Zheng, Xuefeng Sun, Man Hu, Xuewu Li, Guangxu Xing, Gaiping Zhang

Abstract

Screening peptides with good affinity is an important step in peptide-drug discovery. Recent advances in computer and data science have made machine learning a useful tool for accurately screening affinitive peptides. In the current study, four tree-based algorithms, Classification and Regression Trees (CART), the C5.0 decision tree (C50), Bagged CART (BAG) and Random Forest (RF), were employed to explore the relationship between experimental peptide affinities and virtual docking data, and the performance of the models was compared in parallel. All four algorithms performed better on the dataset that was pre-scaled, centered and PCA-transformed than on the datasets pre-processed in other ways. After model rebuilding and hyperparameter optimization, the optimal C50 model (C50O) showed the best performance in terms of Accuracy, Kappa, Sensitivity, Specificity, F1, MCC and AUC when validated on the test data and evaluated on an unknown PEDV dataset (Accuracy = 80.4%). BAG and RFO (the optimal RF), the two best models during training, did not perform as expected in the test and unknown-dataset validations. Furthermore, the high correlation of the predictions of RFO and BAG with those of C50O implied high stability and robustness of their predictions. In contrast, despite its good performance on the unknown dataset, the poor performance of CARTO in test-data validation and correlation analysis indicated that it could not be used for future data prediction. To accurately evaluate peptide affinity, the current study presented, for the first time, a tree-model competition on affinitive peptide prediction using virtual docking data, which would expand the application of machine learning algorithms in studying PepPIs and benefit the development of peptide therapeutics.
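
For readers unfamiliar with the evaluation metrics named in the abstract, the following is a minimal sketch of how Accuracy, Kappa, Sensitivity, Specificity, F1 and MCC can be computed from a binary confusion matrix. It is not the authors' code: the function name `binary_metrics` and the example counts are hypothetical, chosen only for illustration. AUC is left out because it needs ranked prediction scores rather than a single confusion matrix. The sketch assumes each class, and each predicted label, occurs at least once, so no denominator is zero.

```python
from math import sqrt


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Threshold-based metrics from a 2x2 confusion matrix.

    tp/fn: affinitive peptides predicted as affinitive / non-affinitive
    tn/fp: non-affinitive peptides predicted as non-affinitive / affinitive
    """
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (accuracy - p_chance) / (1 - p_chance)

    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )

    return {
        "Accuracy": accuracy,
        "Kappa": kappa,
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "F1": f1,
        "MCC": mcc,
    }


# Hypothetical counts, not taken from the study.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting Kappa and MCC alongside Accuracy is useful when the affinitive and non-affinitive classes are imbalanced, since both correct for agreement that a trivial classifier could reach by chance.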

Source journal: Molecular Informatics
Categories: CHEMISTRY, MEDICINAL; MATHEMATICAL & COMPUTATIONAL BIOLOGY
CiteScore: 7.30
Self-citation rate: 2.80%
Articles published: 70
Review time: 3 months
About the journal: Molecular Informatics is a peer-reviewed, international forum for publication of high-quality, interdisciplinary research on all molecular aspects of bio/cheminformatics and computer-assisted molecular design. Molecular Informatics succeeded QSAR & Combinatorial Science in 2010. Molecular Informatics presents methodological innovations that will lead to a deeper understanding of ligand-receptor interactions, macromolecular complexes, molecular networks, design concepts and processes that demonstrate how ideas and design concepts lead to molecules with a desired structure or function, preferably including experimental validation. The journal's scope includes but is not limited to the fields of drug discovery and chemical biology, protein and nucleic acid engineering and design, the design of nanomolecular structures, strategies for modeling of macromolecular assemblies, molecular networks and systems, pharmaco- and chemogenomics, computer-assisted screening strategies, as well as novel technologies for the de novo design of biologically active molecules. As a unique feature, Molecular Informatics publishes so-called "Methods Corner" review-type articles which feature important technological concepts and advances within the scope of the journal.