Haiyang Ye, Yunyi Zhang, Zilong Li, Yue Peng, Peng Zhou
Journal: Chemometrics and Intelligent Laboratory Systems, volume 252, Article 105191
Published: 2024-07-31
DOI: 10.1016/j.chemolab.2024.105191
URL: https://www.sciencedirect.com/science/article/pii/S016974392400131X
Impact factor: 3.7 (JCR Q2, Automation & Control Systems)
Citations: 0
Comprehensive evaluation and systematic comparison of Gaussian process (GP) modelling applications in peptide quantitative structure-activity relationship
Peptide quantitative structure-activity relationship (pQSAR) modelling is a specific extension of traditional QSAR from small-molecule drugs to bioactive peptides. Because peptides are linear biopolymers that differ fundamentally from small-molecule compounds in structural features such as ordered sequence, large size and intrinsic flexibility, the pQSAR methodology (including structural characterization and regression modelling) warrants further development relative to traditional QSAR. The Gaussian process (GP) is a pioneering Bayesian machine learning (ML) approach to hybrid linear/nonlinear regression problems in intricate domains. However, the application of GP regression to QSAR, and to pQSAR in particular, remains largely unexplored. In this work, we carried out a comprehensive pQSAR study with GP regression modelling, aiming at an in-depth evaluation of GP performance under different structural characterizations and a systematic comparison of GP with other routine ML methods. We curated two distinct classes of peptide datasets, comprising 12 panels of sophisticated benchmarks and 46 panels of extended samples respectively, which together contain 8804 peptide samples and yielded 522 regression models. Our study indicates that GP generally provides an effective solution for many pQSAR problems and has the potential to promote ML regression modelling in this area; its performance is comparable with or even better than that of widely used methods on both the sophisticated benchmarks and the extended samples. In addition, GP has several advantages over traditional ML methods, such as hyperparameter self-consistency, overfitting resistance, interpretable output and estimable uncertainty.
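To make the "estimable uncertainty" property concrete, the following is a minimal sketch of GP regression in plain Python. It is not the authors' implementation: the squared-exponential kernel, the hand-fixed hyperparameters (`length_scale`, `signal_var`, `noise_var`), the function names and the toy two-dimensional "descriptors" are all assumptions made for illustration, and the data are invented rather than taken from the paper.

```python
import math

def rbf_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance between two descriptor vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return signal_var * math.exp(-0.5 * sq_dist / length_scale ** 2)

def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def forward_solve(lower, rhs):
    """Solve L z = rhs for lower-triangular L."""
    z = []
    for i, row in enumerate(lower):
        z.append((rhs[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    return z

def backward_solve(lower, rhs):
    """Solve L^T z = rhs for lower-triangular L."""
    n = len(lower)
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(lower[k][i] * z[k] for k in range(i + 1, n))
        z[i] = (rhs[i] - s) / lower[i][i]
    return z

def gp_predict(train_x, train_y, test_x, noise_var=1e-2):
    """Posterior mean and latent variance of a zero-mean GP at one test point."""
    n = len(train_x)
    # Gram matrix of the training descriptors plus observation noise on the diagonal.
    gram = [[rbf_kernel(train_x[i], train_x[j]) + (noise_var if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    lower = cholesky(gram)
    alpha = backward_solve(lower, forward_solve(lower, train_y))
    k_star = [rbf_kernel(x, test_x) for x in train_x]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = forward_solve(lower, k_star)
    variance = rbf_kernel(test_x, test_x) - sum(e * e for e in v)
    return mean, variance

# Invented toy data: four "peptides" with two descriptors each and one activity value.
train_x = [[0.0, 0.1], [0.5, 0.4], [1.0, 0.9], [1.5, 1.2]]
train_y = [0.2, 0.7, 1.1, 0.9]

near_mean, near_var = gp_predict(train_x, train_y, [0.5, 0.4])  # at a training point
far_mean, far_var = gp_predict(train_x, train_y, [5.0, 5.0])    # far from all data
print(f"near: mean={near_mean:.3f}, var={near_var:.4f}")
print(f"far:  mean={far_mean:.3f}, var={far_var:.4f}")
```

The predictive variance is small at a query that coincides with a training sample and reverts to the prior variance far from the data, which is the kind of per-prediction uncertainty a point-estimate regressor does not supply. In practice the hyperparameters would be fitted (for example by maximizing the marginal likelihood) rather than fixed by hand as they are here.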
About the journal:
Chemometrics and Intelligent Laboratory Systems publishes original research papers, short communications, reviews, tutorials and Original Software Publications reporting on the development of novel statistical, mathematical, or computer techniques in Chemistry and related disciplines.
Chemometrics is the chemical discipline that uses mathematical and statistical methods to design or select optimal procedures and experiments, and to provide maximum chemical information by analysing chemical data.
The journal deals with the following topics:
1) Development of new statistical, mathematical and chemometrical methods for Chemistry and related fields (Environmental Chemistry, Biochemistry, Toxicology, Systems Biology, -Omics, etc.)
2) Novel applications of chemometrics to all branches of Chemistry and related fields (typical domains of interest are: process data analysis, experimental design, data mining, signal processing, supervised modelling, decision making, robust statistics, mixture analysis, multivariate calibration, etc.). Routine applications of established chemometrical techniques will not be considered.
3) Development of new software that provides novel tools or truly advances the use of chemometrical methods.
4) Well-characterized data sets for testing the performance of new methods and software.
The journal complies with the International Committee of Medical Journal Editors' Uniform Requirements for Manuscripts.