Journal of Cheminformatics: Latest Articles

A comprehensive comparison of deep learning-based compound-target interaction prediction models to unveil guiding design principles
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-10-28 | DOI: 10.1186/s13321-024-00913-1
Sina Abdollahi, Darius P. Schaub, Madalena Barroso, Nora C. Laubach, Wiebke Hutwelker, Ulf Panzer, Søren W. Gersting, Stefan Bonn

The evaluation of compound-target interactions (CTIs) is at the heart of drug discovery efforts. Given the substantial time and monetary costs of classical experimental screening, significant efforts have been dedicated to developing deep learning-based models that can accurately predict CTIs. A comprehensive comparison of these models on a large, curated CTI dataset is, however, still lacking. Here, we perform an in-depth comparison of 12 state-of-the-art deep learning architectures that use different protein and compound representations. The models were selected for their reported performance and architectures. To reliably compare model performance, we curated over 300 thousand binding and non-binding CTIs and established several gold-standard datasets of varying size and information. Based on our findings, DeepConv-DTI consistently outperforms other models in CTI prediction performance across the majority of datasets. It achieves an MCC of 0.6 or higher for most of the datasets and is one of the fastest models in training and inference. These results indicate that utilizing convolution-based windows as in DeepConv-DTI to traverse trainable embeddings is a highly effective approach for capturing informative protein features. We also observed that physicochemical embeddings of targets increased model performance. We therefore modified DeepConv-DTI to include normalized physicochemical properties, which resulted in the overall best-performing model, Phys-DeepConv-DTI. This work highlights how the systematic evaluation of input features of compounds and targets, as well as their corresponding neural network architectures, can serve as a roadmap for the future development of improved CTI models.

Scientific contribution

This work features comprehensive CTI datasets to allow for the objective comparison and benchmarking of CTI prediction algorithms. Based on this dataset, we gained insights into which embeddings of compounds and targets and which deep learning-based algorithms perform best, providing a blueprint for the future development of CTI algorithms. Using the insights gained from this screen, we provide a novel CTI algorithm with state-of-the-art performance.
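
The abstract above reports model quality as a Matthews correlation coefficient (MCC). As a minimal sketch of how that metric is computed from a binary confusion matrix, the snippet below uses only the Python standard library; the function name `mcc` and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when any marginal sum is zero (a common convention,
    assumed here), since the coefficient is undefined in that case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for a balanced binding / non-binding test set.
print(mcc(tp=80, tn=80, fp=20, fn=20))  # → 0.6
```

With these made-up counts, 80% accuracy on a balanced set corresponds to an MCC of 0.6, which gives a rough feel for the threshold quoted in the abstract.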

Citations: 0
Towards the prediction of drug solubility in binary solvent mixtures at various temperatures using machine learning
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-10-28 | DOI: 10.1186/s13321-024-00911-3
Zeqing Bao, Gary Tom, Austin Cheng, Jeffrey Watchorn, Alán Aspuru-Guzik, Christine Allen

Drug solubility is an important parameter in the drug development process, yet it is often tedious and challenging to measure, especially for expensive drugs or those available in small quantities. To alleviate these challenges, machine learning (ML) has been applied to predict drug solubility as an alternative approach. However, the majority of existing ML research has focused on predicting aqueous solubility and/or solubility at specific temperatures, which restricts the model applicability in pharmaceutical development. To bridge this gap, we compiled a dataset of 27,000 solubility datapoints, including solubility of small molecules measured in a range of binary solvent mixtures under various temperatures. Next, a panel of ML models was trained on this dataset with their hyperparameters tuned using Bayesian optimization. The resulting top-performing models, both gradient boosted decision trees (light gradient boosting machine and extreme gradient boosting), achieved mean absolute errors (MAE) of 0.33 for LogS (S in g/100 g) on the holdout set. These models were further validated through a prospective study, wherein the solubility of four drug molecules was predicted by the models and then validated with in-house solubility experiments. This prospective study demonstrated that the models accurately predicted the solubility of solutes in specific binary solvent mixtures under different temperatures, especially for drugs whose features closely align with those of the solutes in the dataset (MAE < 0.5 for LogS). To support future research and facilitate advancements in the field, we have made the dataset and code openly available.

Scientific contribution

Our research advances the state-of-the-art in predicting solubility for small molecules by leveraging ML and a uniquely comprehensive dataset. Unlike existing ML studies that predominantly focus on solubility in aqueous solvents at fixed temperatures, our work enables prediction of drug solubility in a variety of binary solvent mixtures over a broad temperature range, providing practical insights on the modeling of solubility for realistic pharmaceutical applications. These advancements along with the open access dataset and code support significant steps in the drug development process including new molecule discovery, drug analysis and formulation.
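
The headline metric in this abstract is the mean absolute error (MAE) on LogS. The sketch below shows how such a number could be computed; it assumes LogS is the base-10 logarithm of solubility in g/100 g (the abstract gives the unit but not the base), and the measured and predicted values are invented for illustration only.

```python
import math


def log_s(solubility_g_per_100g: float) -> float:
    """LogS, assumed here to be log10 of the solubility in g/100 g."""
    return math.log10(solubility_g_per_100g)


def mean_absolute_error(y_true: list[float], y_pred: list[float]) -> float:
    """Average of |true - predicted| over paired observations."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical measured solubilities (g/100 g) and model outputs (LogS).
measured = [1.0, 10.0, 0.1]
true_log_s = [log_s(s) for s in measured]   # roughly [0.0, 1.0, -1.0]
predicted_log_s = [0.3, 0.7, -1.4]

print(round(mean_absolute_error(true_log_s, predicted_log_s), 2))
```

Because the error is measured on a log scale, an MAE of about 0.33 means a typical prediction is off by roughly a factor of two in solubility (10^0.33 ≈ 2.1).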

Citations: 0
Graph neural processes for molecules: an evaluation on docking scores and strategies to improve generalization
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-10-23 | DOI: 10.1186/s13321-024-00904-2
Miguel García-Ortegón, Srijit Seal, Carl Rasmussen, Andreas Bender, Sergio Bacallado

Neural processes (NPs) are models for meta-learning which output uncertainty estimates. So far, most studies of NPs have focused on low-dimensional datasets of highly-correlated tasks. While these homogeneous datasets are useful for benchmarking, they may not be representative of realistic transfer learning. In particular, applications in scientific research may prove especially challenging due to the potential novelty of meta-testing tasks. Molecular property prediction is one such research area that is characterized by sparse datasets of many functions on a shared molecular space. In this paper, we study the application of graph NPs to molecular property prediction with DOCKSTRING, a diverse dataset of docking scores. Graph NPs show competitive performance in few-shot learning tasks relative to supervised learning baselines common in chemoinformatics, as well as alternative techniques for transfer learning and meta-learning. In order to increase meta-generalization to divergent test functions, we propose fine-tuning strategies that adapt the parameters of NPs. We find that adaptation can substantially increase NPs' regression performance while maintaining good calibration of uncertainty estimates. Finally, we present a Bayesian optimization experiment which showcases the potential advantages of NPs over Gaussian processes in iterative screening. Overall, our results suggest that NPs on molecular graphs hold great potential for molecular property prediction in the low-data setting.

Neural processes are a family of meta-learning algorithms which deal with data scarcity by transferring information across tasks and making probabilistic predictions. We evaluate their performance on regression and optimization molecular tasks using docking scores, finding them to outperform classical single-task and transfer-learning models. We examine the issue of generalization to divergent test tasks, which is a general concern of meta-learning algorithms in science, and propose strategies to alleviate it.

Citations: 0
MEF-AlloSite: an accurate and robust Multimodel Ensemble Feature selection for the Allosteric Site identification model
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-10-23 | DOI: 10.1186/s13321-024-00882-5
Sadettin Y. Ugurlu, David McDonald, Shan He

A crucial mechanism for controlling the actions of proteins is allostery. Allosteric modulators have the potential to provide many benefits compared to orthosteric ligands, such as increased selectivity and saturability of their effect. The identification of new allosteric sites presents prospects for the creation of innovative medications and enhances our comprehension of fundamental biological mechanisms. Allosteric sites are increasingly found in different protein families through various techniques, such as machine learning applications, which opens up possibilities for creating completely novel medications with a diverse variety of chemical structures. Machine learning methods, such as PASSer, exhibit limited efficacy in accurately finding allosteric binding sites when relying solely on 3D structural information.

Scientific Contribution

Prior to conducting feature selection for allosteric binding site identification, integrating supporting amino-acid-based information with 3D structural knowledge is advantageous. This approach can enhance performance by ensuring accuracy and robustness. Therefore, we have developed an accurate and robust model called Multimodel Ensemble Feature Selection for Allosteric Site Identification (MEF-AlloSite) after collecting 9460 relevant and diverse features from the literature to characterise pockets. The model employs an accurate and robust multimodal feature selection technique for the small training set size of only 90 proteins to improve predictive performance. This state-of-the-art technique increased the performance in allosteric binding site identification by selecting promising features from 9460 features. Analysing the relationship between the selected features and allosteric binding sites also shed light on the complex allostery of proteins. MEF-AlloSite and state-of-the-art allosteric site identification methods such as PASSer2.0 and PASSerRank have been tested on three test cases 51 times, each time with a different split of the training set. The Student’s t test and Cohen’s D value have been used to evaluate the average precision and ROC AUC score distribution. On three test cases, most of the p-values (< 0.05) and the majority of Cohen’s D values (> 0.5) showed that MEF-AlloSite’s 1–6% higher mean of average precision and ROC AUC, relative to state-of-the-art allosteric site identification methods, is statistically significant.
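
The comparison described above rests on two quantities: a Student's t statistic and Cohen's d effect size computed over repeated-run score distributions. The following is a minimal standard-library sketch of both, assuming two independent samples with a pooled standard deviation; the abstract does not say whether the authors used a paired or unpaired test, and the score lists are invented for illustration.

```python
import math
from statistics import mean, variance


def pooled_sd(a: list[float], b: list[float]) -> float:
    """Pooled sample standard deviation of two independent samples."""
    na, nb = len(a), len(b)
    return math.sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))


def cohens_d(a: list[float], b: list[float]) -> float:
    """Effect size: difference in means in units of the pooled SD."""
    return (mean(a) - mean(b)) / pooled_sd(a, b)


def t_statistic(a: list[float], b: list[float]) -> float:
    """Two-sample Student's t statistic (equal variances assumed)."""
    na, nb = len(a), len(b)
    return (mean(a) - mean(b)) / (pooled_sd(a, b) * math.sqrt(1 / na + 1 / nb))


# Hypothetical ROC AUC scores from repeated training-set splits.
method_a = [0.80, 0.82, 0.84]
method_b = [0.78, 0.80, 0.82]

print(round(cohens_d(method_a, method_b), 3))
print(round(t_statistic(method_a, method_b), 3))
```

Converting the t statistic into a p-value requires the t distribution's cumulative distribution function, which the standard library does not provide; a statistics package would normally be used for that step.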

Citations: 0
Large-scale annotation of biochemically relevant pockets and tunnels in cognate enzyme–ligand complexes
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-10-15 | DOI: 10.1186/s13321-024-00907-z
O. Vavra, J. Tyzack, F. Haddadi, J. Stourac, J. Damborsky, S. Mazurenko, J. M. Thornton, D. Bednar

Tunnels in enzymes with buried active sites are key structural features allowing the entry of substrates and the release of products, thus contributing to the catalytic efficiency. Targeting the bottlenecks of protein tunnels is also a powerful protein engineering strategy. However, the identification of functional tunnels in multiple protein structures is a non-trivial task that can only be addressed computationally. We present a pipeline integrating automated structural analysis with an in-house machine-learning predictor for the annotation of protein pockets, followed by the calculation of the energetics of ligand transport via biochemically relevant tunnels. A thorough validation using eight distinct molecular systems revealed that CaverDock analysis of ligand un/binding is on par with time-consuming molecular dynamics simulations, but much faster. The optimized and validated pipeline was applied to annotate more than 17,000 cognate enzyme–ligand complexes. Analysis of ligand un/binding energetics indicates that the top priority tunnel has the most favourable energies in 75% of cases. Moreover, energy profiles of cognate ligands revealed that a simple geometry analysis can correctly identify tunnel bottlenecks only in 50% of cases. Our study provides essential information for the interpretation of results from tunnel calculation and energy profiling in mechanistic enzymology and protein engineering. We formulated several simple rules allowing identification of biochemically relevant tunnels based on the binding pockets, tunnel geometry, and ligand transport energy profiles.

Scientific contributions

The pipeline introduced in this work allows for the detailed analysis of a large set of protein–ligand complexes, focusing on transport pathways. We are introducing a novel predictor for determining the relevance of binding pockets for tunnel calculation. For the first time in the field, we present a high-throughput energetic analysis of ligand binding and unbinding, showing that approximate methods for these simulations can identify additional mutagenesis hotspots in enzymes compared to purely geometrical methods. The predictor is included in the supplementary material and can also be accessed at https://github.com/Faranehhad/Large-Scale-Pocket-Tunnel-Annotation.git. The tunnel data calculated in this study has been made publicly available as part of the ChannelsDB 2.0 database, accessible at https://channelsdb2.biodata.ceitec.cz/.

Citations: 0
Bitter peptide prediction using graph neural networks
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-10-07 DOI: 10.1186/s13321-024-00909-x
Prashant Srivastava, Alexandra Steuer, Francesco Ferri, Alessandro Nicoli, Kristian Schultz, Saptarshi Bej, Antonella Di Pizio, Olaf Wolkenhauer

Bitter taste is an unpleasant taste modality that affects food consumption. Bitter peptides are generated during enzymatic processes that produce functional, bioactive protein hydrolysates or during the aging process of fermented products such as cheese, soybean protein, and wine. Understanding the underlying peptide sequences responsible for bitter taste can pave the way for more efficient identification of these peptides. This paper presents BitterPep-GCN, a feature-agnostic graph convolution network for bitter peptide prediction. The graph-based model learns the embedding of amino acids in the bitter peptide sequences and uses mixed pooling for bitter classification. BitterPep-GCN was benchmarked using BTP640, a publicly available bitter peptide dataset. The latent peptide embeddings generated by the trained model were used to analyze the activity of sequence motifs responsible for the bitter taste of the peptides. Particularly, we calculated the activity for individual amino acids and dipeptide, tripeptide, and tetrapeptide sequence motifs present in the peptides. Our analyses pinpoint specific amino acids, such as F, G, P, and R, as well as sequence motifs, notably tripeptide and tetrapeptide motifs containing FF, as key bitter signatures in peptides. This work not only provides a new predictor of bitter taste for a more efficient identification of bitter peptides in various food products but also gives a hint into the molecular basis of bitterness.

Scientific Contribution

Our work provides the first application of Graph Neural Networks for the prediction of peptide bitter taste. The best-developed model, BitterPep-GCN, learns the embedding of amino acids in the bitter peptide sequences and uses mixed pooling for bitter classification. The embeddings were used to analyze the sequence motifs responsible for the bitter taste.
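
The motif analysis mentioned above depends on first enumerating the dipeptide, tripeptide, and tetrapeptide windows in each sequence. The following is a minimal sketch of that enumeration step only, assuming plain one-letter amino acid strings. The function names and the toy peptides are hypothetical and are not taken from BTP640 or from the BitterPep-GCN code, and the sketch does not reproduce the embedding-based activity calculation.

```python
from collections import Counter


def kmers(seq: str, k: int) -> list[str]:
    """Return all overlapping length-k windows of a peptide sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def motif_counts(peptides: list[str], k: int) -> Counter:
    """Count in how many peptides each length-k motif occurs (once per peptide)."""
    counts: Counter = Counter()
    for pep in peptides:
        counts.update(set(kmers(pep, k)))
    return counts


# Hypothetical toy peptides for illustration only.
peptides = ["GPFFR", "FFPG", "RGPL"]
tri = motif_counts(peptides, 3)

# Tripeptide motifs that contain the FF pair highlighted in the abstract.
ff_tri = sorted(m for m in tri if "FF" in m)
print(ff_tri)
```

Counting each motif once per peptide keeps long, repetitive sequences from dominating the tallies; counting every occurrence would be an equally reasonable choice.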

Citations: 0
Data mining of PubChem bioassay records reveals diverse OXPHOS inhibitory chemotypes as potential therapeutic agents against ovarian cancer
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-10-07 DOI: 10.1186/s13321-024-00906-0
Sejal Sharma, Liping Feng, Nicha Boonpattrawong, Arvinder Kapur, Lisa Barroilhet, Manish S. Patankar, Spencer S. Ericksen

Focused screening on target-prioritized compound sets can be an efficient alternative to high throughput screening (HTS). For most biomolecular targets, compound prioritization models depend on prior screening data or a target structure. For phenotypic or multi-protein pathway targets, it may not be clear which public assay records provide relevant data. The question also arises as to whether data collected from disparate assays might be usefully consolidated. Here, we report on the development and application of a data mining pipeline to examine these issues. To illustrate, we focus on identifying inhibitors of oxidative phosphorylation, a druggable metabolic process in epithelial ovarian tumors. The pipeline compiled 8415 available OXPHOS-related bioassays in the PubChem data repository involving 312,093 unique compound records. Application of PubChem assay activity annotations, PAINS (Pan Assay Interference Compounds), and Lipinski-like bioavailability filters yields 1852 putative OXPHOS-active compounds that fall into 464 clusters. These chemotypes are diverse, with relatively high hydrophobicity and molecular weight but lower complexity and drug-likeness. They show a high abundance of bicyclic ring systems and oxygen-containing functional groups including ketones, allylic oxides (alpha/beta unsaturated carbonyls), hydroxyl groups, and ethers. In contrast, amide and primary amine functional groups have a notably lower than random prevalence. UMAP representation of the chemical space shows strong divergence in the regions occupied by OXPHOS-inactive and -active compounds. Of the six compounds selected for biological testing, four showed statistically significant inhibition of electron transport in bioenergetics assays. Two of these four compounds, lacidipine and esbiothrin, increased intracellular oxygen radicals (a major hallmark of most OXPHOS inhibitors) and decreased the viability of two ovarian cancer cell lines, ID8 and OVCAR5. Finally, data from the pipeline were used to train random forest and support vector classifiers that effectively prioritized OXPHOS inhibitory compounds within a held-out test set (ROCAUC 0.962 and 0.927, respectively) and on another set containing 44 documented OXPHOS inhibitors outside of the training set (ROCAUC 0.900 and 0.823). This prototype pipeline is extensible and could be adapted for focused screening on other phenotypic targets for which sufficient public data are available.

Scientific contribution

Here, we describe and apply an assay data mining pipeline to compile, process, filter, and mine public bioassay data. We believe the procedure may be more broadly applied to guide compound selection in early-stage hit finding on novel multi-protein mechanistic or phenotypic targets. To demonstrate the utility of our approach, we apply a data mining strategy on a large set of public assay data to find drug-like molecules that inhibit oxidative phosphorylation (OXPHOS) as candidate drugs for ovarian cancer therapy.
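
The abstract mentions Lipinski-like bioavailability filters without giving thresholds. As a rough illustration, the sketch below applies the classic rule-of-five cutoffs (molecular weight at most 500, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors, one violation tolerated) to precomputed descriptors. These cutoffs, the `Descriptors` class, and the toy values are assumptions made for illustration; the pipeline's actual filter settings are not stated in the abstract.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed physicochemical descriptors for one compound (hypothetical container)."""
    mol_weight: float
    logp: float
    h_donors: int
    h_acceptors: int


def lipinski_violations(d: Descriptors) -> int:
    """Count how many classic rule-of-five limits the compound exceeds."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


def passes_filter(d: Descriptors, max_violations: int = 1) -> bool:
    """Keep a compound if it breaks no more than max_violations limits (assumed default: 1)."""
    return lipinski_violations(d) <= max_violations


# Toy descriptor values, not taken from PubChem or from the study.
compound_a = Descriptors(mol_weight=350.0, logp=3.2, h_donors=2, h_acceptors=5)
compound_b = Descriptors(mol_weight=720.0, logp=6.1, h_donors=6, h_acceptors=12)

print(passes_filter(compound_a), passes_filter(compound_b))
```

In practice the descriptors would come from a cheminformatics toolkit; they are passed in directly here so that the sketch stays self-contained.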
Citations: 0
Insights into predicting small molecule retention times in liquid chromatography using deep learning
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-10-07 DOI: 10.1186/s13321-024-00905-1
Yuting Liu, Akiyasu C. Yoshizawa, Yiwei Ling, Shujiro Okuda

In untargeted metabolomics, structures of small molecules are annotated using liquid chromatography-mass spectrometry by leveraging information from the molecular retention time (RT) in the chromatogram and m/z (formerly called "mass-to-charge ratio") in the mass spectrum. However, correct identification of metabolites is challenging due to the vast array of small molecules. Various in silico tools for mass spectrometry peak alignment and compound prediction have therefore been developed, but the list of candidate compounds remains extensive. Accurate RT prediction is important to exclude false candidates and facilitate metabolite annotation. Recent advancements in artificial intelligence (AI) have led to significant breakthroughs in the use of deep learning models in various fields. Release of a large RT dataset has mitigated the bottlenecks limiting the application of deep learning models, thereby improving their application in RT prediction tasks. This review lists the databases that can be used to expand training datasets and addresses the issue of molecular representation inconsistencies in datasets. It also discusses the application of AI technology for RT prediction, particularly in the 5 years following the release of the METLIN small molecule RT dataset. This review provides a comprehensive overview of the AI applications used for RT prediction, highlighting the progress and remaining challenges.

Scientific contribution

This article highlights the advances made over the past five years in computational metabolomics for small molecule retention time prediction, with particular emphasis on the application of AI techniques in this field. It reviews publicly available small molecule retention time datasets, molecular representation methods, and the AI algorithms applied in recent studies. It also discusses the effectiveness of these models in assisting small molecule structure annotation and the challenges that must be addressed to achieve practical application.
Citations: 0
Combining graph neural networks and transformers for few-shot nuclear receptor binding activity prediction
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-09-27 DOI: 10.1186/s13321-024-00902-4
Luis H. M. Torres, Joel P. Arrais, Bernardete Ribeiro

Nuclear receptors (NRs) play a crucial role as biological targets in drug discovery. However, determining which compounds can act as endocrine disruptors and modulate the function of NRs with a reduced number of candidate drugs is a challenging task. Moreover, the computational methods for NR-binding activity prediction mostly focus on a single receptor at a time, which may limit their effectiveness. Hence, the transfer of learned knowledge among multiple NRs can improve the performance of molecular predictors and lead to the development of more effective drugs. In this research, we integrate graph neural networks (GNNs) and Transformers to introduce a few-shot GNN-Transformer, Meta-GTNRP, to predict the binding activity of compounds using the combined information of different NRs and identify potential NR-modulators with limited data. The Meta-GTNRP model captures the local information in graph-structured data and preserves the global-semantic structure of molecular graph embeddings for NR-binding activity prediction. Furthermore, a few-shot meta-learning approach is proposed to optimize model parameters for different NR-binding tasks and leverage the complementarity among multiple NR-specific tasks to predict binding activity of compounds for each NR with just a few labeled molecules. Experiments with a compound database containing annotations on the binding activity for 11 NRs show that Meta-GTNRP outperforms other graph-based approaches. The data and code are available at: https://github.com/ltorres97/Meta-GTNRP.

Scientific contribution

The proposed few-shot GNN-Transformer model, Meta-GTNRP, captures the local structure of molecular graphs and preserves the global-semantic information of graph embeddings to predict the NR-binding activity of compounds with limited available data. A few-shot meta-learning framework adapts model parameters across NR-specific tasks for different NRs in a joint learning procedure to predict the binding activity of compounds for each NR with just a few labeled molecules in highly imbalanced data scenarios. Meta-GTNRP is a data-efficient approach that combines the strengths of GNNs and Transformers to predict the NR-binding properties of compounds through an optimized meta-learning procedure and deliver robust results that are valuable for identifying potential NR-based drug candidates.
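
Few-shot meta-learning of this kind is usually organised around episodes: for each task (here, one NR), a small labeled support set is used for adaptation and a disjoint query set for evaluation. The sketch below shows generic episode sampling for a binary task. It is an illustration of common few-shot practice, not the Meta-GTNRP sampling scheme, which the abstract does not describe; `sample_episode` and the toy labels are hypothetical.

```python
import random


def sample_episode(
    labels: dict[str, int],
    n_support: int,
    n_query: int,
    rng: random.Random,
) -> tuple[list[str], list[str]]:
    """Draw n_support molecules per class as the support set and
    n_query of the remaining molecules as the query set."""
    support: list[str] = []
    for cls in (0, 1):
        members = sorted(m for m, y in labels.items() if y == cls)
        support += rng.sample(members, n_support)
    remaining = sorted(set(labels) - set(support))
    query = rng.sample(remaining, n_query)
    return support, query


# Toy imbalanced task: 3 binders (label 1) and 7 non-binders (label 0).
toy_labels = {f"mol{i}": (1 if i < 3 else 0) for i in range(10)}
support, query = sample_episode(toy_labels, n_support=2, n_query=4, rng=random.Random(0))
print(support, query)
```

Sampling the support set per class keeps both labels represented even when binders are rare, which matters in the imbalanced setting the authors describe.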

Citations: 0
A multi-view feature representation for predicting drugs combination synergy based on ensemble and multi-task attention models
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-09-27 DOI: 10.1186/s13321-024-00903-3
Samar Monem, Aboul Ella Hassanien, Alaa H. Abdel-Hamid

This paper proposes a novel multi-view ensemble predictor model that is designed to address the challenge of determining synergistic drug combinations by predicting both the synergy score values and the synergy class labels of drug combinations with cancer cell lines. The proposed methodology involves representing drug features through four distinct views: Simplified Molecular-Input Line-Entry System (SMILES) features, molecular graph features, fingerprint features, and drug-target features. Cell line features, in turn, are captured through four views: gene expression features, copy number features, mutation features, and proteomics features. To prevent overfitting of the model, two techniques are employed. First, each view feature of a drug is paired with each corresponding cell line view and input into a multi-task attention deep learning model. This multi-task model is trained to simultaneously predict both the synergy score value and synergy class label. This process results in sixteen input view features being fed into the multi-task model, producing sixteen prediction values. Subsequently, these prediction values are utilized as inputs for an ensemble model, which outputs the final prediction value. The ‘MVME’ model is assessed using the O’Neil dataset, which includes 38 distinct drugs combined across 39 distinct cancer cell lines to output 22,737 drug combination pairs. For the synergy score value, the proposed model scores a mean square error (MSE) of 206.57, a root mean square error (RMSE) of 14.30, and a Pearson score of 0.76. For the synergy class label, the model scores 0.90 for accuracy, 0.96 for precision, 0.57 for kappa, 0.96 for the area under the ROC curve (ROC-AUC), and 0.88 for the area under the precision-recall curve (PR-AUC).

Scientific contribution

This paper proposes an enhanced synergistic drug combination model that uses four different drug feature views and four cancer cell line views. Each view is then fed into a multi-task deep learning model to simultaneously predict the synergy score and class label. To address the challenge of managing the different views and their corresponding prediction values while avoiding overfitting, an ensemble model is applied.
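
The sixteen inputs described in the abstract follow from pairing each of the four drug views with each of the four cell line views. The sketch below makes that pairing explicit and combines the per-pair predictions with a plain mean. The mean is a stand-in chosen for illustration, since the abstract does not specify the ensemble model; the view identifiers and the placeholder predictions are also hypothetical.

```python
from itertools import product
from statistics import mean

# View names paraphrased from the abstract; the identifiers are hypothetical.
DRUG_VIEWS = ["smiles", "graph", "fingerprint", "drug_target"]
CELL_VIEWS = ["gene_expression", "copy_number", "mutation", "proteomics"]


def view_pairs() -> list[tuple[str, str]]:
    """Pair every drug view with every cell line view (4 x 4 = 16 pairs)."""
    return list(product(DRUG_VIEWS, CELL_VIEWS))


def ensemble_score(predictions: dict[tuple[str, str], float]) -> float:
    """Combine per-pair synergy predictions with a plain mean (assumed stand-in)."""
    return mean(predictions.values())


pairs = view_pairs()

# Placeholder predictions; a real pipeline would obtain one value per pair
# from the multi-task model.
toy_predictions = {pair: float(i) for i, pair in enumerate(pairs)}

print(len(pairs), ensemble_score(toy_predictions))
```

A learned ensemble would replace `ensemble_score` with a model trained on the sixteen predictions, which is closer to what the abstract describes.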
Citations: 0