Journal of Cheminformatics: latest articles

Milestones in chemoinformatics: global view of the field
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-11-05 | DOI: 10.1186/s13321-024-00922-0
Jürgen Bajorath

Over the past ~25 years, chemoinformatics has evolved as a scientific discipline, with a strong foundation in pharmaceutical research and scientific roots that can be traced back to the late 1950s. It covers a wide methodological spectrum and is perhaps best positioned in the greater context of chemical information science. Herein, the chemoinformatics discipline is delineated, characteristic (and partly problematic) features are discussed, and a global view of the field is provided, emphasizing key developments.

Citations: 0
Quantitative structure–activity relationships of chemical bioactivity toward proteins associated with molecular initiating events of organ-specific toxicity
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-11-05 | DOI: 10.1186/s13321-024-00917-x
Domenico Gadaleta, Marina Garcia de Lomana, Eva Serrano-Candelas, Rita Ortega-Vallbona, Rafael Gozalbes, Alessandra Roncaglioni, Emilio Benfenati

The adverse outcome pathway (AOP) concept has gained attention as a way to explore the mechanism of chemical toxicity. In this study, quantitative structure–activity relationship (QSAR) models were developed to predict compound activity toward protein targets relevant to molecular initiating events (MIEs) upstream of organ-specific toxicities, namely liver steatosis, cholestasis, nephrotoxicity, neural tube closure defects, and cognitive functional defects. Using bioactivity data from the ChEMBL 33 database, various machine learning algorithms, chemical features, and methods for assessing prediction reliability were compared and applied to develop robust models of compound activity. The results demonstrate high predictive performance across multiple targets, with balanced accuracy exceeding 0.80 for the majority of models. Furthermore, stability checks confirmed the consistency of predictive performance across multiple training-test splits. Using the QSAR predictions to identify known markers of adversity highlighted the utility of the models for risk assessment and for prioritizing compounds for further experimental evaluation.

Scientific contribution

The work describes the development of QSAR models as tools for screening chemicals with potential systemic toxicity, thus contributing to resource savings and providing indications for further, better-targeted testing. This study advances the computational modeling of MIEs and the use of information from AOPs, a field that is still relatively young and unexplored. The comprehensive modeling procedure is highly generalizable and offers a robust framework for predicting a wide range of toxicological endpoints.
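
The abstract above reports balanced accuracy rather than plain accuracy. The sketch below shows how that metric is commonly computed for a binary active/inactive classifier and why it differs from plain accuracy on imbalanced data. It is an illustration only: the function name and the toy labels are assumptions, not taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on actives) and specificity (recall on inactives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # actives predicted active
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # inactives predicted inactive
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # inactives predicted active
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # actives predicted inactive
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2.0


# Toy labels invented for illustration: 1 = active, 0 = inactive.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy, balanced_accuracy(y_true, y_pred))
```

On this toy set, 8 of 10 predictions are correct, yet only half of the actives are recovered, so the balanced accuracy is noticeably lower than the plain accuracy. This is the usual reason for preferring it when inactives dominate a bioactivity dataset.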

Citations: 0
StreaMD: the toolkit for high-throughput molecular dynamics simulations
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-11-05 | DOI: 10.1186/s13321-024-00918-w
Aleksandra Ivanova, Olena Mokshyna, Pavel Polishchuk

Molecular dynamics simulations serve as a prevalent approach for investigating the dynamic behaviour of proteins and protein–ligand complexes. Due to its versatility and speed, GROMACS stands out as a commonly utilized software platform for executing molecular dynamics simulations. However, its effective utilization requires substantial expertise in configuring, executing, and interpreting molecular dynamics trajectories. Existing automation tools are constrained in their capability to conduct simulations for large sets of compounds with minimal user intervention, or in their ability to distribute simulations across multiple servers. To address these challenges, we developed a Python-based tool that streamlines all phases of molecular dynamics simulations, encompassing preparation, execution, and analysis. This tool minimizes the required knowledge for users engaging in molecular dynamics simulations and can efficiently operate across multiple servers within a network or a cluster. Notably, the tool not only automates trajectory simulation but also facilitates the computation of binding free energies for protein–ligand complexes and generates interaction fingerprints across the trajectory. Our study demonstrated the applicability of this tool on several benchmark datasets. Additionally, we provided recommendations for end-users to effectively utilize the tool.

Scientific contribution

The developed tool, StreaMD, is applicable to different systems (proteins, ligands and their complexes, including co-factors) and requires little user knowledge to set up and run molecular dynamics simulations. Other features of StreaMD are seamless integration with the calculation of MM-GBSA/PBSA binding free energies and protein–ligand interaction fingerprints, and the running of simulations within distributed environments. All of these will facilitate routine and large-scale molecular dynamics simulations.

Citations: 0
Searching chemical databases in the pre-history of cheminformatics
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-11-04 | DOI: 10.1186/s13321-024-00919-9
Peter Willett

This article highlights research from the last century that has provided the basis for the searching techniques that are used in present-day cheminformatics systems, and thus provides an acknowledgement of the contributions made by early pioneers in the field.

Citations: 0
Accurate prediction of protein–ligand interactions by combining physical energy functions and graph-neural networks
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-11-04 | DOI: 10.1186/s13321-024-00912-2
Yiyu Hong, Junsu Ha, Jaemin Sim, Chae Jo Lim, Kwang-Seok Oh, Ramakrishnan Chandrasekaran, Bomin Kim, Jieun Choi, Junsu Ko, Woong-Hee Shin, Juyong Lee

We introduce an advanced model for predicting protein–ligand interactions. Our approach combines the strengths of graph neural networks with physics-based scoring methods. Existing structure-based machine-learning models for protein–ligand binding prediction often fall short in practical virtual screening scenarios, hindered by the intricacies of binding poses, the chemical diversity of drug-like molecules, and the scarcity of crystallographic data for protein–ligand complexes. To overcome the limitations of existing machine learning-based prediction models, we propose a novel approach that fuses three independent neural network models. One classification model is designed to perform binary prediction of a given protein–ligand complex pose. The other two regression models are trained to predict the binding affinity and root-mean-square deviation of a ligand conformation from an input complex structure. We trained the model to account for both deviations between experimental and predicted binding affinities and uncertainties in pose prediction. By effectively integrating the outputs of the triplet neural networks with a physics-based scoring function, our model showed significantly improved performance in hit identification. The benchmark results with three independent decoy sets demonstrate that our model outperformed existing models in forward screening. Our model achieved top 1% enrichment factors of 32.7 and 23.1 with the CASF2016 and DUD-E benchmark sets, respectively. The benchmark results using the LIT-PCBA set further confirmed its higher average enrichment factors, emphasizing the model’s efficiency and generalizability. The model’s efficiency was further validated by identifying 23 active compounds from 63 candidates in experimental screening for autotaxin inhibitors, demonstrating its practical applicability in hit discovery.

Scientific contribution

Our work introduces a novel training strategy for a protein–ligand binding affinity prediction model by integrating the outputs of three independent sub-models and utilizing expertly crafted decoy sets. The model showcases exceptional performance across multiple benchmarks. The high enrichment factors in the LIT-PCBA benchmark demonstrate its potential to accelerate hit discovery.
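
The "top 1% enrichment factor" quoted in the abstract above is a standard virtual-screening metric: the hit rate among the best-scored fraction of a library divided by the hit rate of the whole library. The sketch below illustrates one common definition. The function name, the rounding rule for the size of the top fraction (rounding up), and the toy library are assumptions for illustration; benchmark suites differ in such details, and this is not the authors' evaluation code.

```python
import math


def enrichment_factor(scores, labels, fraction=0.01):
    """EF at a given fraction: (hit rate in the top-scored subset) / (overall hit rate).

    scores: higher means predicted more likely to be active.
    labels: 1 for a known active, 0 for a decoy/inactive.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need at least one compound and one active")
    n_top = max(1, math.ceil(fraction * n_total))  # assumed convention: round up
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_in_top = sum(label for _, label in ranked[:n_top])
    return (hits_in_top / n_top) / (n_actives / n_total)


# Toy library invented for illustration: 200 compounds, 4 actives,
# already listed from best to worst score.
scores = [200 - i for i in range(200)]
labels = [0] * 200
for index in (0, 5, 50, 120):
    labels[index] = 1

print(enrichment_factor(scores, labels, fraction=0.01))
```

Here the top 1% holds two compounds, one of which is active, against an overall active rate of 2%, so the enrichment factor is about 25. A value near 1 would indicate ranking no better than random selection.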

Citations: 0
GTransCYPs: an improved graph transformer neural network with attention pooling for reliably predicting CYP450 inhibitors
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-10-29 | DOI: 10.1186/s13321-024-00915-z
Candra Zonyfar, Soualihou Ngnamsie Njimbouom, Sophia Mosalla, Jeong-Dong Kim

State-of-the-art medical studies have shown that predicting CYP450 enzyme inhibitors is beneficial in the early stage of drug discovery. However, developing accurate machine learning (ML)-based in silico methods for predicting CYP450 inhibitors remains challenging. Here, we introduce GTransCYPs, an improved graph neural network (GNN) with a transformer mechanism for predicting CYP450 inhibitors. This model significantly enhances the discrimination between inhibitors and non-inhibitors for five major CYP450 isozymes: 1A2, 2C9, 2C19, 2D6, and 3A4. GTransCYPs learns information patterns from molecular graphs by aggregating node and edge representations using a transformer. The GTransCYPs model utilizes transformer convolution layers to process features, followed by a global attention-pooling technique to synthesize the graph-level information. This information is then fed through successive linear layers for final output generation. Experimental results demonstrate that the GTransCYPs model achieved high performance, outperforming other state-of-the-art methods in CYP450 prediction.

Scientific contribution

The prediction of CYP450 inhibition via computational techniques utilizing biological information has emerged as a cost-effective and highly efficient approach. Here, we presented a deep learning (DL) architecture based on a GNN with a transformer mechanism and attention pooling (GTransCYPs) to predict CYP450 inhibitors. Four GTransCYPs variants with different pooling techniques were tested on experimental tasks for the CYP450 prediction problem for the first time. The graph transformer with the attention pooling algorithm achieved the best performance. Comparative and ablation experiments provide evidence of the efficacy of our proposed method in predicting CYP450 inhibitors. The source code is publicly available at https://github.com/zonwoo/GTransCYPs.
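
The abstract above describes a global attention-pooling step that turns per-atom (node) representations into a single graph-level vector. The sketch below illustrates the general idea in plain Python: score each node with a gate, normalize the scores with a softmax over the nodes, and take the weighted sum of node features. The linear gate, its weights, and the toy features are hypothetical and chosen for illustration only; the authors' implementation is in the repository linked above and may differ.

```python
import math


def attention_pool(node_features, gate_weights, gate_bias=0.0):
    """Pool per-node feature vectors into one graph-level vector.

    node_features: list of equal-length feature lists, one per atom/node.
    gate_weights:  weights of a (hypothetical) linear gate scoring each node.
    Returns (pooled_vector, attention_weights).
    """
    if not node_features:
        raise ValueError("graph must contain at least one node")
    # 1. One scalar gate score per node.
    gate_scores = [
        sum(w * x for w, x in zip(gate_weights, features)) + gate_bias
        for features in node_features
    ]
    # 2. Softmax over the nodes (shifted by the maximum for numerical stability).
    top_score = max(gate_scores)
    exps = [math.exp(score - top_score) for score in gate_scores]
    normalizer = sum(exps)
    attention = [value / normalizer for value in exps]
    # 3. Attention-weighted sum of node features.
    dim = len(node_features[0])
    pooled = [
        sum(a * features[d] for a, features in zip(attention, node_features))
        for d in range(dim)
    ]
    return pooled, attention


# Toy graph invented for illustration: three nodes, two features each.
nodes = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
pooled_vector, weights = attention_pool(nodes, gate_weights=[10.0, 10.0])
print(pooled_vector, weights)
```

With an all-zero gate, every node receives the same weight and the result reduces to mean pooling. A learned gate lets the network emphasize the atoms that matter most for the prediction, which is the motivation usually given for attention pooling over sum or mean pooling.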

Citations: 0
A comprehensive comparison of deep learning-based compound-target interaction prediction models to unveil guiding design principles
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-10-28 | DOI: 10.1186/s13321-024-00913-1
Sina Abdollahi, Darius P. Schaub, Madalena Barroso, Nora C. Laubach, Wiebke Hutwelker, Ulf Panzer, Søren W. Gersting, Stefan Bonn

The evaluation of compound-target interactions (CTIs) is at the heart of drug discovery efforts. Given the substantial time and monetary costs of classical experimental screening, significant efforts have been dedicated to developing deep learning-based models that can accurately predict CTIs. A comprehensive comparison of these models on a large, curated CTI dataset is, however, still lacking. Here, we perform an in-depth comparison of 12 state-of-the-art deep learning architectures that use different protein and compound representations. The models were selected for their reported performance and architectures. To reliably compare model performance, we curated over 300 thousand binding and non-binding CTIs and established several gold-standard datasets of varying size and information. Based on our findings, DeepConv-DTI consistently outperforms other models in CTI prediction performance across the majority of datasets. It achieves an MCC of 0.6 or higher for most of the datasets and is one of the fastest models in training and inference. These results indicate that utilizing convolution-based windows, as in DeepConv-DTI, to traverse trainable embeddings is a highly effective approach for capturing informative protein features. We also observed that physicochemical embeddings of targets increased model performance. We therefore modified DeepConv-DTI to include normalized physicochemical properties, which resulted in the overall best-performing model, Phys-DeepConv-DTI. This work highlights how the systematic evaluation of input features of compounds and targets, as well as their corresponding neural network architectures, can serve as a roadmap for the future development of improved CTI models.

Scientific contribution

This work features comprehensive CTI datasets to allow for the objective comparison and benchmarking of CTI prediction algorithms. Based on this dataset, we gained insights into which embeddings of compounds and targets and which deep learning-based algorithms perform best, providing a blueprint for the future development of CTI algorithms. Using the insights gained from this screen, we provide a novel CTI algorithm with state-of-the-art performance.
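
The MCC quoted in the abstract above is the Matthews correlation coefficient, a binary-classification score computed from all four cells of the confusion matrix. The sketch below shows the standard formula. The function name, the convention of returning 0 when the denominator vanishes, and the example counts are assumptions for illustration and are not results from the paper.

```python
import math


def matthews_corrcoef_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (no better than chance)
    to +1 (perfect prediction).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: undefined cases (an empty row or column) score 0.
        return 0.0
    return numerator / denominator


# Invented counts: 100 compound-target pairs, 50 binding and 50 non-binding,
# with 40 of each class predicted correctly.
print(matthews_corrcoef_from_counts(tp=40, tn=40, fp=10, fn=10))
```

With these invented counts the accuracy is 0.8 while the MCC is 0.6, which gives a rough sense of scale for an MCC threshold of 0.6 on a balanced dataset.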

Citations: 0
Towards the prediction of drug solubility in binary solvent mixtures at various temperatures using machine learning
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-10-28 DOI: 10.1186/s13321-024-00911-3
Zeqing Bao, Gary Tom, Austin Cheng, Jeffrey Watchorn, Alán Aspuru-Guzik, Christine Allen

Drug solubility is an important parameter in the drug development process, yet it is often tedious and challenging to measure, especially for expensive drugs or those available in small quantities. To alleviate these challenges, machine learning (ML) has been applied to predict drug solubility as an alternative approach. However, the majority of existing ML research has focused on predicting aqueous solubility and/or solubility at specific temperatures, which restricts model applicability in pharmaceutical development. To bridge this gap, we compiled a dataset of 27,000 solubility datapoints, including the solubility of small molecules measured in a range of binary solvent mixtures at various temperatures. Next, a panel of ML models was trained on this dataset, with hyperparameters tuned using Bayesian optimization. The resulting top-performing models, both gradient boosted decision trees (light gradient boosting machine and extreme gradient boosting), achieved mean absolute errors (MAE) of 0.33 for LogS (S in g/100 g) on the holdout set. These models were further validated through a prospective study, wherein the solubility of four drug molecules was predicted by the models and then validated with in-house solubility experiments. This prospective study demonstrated that the models accurately predicted the solubility of solutes in specific binary solvent mixtures at different temperatures, especially for drugs whose features closely align with those of the solutes in the dataset (MAE < 0.5 for LogS). To support future research and facilitate advancements in the field, we have made the dataset and code openly available.

Scientific contribution

Our research advances the state-of-the-art in predicting solubility for small molecules by leveraging ML and a uniquely comprehensive dataset. Unlike existing ML studies that predominantly focus on solubility in aqueous solvents at fixed temperatures, our work enables prediction of drug solubility in a variety of binary solvent mixtures over a broad temperature range, providing practical insights on the modeling of solubility for realistic pharmaceutical applications. These advancements along with the open access dataset and code support significant steps in the drug development process including new molecule discovery, drug analysis and formulation.
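
For readers unfamiliar with the metric quoted in this abstract, the following is a minimal sketch of how a mean absolute error on LogS could be computed. It is not the authors' code: the base-10 logarithm, the helper names `log_s` and `mean_absolute_error`, and the three solubility values are all assumptions made for illustration only.

```python
import math


def log_s(solubility_g_per_100g: float) -> float:
    """Log-solubility; base 10 is an assumption, the abstract only says 'LogS'."""
    return math.log10(solubility_g_per_100g)


def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over paired observations."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up solubilities in g/100 g (not taken from the paper's dataset).
measured = [1.0, 10.0, 0.1]
predicted = [10.0, 10.0, 0.1]

mae = mean_absolute_error(
    [log_s(s) for s in measured],
    [log_s(s) for s in predicted],
)
print(f"MAE on LogS: {mae:.3f}")
```

Because the error is taken on a logarithmic scale, an MAE of 0.33 would correspond, under the base-10 assumption, to predictions that are off by roughly a factor of two in solubility on average.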

Citations: 0
Graph neural processes for molecules: an evaluation on docking scores and strategies to improve generalization
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-10-23 DOI: 10.1186/s13321-024-00904-2
Miguel García-Ortegón, Srijit Seal, Carl Rasmussen, Andreas Bender, Sergio Bacallado

Neural processes (NPs) are models for meta-learning which output uncertainty estimates. So far, most studies of NPs have focused on low-dimensional datasets of highly-correlated tasks. While these homogeneous datasets are useful for benchmarking, they may not be representative of realistic transfer learning. In particular, applications in scientific research may prove especially challenging due to the potential novelty of meta-testing tasks. Molecular property prediction is one such research area that is characterized by sparse datasets of many functions on a shared molecular space. In this paper, we study the application of graph NPs to molecular property prediction with DOCKSTRING, a diverse dataset of docking scores. Graph NPs show competitive performance in few-shot learning tasks relative to supervised learning baselines common in chemoinformatics, as well as alternative techniques for transfer learning and meta-learning. In order to increase meta-generalization to divergent test functions, we propose fine-tuning strategies that adapt the parameters of NPs. We find that adaptation can substantially increase NPs' regression performance while maintaining good calibration of uncertainty estimates. Finally, we present a Bayesian optimization experiment which showcases the potential advantages of NPs over Gaussian processes in iterative screening. Overall, our results suggest that NPs on molecular graphs hold great potential for molecular property prediction in the low-data setting.

Neural processes are a family of meta-learning algorithms which deal with data scarcity by transferring information across tasks and making probabilistic predictions. We evaluate their performance on regression and optimization molecular tasks using docking scores, finding them to outperform classical single-task and transfer-learning models. We examine the issue of generalization to divergent test tasks, which is a general concern of meta-learning algorithms in science, and propose strategies to alleviate it.
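
The abstract refers to the calibration of uncertainty estimates without naming a metric. One common way to check calibration, sketched below under stated assumptions, is to measure how often the true value falls inside a predicted interval. The function name `interval_coverage`, the Gaussian form of the predictions, and all numbers are hypothetical and are not taken from the paper or from DOCKSTRING.

```python
def interval_coverage(y_true, means, stds, z=1.96):
    """Fraction of targets inside mean +/- z * std.

    Assumes each prediction is a Gaussian with the given mean and standard
    deviation; z = 1.96 corresponds to a nominal 95% interval.
    """
    if not (len(y_true) == len(means) == len(stds)) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    inside = sum(
        1 for y, m, s in zip(y_true, means, stds) if abs(y - m) <= z * s
    )
    return inside / len(y_true)


# Made-up docking scores and predictive distributions (illustration only).
scores = [-8.0, -7.5, -9.1, -6.0]
pred_means = [-8.2, -7.4, -8.0, -6.1]
pred_stds = [0.3, 0.2, 0.4, 0.2]

coverage = interval_coverage(scores, pred_means, pred_stds)
print(f"Empirical coverage of nominal 95% intervals: {coverage:.2f}")
```

A well-calibrated model would show empirical coverage close to the nominal level across many test points; with only four made-up points the number here is purely illustrative.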

Citations: 0
MEF-AlloSite: an accurate and robust Multimodel Ensemble Feature selection for the Allosteric Site identification model
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-10-23 DOI: 10.1186/s13321-024-00882-5
Sadettin Y. Ugurlu, David McDonald, Shan He

A crucial mechanism for controlling the actions of proteins is allostery. Allosteric modulators have the potential to provide many benefits compared to orthosteric ligands, such as increased selectivity and saturability of their effect. The identification of new allosteric sites presents prospects for the creation of innovative medications and enhances our comprehension of fundamental biological mechanisms. Allosteric sites are increasingly found in different protein families through various techniques, such as machine learning applications, which opens up possibilities for creating completely novel medications with a diverse variety of chemical structures. Machine learning methods, such as PASSer, exhibit limited efficacy in accurately finding allosteric binding sites when relying solely on 3D structural information.

Scientific Contribution

Prior to conducting feature selection for allosteric binding site identification, integration of supporting amino-acid–based information with 3D structural knowledge is advantageous. This approach can enhance performance by ensuring accuracy and robustness. Therefore, we have developed an accurate and robust model called Multimodel Ensemble Feature Selection for Allosteric Site Identification (MEF-AlloSite) after collecting 9460 relevant and diverse features from the literature to characterise pockets. The model employs an accurate and robust multimodal feature selection technique for the small training set size of only 90 proteins to improve predictive performance. This state-of-the-art technique increased the performance in allosteric binding site identification by selecting promising features from the 9460 features. In addition, analysing the relationship between the selected features and allosteric binding sites shed light on the complex allostery of proteins. MEF-AlloSite and state-of-the-art allosteric site identification methods such as PASSer2.0 and PASSerRank have been tested on three test cases 51 times, each with a different split of the training set. The Student’s t test and Cohen’s D value have been used to evaluate the average precision and ROC AUC score distributions. On the three test cases, most of the p-values (< 0.05) and the majority of Cohen’s D values (> 0.5) showed that the 1–6% higher mean of average precision and ROC AUC achieved by MEF-AlloSite, compared with state-of-the-art allosteric site identification methods, is statistically significant.
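
As a rough illustration of the two statistics named above, the sketch below computes Cohen's d and the two-sample Student's t statistic from two sets of scores. The abstract does not state which variant of either statistic was used, so the pooled-variance, unpaired form is an assumption, as are the function names and the example scores, which are not results from the paper.

```python
import math
import statistics


def pooled_std(a, b):
    """Pooled sample standard deviation of two independent groups."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))


def cohens_d(a, b):
    """Effect size: difference in means in units of the pooled std."""
    return (statistics.mean(a) - statistics.mean(b)) / pooled_std(a, b)


def student_t(a, b):
    """Two-sample t statistic with equal-variance (pooled) assumption."""
    na, nb = len(a), len(b)
    diff = statistics.mean(a) - statistics.mean(b)
    return diff / (pooled_std(a, b) * math.sqrt(1 / na + 1 / nb))


# Made-up ROC AUC scores from repeated training splits (illustration only).
method_a = [0.80, 0.82, 0.84]
method_b = [0.78, 0.80, 0.82]

d = cohens_d(method_a, method_b)
t = student_t(method_a, method_b)
print(f"Cohen's d = {d:.3f}, t = {t:.3f}")
```

Turning the t statistic into a p-value requires the t distribution with `len(a) + len(b) - 2` degrees of freedom, which a statistics library would normally supply; that step is left out here to keep the sketch dependency-free.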

Citations: 0