Pub Date: 2024-11-05 DOI: 10.1186/s13321-024-00922-0
Jürgen Bajorath
Over the past ~25 years, chemoinformatics has evolved as a scientific discipline, with a strong foundation in pharmaceutical research and scientific roots that can be traced back to the late 1950s. It covers a wide methodological spectrum and is perhaps best positioned in the greater context of chemical information science. Herein, the chemoinformatics discipline is delineated, characteristic (and partly problematic) features are discussed, and a global view of the field is provided, emphasizing key developments.
Title: Milestones in chemoinformatics: global view of the field. Journal of Cheminformatics 16(1). PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00922-0
Pub Date: 2024-11-05 DOI: 10.1186/s13321-024-00917-x
Domenico Gadaleta, Marina Garcia de Lomana, Eva Serrano-Candelas, Rita Ortega-Vallbona, Rafael Gozalbes, Alessandra Roncaglioni, Emilio Benfenati
The adverse outcome pathway (AOP) concept has gained attention as a way to explore the mechanism of chemical toxicity. In this study, quantitative structure–activity relationship (QSAR) models were developed to predict compound activity toward protein targets relevant to molecular initiating events (MIEs) upstream of organ-specific toxicities, namely liver steatosis, cholestasis, nephrotoxicity, neural tube closure defects, and cognitive functional defects. Using bioactivity data from the ChEMBL 33 database, various machine learning algorithms, chemical features, and methods to assess prediction reliability were compared and applied to develop robust models to predict compound activity. The results demonstrate high predictive performance across multiple targets, with balanced accuracy exceeding 0.80 for the majority of models. Furthermore, stability checks confirmed the consistency of predictive performance across multiple training-test splits. The results obtained by using QSAR predictions to identify known markers of adversities highlighted the utility of the models for risk assessment and for prioritizing compounds for further experimental evaluation.
Scientific contribution
The work describes the development of QSAR models as tools for screening chemicals with potential systemic toxicity, thus contributing to resource savings and providing indications for further, better-targeted testing. This study advances the computational modeling of MIEs and the use of information from AOPs, a field that is still relatively young and unexplored. The comprehensive modeling procedure is highly generalizable and offers a robust framework for predicting a wide range of toxicological endpoints.
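The abstract reports balanced accuracy as its headline metric. As a minimal sketch of how that quantity is defined for a binary active/inactive classifier (the function name and the toy labels below are illustrative assumptions, not taken from the study), it is the mean of sensitivity and specificity:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = active, 0 = inactive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against a class being absent from the evaluation set.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


# Hypothetical toy data: 4 actives, 6 inactives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
ba = balanced_accuracy(y_true, y_pred)  # sensitivity 3/4, specificity 5/6
```

Because each class contributes equally, the metric is less flattering than plain accuracy on imbalanced bioactivity data, which is presumably why it is preferred for this kind of screening model.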
Title: Quantitative structure–activity relationships of chemical bioactivity toward proteins associated with molecular initiating events of organ-specific toxicity. Journal of Cheminformatics 16(1). PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00917-x
Pub Date: 2024-11-05 DOI: 10.1186/s13321-024-00918-w
Aleksandra Ivanova, Olena Mokshyna, Pavel Polishchuk
Molecular dynamics simulations are a prevalent approach for investigating the dynamic behaviour of proteins and protein–ligand complexes. Due to its versatility and speed, GROMACS is a commonly used software platform for running molecular dynamics simulations. However, using it effectively requires substantial expertise in configuring and executing simulations and in interpreting the resulting trajectories. Existing automation tools are limited in their ability to run simulations for large sets of compounds with minimal user intervention, or to distribute simulations across multiple servers. To address these challenges, we developed a Python-based tool that streamlines all phases of molecular dynamics simulations, encompassing preparation, execution, and analysis. This tool minimizes the knowledge required of users and can operate efficiently across multiple servers within a network or a cluster. Notably, the tool not only automates trajectory simulation but also facilitates the computation of binding free energies for protein–ligand complexes and generates interaction fingerprints across the trajectory. Our study demonstrated the applicability of this tool on several benchmark datasets. Additionally, we provided recommendations for end-users to use the tool effectively.
Scientific contribution
The developed tool, StreaMD, is applicable to different systems (proteins, ligands, and their complexes, including co-factors) and requires little user knowledge to set up and run molecular dynamics simulations. Other features of StreaMD are seamless integration with the calculation of MM-GBSA/PBSA binding free energies and protein–ligand interaction fingerprints, and the ability to run simulations within distributed environments. Together, these features will facilitate routine and large-scale molecular dynamics simulations.
Title: StreaMD: the toolkit for high-throughput molecular dynamics simulations. Journal of Cheminformatics 16(1). PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00918-w
Pub Date: 2024-11-04 DOI: 10.1186/s13321-024-00919-9
Peter Willett
This article highlights research from the last century that has provided the basis for the searching techniques that are used in present-day cheminformatics systems, and thus provides an acknowledgement of the contributions made by early pioneers in the field.
Title: Searching chemical databases in the pre-history of cheminformatics. Journal of Cheminformatics 16(1). PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00919-9
Pub Date: 2024-11-04 DOI: 10.1186/s13321-024-00912-2
Yiyu Hong, Junsu Ha, Jaemin Sim, Chae Jo Lim, Kwang-Seok Oh, Ramakrishnan Chandrasekaran, Bomin Kim, Jieun Choi, Junsu Ko, Woong-Hee Shin, Juyong Lee
We introduce an advanced model for predicting protein–ligand interactions. Our approach combines the strengths of graph neural networks with physics-based scoring methods. Existing structure-based machine-learning models for protein–ligand binding prediction often fall short in practical virtual screening scenarios, hindered by the intricacies of binding poses, the chemical diversity of drug-like molecules, and the scarcity of crystallographic data for protein–ligand complexes. To overcome the limitations of existing machine learning-based prediction models, we propose a novel approach that fuses three independent neural network models. One classification model is designed to perform binary prediction of a given protein–ligand complex pose. The other two regression models are trained to predict the binding affinity and root-mean-square deviation of a ligand conformation from an input complex structure. We trained the model to account for both deviations between experimental and predicted binding affinities and pose prediction uncertainties. By effectively integrating the outputs of the triplet neural networks with a physics-based scoring function, our model showed significantly improved performance in hit identification. The benchmark results with three independent decoy sets demonstrate that our model outperformed existing models in forward screening. Our model achieved top 1% enrichment factors of 32.7 and 23.1 with the CASF2016 and DUD-E benchmark sets, respectively. The benchmark results using the LIT-PCBA set further confirmed its higher average enrichment factors, emphasizing the model’s efficiency and generalizability. The model’s efficiency was further validated by identifying 23 active compounds from 63 candidates in experimental screening for autotaxin inhibitors, demonstrating its practical applicability in hit discovery.
Scientific contribution
Our work introduces a novel training strategy for a protein–ligand binding affinity prediction model by integrating the outputs of three independent sub-models and utilizing expertly crafted decoy sets. The model showcases exceptional performance across multiple benchmarks. The high enrichment factors in the LIT-PCBA benchmark demonstrate its potential to accelerate hit discovery.
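The benchmark figures above are top 1% enrichment factors. As a rough sketch of the commonly used definition (the hit rate among the top-ranked fraction divided by the hit rate of the whole library), and assuming higher scores mean "more likely active", the calculation might look like the following. The function name, the integer `top_percent` parameter, and the toy library are illustrative assumptions; the benchmarks' exact tie-breaking and rounding conventions are not given in the abstract.

```python
def enrichment_factor(scores, labels, top_percent=1):
    """EF at top_percent: hit rate in the top-ranked slice / hit rate in the full library."""
    n = len(scores)
    total_actives = sum(labels)
    if n == 0 or total_actives == 0:
        raise ValueError("need a non-empty library with at least one active")
    # Size of the top slice, computed with integers to avoid floating-point rounding.
    n_top = max(1, n * top_percent // 100)
    # Rank compounds from best to worst score (higher score = better, by assumption).
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    return (hits_top / n_top) / (total_actives / n)


# Hypothetical library: 1000 compounds, 10 actives, 3 of them ranked in the top 10.
scores = [1000 - i for i in range(1000)]
labels = [0] * 1000
for idx in (0, 1, 2, 500, 501, 502, 503, 504, 505, 506):
    labels[idx] = 1
ef1 = enrichment_factor(scores, labels, top_percent=1)
```

In this toy case the top 1% holds 30% actives against a 1% base rate, so the enrichment factor is about 30. Note that the maximum attainable value depends on the active-to-decoy ratio of the benchmark, so enrichment factors are only comparable within the same set.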
Title: Accurate prediction of protein–ligand interactions by combining physical energy functions and graph-neural networks. Journal of Cheminformatics 16(1). PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00912-2
Pub Date: 2024-10-29 DOI: 10.1186/s13321-024-00915-z
Candra Zonyfar, Soualihou Ngnamsie Njimbouom, Sophia Mosalla, Jeong-Dong Kim
State-of-the-art medical studies have shown that predicting CYP450 enzyme inhibitors is beneficial in the early stages of drug discovery. However, developing accurate machine learning (ML)-based in silico methods for predicting CYP450 inhibitors remains challenging. Here, we introduce GTransCYPs, an improved graph neural network (GNN) with a transformer mechanism for predicting CYP450 inhibitors. This model significantly enhances the discrimination between inhibitors and non-inhibitors for five major CYP450 isozymes: 1A2, 2C9, 2C19, 2D6, and 3A4. GTransCYPs learns information patterns from molecular graphs by aggregating node and edge representations using a transformer. The GTransCYPs model uses transformer convolution layers to process features, followed by a global attention-pooling technique to synthesize the graph-level information. This information is then fed through successive linear layers to generate the final output. Experimental results demonstrate that the GTransCYPs model achieved high performance, outperforming other state-of-the-art methods in CYP450 prediction.
Scientific contribution
The prediction of CYP450 inhibition via computational techniques utilizing biological information has emerged as a cost-effective and highly efficient approach. Here, we presented a deep learning (DL) architecture based on a GNN with a transformer mechanism and attention pooling (GTransCYPs) to predict CYP450 inhibitors. Four GTransCYPs variants with different pooling techniques were tested on the CYP450 prediction problem for the first time. The graph transformer with attention pooling achieved the best performance. Comparative and ablation experiments provide evidence of the efficacy of our proposed method in predicting CYP450 inhibitors. The source code is publicly available at https://github.com/zonwoo/GTransCYPs.
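To make the idea of global attention pooling concrete, here is a minimal, dependency-free sketch: each node (atom) embedding gets a scalar gate score, the scores are normalized with a softmax over the graph, and the graph-level vector is the weighted sum of node embeddings. This is not the GTransCYPs implementation; the linear gate `gate_weights` stands in for what would be a learned gate network, and all names and numbers are assumptions for illustration.

```python
import math


def attention_pool(node_features, gate_weights):
    """Pool node embeddings into one graph vector using softmax attention weights."""
    # Scalar gate score per node: dot product with a (hypothetically learned) gate vector.
    scores = [sum(w * x for w, x in zip(gate_weights, h)) for h in node_features]
    # Numerically stable softmax over all nodes of the graph.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    norm = sum(exps)
    alphas = [e / norm for e in exps]
    # Attention-weighted sum of node embeddings gives the graph-level representation.
    dim = len(node_features[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, node_features)) for d in range(dim)]
    return pooled, alphas


# Hypothetical 3-atom graph with 2-dimensional node embeddings.
nodes = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
pooled, alphas = attention_pool(nodes, gate_weights=[0.5, -0.5])
```

With an all-zero gate the weights become uniform and the result reduces to mean pooling, which is one way to see attention pooling as a learned generalization of the simpler pooling variants the authors compare against.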
Title: GTransCYPs: an improved graph transformer neural network with attention pooling for reliably predicting CYP450 inhibitors. Journal of Cheminformatics 16(1). PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00915-z
Pub Date: 2024-10-28 DOI: 10.1186/s13321-024-00913-1
Sina Abdollahi, Darius P. Schaub, Madalena Barroso, Nora C. Laubach, Wiebke Hutwelker, Ulf Panzer, Søren W. Gersting, Stefan Bonn
The evaluation of compound-target interactions (CTIs) is at the heart of drug discovery efforts. Given the substantial time and monetary costs of classical experimental screening, significant efforts have been dedicated to developing deep learning-based models that can accurately predict CTIs. A comprehensive comparison of these models on a large, curated CTI dataset is, however, still lacking. Here, we perform an in-depth comparison of 12 state-of-the-art deep learning architectures that use different protein and compound representations. The models were selected for their reported performance and architectures. To reliably compare model performance, we curated over 300 thousand binding and non-binding CTIs and established several gold-standard datasets of varying size and information. Based on our findings, DeepConv-DTI consistently outperforms other models in CTI prediction performance across the majority of datasets. It achieves an MCC of 0.6 or higher for most of the datasets and is one of the fastest models in training and inference. These results indicate that using convolutional windows, as in DeepConv-DTI, to traverse trainable embeddings is a highly effective approach for capturing informative protein features. We also observed that physicochemical embeddings of targets increased model performance. We therefore modified DeepConv-DTI to include normalized physicochemical properties, which resulted in the overall best-performing model, Phys-DeepConv-DTI. This work highlights how the systematic evaluation of input features of compounds and targets, as well as their corresponding neural network architectures, can serve as a roadmap for the future development of improved CTI models.
Scientific contribution
This work features comprehensive CTI datasets to allow for the objective comparison and benchmarking of CTI prediction algorithms. Based on this dataset, we gained insights into which embeddings of compounds and targets and which deep learning-based algorithms perform best, providing a blueprint for the future development of CTI algorithms. Using the insights gained from this screen, we provide a novel CTI algorithm with state-of-the-art performance.
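For readers unfamiliar with the MCC (Matthews correlation coefficient) threshold quoted above, the following sketch computes it from binary confusion-matrix counts. The function name and the example counts are illustrative assumptions and do not come from the paper's results; returning 0 when a marginal is empty is a common convention rather than part of the definition.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report 0 when any marginal is empty and the ratio is undefined.
    return numerator / denominator if denominator else 0.0


# Hypothetical balanced test set: 50 binders and 50 non-binders, 80% of each predicted correctly.
mcc = matthews_corrcoef(tp=40, tn=40, fp=10, fn=10)
```

On this balanced toy set, 80% accuracy corresponds to an MCC of 0.6, which gives a feel for what the reported threshold means; on imbalanced sets the same accuracy can yield a much lower MCC.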
Title: A comprehensive comparison of deep learning-based compound-target interaction prediction models to unveil guiding design principles. Journal of Cheminformatics 16(1). PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00913-1
Pub Date: 2024-10-28 DOI: 10.1186/s13321-024-00911-3
Zeqing Bao, Gary Tom, Austin Cheng, Jeffrey Watchorn, Alán Aspuru-Guzik, Christine Allen
Drug solubility is an important parameter in the drug development process, yet it is often tedious and challenging to measure, especially for expensive drugs or those available in small quantities. To alleviate these challenges, machine learning (ML) has been applied to predict drug solubility as an alternative approach. However, the majority of existing ML research has focused on predicting aqueous solubility and/or solubility at specific temperatures, which restricts the applicability of the models in pharmaceutical development. To bridge this gap, we compiled a dataset of 27,000 solubility datapoints, including the solubility of small molecules measured in a range of binary solvent mixtures at various temperatures. Next, a panel of ML models was trained on this dataset, with their hyperparameters tuned using Bayesian optimization. The resulting top-performing models, both gradient boosted decision trees (light gradient boosting machine and extreme gradient boosting), achieved mean absolute errors (MAE) of 0.33 for LogS (S in g/100 g) on the holdout set. These models were further validated through a prospective study, in which the solubility of four drug molecules was predicted by the models and then validated with in-house solubility experiments. This prospective study demonstrated that the models accurately predicted the solubility of solutes in specific binary solvent mixtures at different temperatures, especially for drugs whose features closely align with those of the solutes in the dataset (MAE < 0.5 for LogS). To support future research and facilitate advancements in the field, we have made the dataset and code openly available.
Scientific contribution
Our research advances the state-of-the-art in predicting solubility for small molecules by leveraging ML and a uniquely comprehensive dataset. Unlike existing ML studies that predominantly focus on solubility in aqueous solvents at fixed temperatures, our work enables prediction of drug solubility in a variety of binary solvent mixtures over a broad temperature range, providing practical insights on the modeling of solubility for realistic pharmaceutical applications. These advancements along with the open access dataset and code support significant steps in the drug development process including new molecule discovery, drug analysis and formulation.
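The reported error metric is the mean absolute error on LogS, with S in g/100 g. A minimal sketch of that calculation follows; it assumes LogS is the base-10 logarithm of solubility (the abstract does not state the base), and the measured and predicted values are invented for illustration only.

```python
import math


def log_s(solubility_g_per_100g):
    """Convert solubility in g/100 g to LogS (base-10 logarithm assumed here)."""
    return math.log10(solubility_g_per_100g)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical measurements (g/100 g) and model predictions (already in LogS units).
measured = [10.0, 1.0, 0.1]
predicted_log_s = [1.3, -0.2, -0.5]
measured_log_s = [log_s(s) for s in measured]  # approximately [1.0, 0.0, -1.0]
mae = mean_absolute_error(measured_log_s, predicted_log_s)
```

Under the base-10 assumption, an MAE of 0.33 in LogS corresponds to predictions that are off by roughly a factor of two in absolute solubility on average (10^0.33 ≈ 2.1), which is a useful way to read the holdout result.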
Towards the prediction of drug solubility in binary solvent mixtures at various temperatures using machine learning. Zeqing Bao, Gary Tom, Austin Cheng, Jeffrey Watchorn, Alán Aspuru-Guzik, Christine Allen. Journal of Cheminformatics 16 (1). Publication date: 2024-10-28. DOI: 10.1186/s13321-024-00911-3
Pub Date : 2024-10-23 DOI: 10.1186/s13321-024-00904-2
Miguel García-Ortegón, Srijit Seal, Carl Rasmussen, Andreas Bender, Sergio Bacallado
Neural processes (NPs) are models for meta-learning that output uncertainty estimates. So far, most studies of NPs have focused on low-dimensional datasets of highly correlated tasks. While these homogeneous datasets are useful for benchmarking, they may not be representative of realistic transfer learning. In particular, applications in scientific research may prove especially challenging due to the potential novelty of meta-testing tasks. Molecular property prediction is one such research area, characterized by sparse datasets of many functions on a shared molecular space. In this paper, we study the application of graph NPs to molecular property prediction with DOCKSTRING, a diverse dataset of docking scores. Graph NPs show competitive performance in few-shot learning tasks relative to supervised learning baselines common in chemoinformatics, as well as alternative techniques for transfer learning and meta-learning. To increase meta-generalization to divergent test functions, we propose fine-tuning strategies that adapt the parameters of NPs. We find that adaptation can substantially increase NPs' regression performance while maintaining good calibration of uncertainty estimates. Finally, we present a Bayesian optimization experiment that showcases the potential advantages of NPs over Gaussian processes in iterative screening. Overall, our results suggest that NPs on molecular graphs hold great potential for molecular property prediction in the low-data setting.
Neural processes are a family of meta-learning algorithms that deal with data scarcity by transferring information across tasks and making probabilistic predictions. We evaluate their performance on molecular regression and optimization tasks using docking scores, finding that they outperform classical single-task and transfer-learning models. We examine the issue of generalization to divergent test tasks, which is a general concern for meta-learning algorithms in science, and propose strategies to alleviate it.
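A single step of the iterative screening experiment can be sketched as follows. The abstract does not specify the acquisition function, so the lower-confidence-bound rule, the `kappa` parameter and all names and numbers below are assumptions made for illustration. The sketch only shows how a model that returns a mean and an uncertainty for each molecule (as both NPs and Gaussian processes do) can drive the choice of which compounds to dock next. It also assumes that lower docking scores are better.

```python
def select_next_batch(predictions, batch_size=2, kappa=1.0):
    """Pick the molecules with the lowest lower confidence bound.

    predictions maps a molecule identifier to (predicted_mean, predicted_std).
    The bound is mean - kappa * std, so uncertain molecules get a bonus.
    """
    def lower_confidence_bound(item):
        _, (mean, std) = item
        return mean - kappa * std

    ranked = sorted(predictions.items(), key=lower_confidence_bound)
    return [mol_id for mol_id, _ in ranked[:batch_size]]


# Hypothetical predicted docking scores as (mean, std); lower is assumed better.
candidates = {
    "mol_a": (-8.0, 0.2),
    "mol_b": (-7.5, 1.5),
    "mol_c": (-6.0, 0.1),
    "mol_d": (-7.9, 0.1),
}
print(select_next_batch(candidates))
```

With `kappa=0` the rule reduces to greedy selection on the predicted mean. Larger values favour uncertain molecules, which is where the calibration of the uncertainty estimates matters.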
Graph neural processes for molecules: an evaluation on docking scores and strategies to improve generalization. Miguel García-Ortegón, Srijit Seal, Carl Rasmussen, Andreas Bender, Sergio Bacallado. Journal of Cheminformatics 16 (1). Publication date: 2024-10-23. DOI: 10.1186/s13321-024-00904-2
Pub Date : 2024-10-23 DOI: 10.1186/s13321-024-00882-5
Sadettin Y. Ugurlu, David McDonald, Shan He
A crucial mechanism for controlling the actions of proteins is allostery. Allosteric modulators have the potential to provide many benefits compared to orthosteric ligands, such as increased selectivity and saturability of their effect. The identification of new allosteric sites presents prospects for the creation of innovative medications and enhances our comprehension of fundamental biological mechanisms. Allosteric sites are increasingly found in different protein families through various techniques, such as machine learning applications, which opens up possibilities for creating completely novel medications with a diverse variety of chemical structures. Machine learning methods, such as PASSer, exhibit limited efficacy in accurately finding allosteric binding sites when relying solely on 3D structural information.
Scientific Contribution
Prior to conducting feature selection for allosteric binding site identification, integrating supporting amino-acid–based information with 3D structural knowledge is advantageous. This approach can enhance performance by ensuring accuracy and robustness. Therefore, we have developed an accurate and robust model called Multimodel Ensemble Feature Selection for Allosteric Site Identification (MEF-AlloSite) after collecting 9460 relevant and diverse features from the literature to characterise pockets. The model employs an accurate and robust multimodal feature selection technique suited to the small training set of only 90 proteins to improve predictive performance. This state-of-the-art technique increased the performance in allosteric binding site identification by selecting promising features from the 9460 features. In addition, analysing the relationship between the selected features and allosteric binding sites improved the understanding of complex protein allostery. MEF-AlloSite and state-of-the-art allosteric site identification methods such as PASSer2.0 and PASSerRank were tested on three test cases 51 times, each with a different split of the training set. Student’s t test and Cohen’s D were used to compare the distributions of average precision and ROC AUC scores. On the three test cases, most p-values (< 0.05) and the majority of Cohen’s D values (> 0.5) showed that MEF-AlloSite’s 1–6% higher mean average precision and ROC AUC, relative to state-of-the-art allosteric site identification methods, is statistically significant.
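The statistical comparison described above can be illustrated with a minimal sketch. It assumes an unpaired, equal-variance Student's t statistic and the pooled-standard-deviation form of Cohen's D; the abstract does not say which variants were used. The function names and the scores are hypothetical and do not come from the study.

```python
import math
import statistics


def cohens_d(a, b):
    """Cohen's D using a pooled standard deviation (one common definition)."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


def student_t(a, b):
    """Two-sample Student's t statistic, assuming equal variances."""
    n_a, n_b = len(a), len(b)
    return cohens_d(a, b) / math.sqrt(1 / n_a + 1 / n_b)


# Hypothetical ROC AUC scores from repeated training-set splits, for illustration only.
method_a = [0.80, 0.84, 0.88]
method_b = [0.78, 0.82, 0.86]

print(f"Cohen's D: {cohens_d(method_a, method_b):.2f}")
print(f"t statistic: {student_t(method_a, method_b):.2f}")
```

Turning the t statistic into a p-value requires the t distribution with `n_a + n_b - 2` degrees of freedom; a statistics library such as SciPy (`scipy.stats.ttest_ind`) provides this directly. In the study's setting each list would hold the 51 scores obtained from the different training-set splits.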
MEF-AlloSite: an accurate and robust Multimodel Ensemble Feature selection for the Allosteric Site identification model. Sadettin Y. Ugurlu, David McDonald, Shan He. Journal of Cheminformatics 16 (1). Publication date: 2024-10-23. DOI: 10.1186/s13321-024-00882-5