
Journal of Cheminformatics: Latest Articles

PretoxTM: a text mining system for extracting treatment-related findings from preclinical toxicology reports
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-02-03 | DOI: 10.1186/s13321-024-00925-x
Javier Corvi, Nicolás Díaz-Roussel, José M. Fernández, Francesco Ronzano, Emilio Centeno, Pablo Accuosto, Celine Ibrahim, Shoji Asakura, Frank Bringezu, Mirjam Fröhlicher, Annika Kreuchwig, Yoko Nogami, Jeong Rih, Raul Rodriguez-Esteban, Nicolas Sajot, Joerg Wichard, Heng-Yi Michael Wu, Philip Drew, Thomas Steger-Hartmann, Alfonso Valencia, Laura I. Furlong, Salvador Capella-Gutierrez

Over the last few decades the pharmaceutical industry has generated a vast corpus of knowledge on the safety and efficacy of drugs. Much of this information is contained in toxicology reports, which summarise the results of animal studies designed to analyse the effects of the tested compound, including unintended pharmacological and toxic effects, known as treatment-related findings. Despite the potential of this knowledge, the fact that most of this relevant information is only available as unstructured text with variable degrees of digitisation has hampered its systematic access, use and exploitation. Text mining technologies have the ability to automatically extract, analyse and aggregate such information, providing valuable new insights into the drug discovery and development process. In the context of the eTRANSAFE project, we present PretoxTM (Preclinical Toxicology Text Mining), the first system specifically designed to detect, extract, organise and visualise treatment-related findings from toxicology reports. The PretoxTM tool comprises three main components: PretoxTM Corpus, PretoxTM Pipeline and PretoxTM Web App. The PretoxTM Corpus is a gold standard corpus of preclinical treatment-related findings annotated by toxicology experts. This corpus was used to develop, train and validate the PretoxTM Pipeline, which extracts treatment-related findings from preclinical study reports. The extracted information is then presented for expert visualisation and validation in the PretoxTM Web App.

Scientific Contribution

While text mining solutions have been widely used in the clinical domain to identify adverse drug reactions from various sources, no similar systems exist for identifying adverse events in animal models during preclinical testing. PretoxTM fills this gap by efficiently extracting treatment-related findings from preclinical toxicology reports. This provides a valuable resource for toxicology research, enhancing the efficiency of safety evaluations, saving time, and leading to more effective decision-making in the drug development process.

Citations: 0
APBIO: bioactive profiling of air pollutants through inferred bioactivity signatures and prediction of novel target interactions
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-01-31 | DOI: 10.1186/s13321-025-00961-1
Eva Viesi, Ugo Perricone, Patrick Aloy, Rosalba Giugno

More sophisticated representations of compounds attempt to incorporate not only information on the structure and physicochemical properties of molecules, but also knowledge about their biological traits, leading to the so-called bioactivity profile. The bioactive profiling of air pollutants is both challenging and crucial: their biological activity and toxicological effects have not yet been investigated in depth, and further exploration could shed light on the impact of air pollution on complex disorders. A biological signature that simultaneously captures the chemistry and the biology of small molecules may therefore help predict the behaviour of such ligands towards a protein target. Moreover, the interaction between biological entities can be represented through combined feature vectors that are given as input to a machine learning (ML) model to capture the underlying interaction. To this end, we propose a chemogenomic approach, called Air Pollutant Bioactivity (APBIO), which integrates compound bioactivity signatures and target sequence descriptors to train ML classifiers that are subsequently used to predict potential compound-target interactions (CTIs). We report the performance of the proposed methodology and, via external validation sets, demonstrate that it outperforms existing molecular representations in terms of model generalizability. We have also developed a publicly available Streamlit application for APBIO at ap-bio.streamlit.app, allowing users to predict associations between investigated compounds and protein targets.
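
As a rough illustration of the combined feature vectors described above, the sketch below concatenates a compound signature with a target sequence descriptor to form one input row per compound-target pair. The signature values, the amino-acid-composition descriptor, and the function names are illustrative assumptions, not APBIO's actual features or code.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(sequence):
    """Hypothetical target descriptor: fraction of each standard amino acid."""
    counts = Counter(sequence)
    n = len(sequence)
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]


def pair_features(compound_signature, target_sequence):
    """Concatenate compound and target descriptors into one feature vector."""
    return list(compound_signature) + aa_composition(target_sequence)


# Toy inputs, made up for illustration only.
signature = [0.12, -0.40, 0.93]  # stand-in for a bioactivity signature
sequence = "MKTAYIAK"            # stand-in for a protein sequence

x = pair_features(signature, sequence)
print(len(x))  # 3 signature values + 20 composition values
```

Each such row, paired with an interaction label, could then be passed to any standard classifier; the abstract does not specify which descriptors or models APBIO uses beyond "ML classifiers".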

Scientific contribution

We derived ex novo bioactivity signatures for air pollutant molecules to capture their biological behaviour and associations with protein targets. The proposed chemogenomic methodology enables the prediction of novel CTIs for known or similar compounds and targets through well-established and efficient ML models, deepening our insight into the molecular interactions and mechanisms that may have a deleterious impact on human biological systems.

Citations: 0
MLinvitroTox reloaded for high-throughput hazard-based prioritization of high-resolution mass spectrometry data
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-01-31 | DOI: 10.1186/s13321-025-00950-4
Katarzyna Arturi, Eliza J. Harris, Lilian Gasser, Beate I. Escher, Georg Braun, Robin Bosshard, Juliane Hollender

MLinvitroTox is an automated Python pipeline developed for high-throughput hazard-driven prioritization of toxicologically relevant signals detected in complex environmental samples through high-resolution tandem mass spectrometry (HRMS/MS). MLinvitroTox is a machine learning (ML) framework comprising 490 independent XGBoost classifiers trained on molecular fingerprints from chemical structures and target-specific endpoints from the ToxCast/Tox21 invitroDBv4.1 database. For each analyzed HRMS feature, MLinvitroTox generates a 490-bit bioactivity fingerprint used as a basis for prioritization, focusing the time-consuming molecular identification efforts on features most likely to cause adverse effects. The practical advantages of MLinvitroTox are demonstrated for groundwater HRMS data. Among the 874 features for which molecular fingerprints were derived from spectra, including 630 nontargets, 185 spectral matches, and 59 targets, around 4% of the feature/endpoint relationship pairs were predicted to be active. Cross-checking the predictions for targets and spectral matches with invitroDB data confirmed the bioactivity of 120 active and 6791 nonactive pairs while mislabeling 88 active and 56 non-active relationships. By filtering according to bioactivity probability, endpoint scores, and similarity to the training data, the number of potentially toxic features was reduced by at least one order of magnitude. This refinement makes the analytical confirmation of the toxicologically most relevant features feasible, offering significant benefits for cost-efficient chemical risk assessment.
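
The counts in the abstract can be checked with a few lines of arithmetic. The sketch below uses only numbers stated above; pooling the four cross-check counts into a single agreement rate, and the 0.5 threshold in the fingerprint helper, are assumptions made for illustration and are not taken from the MLinvitroTox package.

```python
# Feature counts reported in the abstract.
nontargets, spectral_matches, targets = 630, 185, 59
total_features = nontargets + spectral_matches + targets
print(total_features)  # 874

# Cross-check against invitroDB (targets and spectral matches only).
confirmed_active, confirmed_nonactive = 120, 6791
mislabeled_active, mislabeled_nonactive = 88, 56

checked = (confirmed_active + confirmed_nonactive
           + mislabeled_active + mislabeled_nonactive)
agreement = (confirmed_active + confirmed_nonactive) / checked
print(checked, round(agreement, 3))  # 7055 0.98


def to_fingerprint(probabilities, threshold=0.5):
    """Turn per-endpoint probabilities into bits (threshold is an assumption)."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Three made-up endpoint probabilities standing in for the 490 classifiers.
print(to_fingerprint([0.91, 0.08, 0.55]))  # [1, 0, 1]
```

The overall agreement is high largely because non-active pairs dominate the cross-checked set, so it says little on its own about how well active pairs are recovered.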

Scientific Contribution:

In contrast to the classical ML-based approaches for toxicity prediction, MLinvitroTox predicts bioactivity for HRMS features (i.e., distinct m/z signals) based on MS2 fragmentation spectra rather than the chemical structures from the identified features. While the original proof of concept study was accompanied by the release of a MLinvitroTox v1 KNIME workflow, in this study, we release a Python MLinvitroTox v2 package, which, in addition to automation, expands functionality to include predicting toxicity from structures, cleaning up and generating chemical fingerprints, customizing models, and retraining on custom data. Furthermore, as a result of improvements in bioactivity data processing, realized in the concurrently released pytcpl Python package for the custom processing of invitroDBv4.1 input data used for training MLinvitroTox, the current release introduces enhancements in model accuracy, coverage of biological mechanistic targets, and overall interpretability.

Citations: 0
AiGPro: a multi-tasks model for profiling of GPCRs for agonist and antagonist
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-01-29 | DOI: 10.1186/s13321-024-00945-7
Rahul Brahma, Sunghyun Moon, Jae-Min Shin, Kwang-Hwi Cho

G protein-coupled receptors (GPCRs) play vital roles in various physiological processes, making them attractive drug discovery targets. Meanwhile, deep learning techniques have revolutionized drug discovery by facilitating efficient tools for expediting the identification and optimization of ligands. However, existing models for the GPCRs often focus on single-target or a small subset of GPCRs or employ binary classification, constraining their applicability for high throughput virtual screening. To address these issues, we introduce AiGPro, a novel multitask model designed to predict small molecule agonists (EC50) and antagonists (IC50) across the 231 human GPCRs, making it a first-in-class solution for large-scale GPCR profiling.

Leveraging multi-scale context aggregation and bidirectional multi-head cross-attention mechanisms, our approach demonstrates that ensemble models may not be necessary for predicting complex GPCR states and small molecule interactions. Through extensive validation using stratified tenfold cross-validation, AiGPro achieves robust performance with Pearson's correlation coefficient of 0.91, indicating broad generalizability. This breakthrough sets a new standard in the GPCR studies, outperforming previous studies. Moreover, our first-in-class multi-tasking model can predict agonist and antagonist activities across a wide range of GPCRs, offering a comprehensive perspective on ligand bioactivity within this diverse superfamily. To facilitate easy accessibility, we have deployed a web-based platform for model access at https://aicadd.ssu.ac.kr/AiGPro.

Scientific Contribution

We introduce a deep learning-based multi-task model to generalize the agonist and antagonist bioactivity prediction for GPCRs accurately. The model is implemented on a user-friendly web server to facilitate rapid screening of small-molecule libraries, expediting GPCR-targeted drug discovery. Covering a diverse set of 231 GPCR targets, the platform delivers a robust, scalable solution for advancing GPCR-focused therapeutic development.

The proposed framework incorporates an innovative dual-label prediction strategy, enabling the simultaneous classification of molecules as agonists, antagonists, or both. Each prediction is further accompanied by a confidence score, offering a quantitative measure of activity likelihood. This advancement moves beyond conventional models focusing solely on binding affinity, providing a more comprehensive understanding of ligand-receptor interactions.

At the core of our model lies the Bi-Directional Multi-Head Cross-Attention (BMCA) module, a novel architecture that captures forward and backward contextual embeddings of protein and ligand features. By leveraging BMCA, the model effectively integrates structural and sequence-level information, ensuring a precise representation of molecular interactions. Results show that this approach is highly accurate in binding affini
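
For readers unfamiliar with the metric quoted in the abstract, Pearson's correlation coefficient between measured and predicted activities can be computed as in the following sketch. The activity values are made up for illustration and are not AiGPro results; the function name is likewise an assumption.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Toy activity values (hypothetical, e.g. on a log scale).
measured = [5.0, 6.0, 7.0, 8.0]
predicted = [5.4, 5.9, 7.3, 7.8]

r = pearson_r(measured, predicted)
print(round(r, 3))
```

Because the coefficient is invariant to shifting and rescaling the predictions, a high value indicates a strong linear relationship but does not by itself guarantee small absolute errors.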
Citations: 0
hERGAT: predicting hERG blockers using graph attention mechanism through atom- and molecule-level interaction analyses
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-01-28 | DOI: 10.1186/s13321-025-00957-x
Dohyeon Lee, Sunyong Yoo

The human ether-a-go-go-related gene (hERG) channel plays a critical role in the electrical activity of the heart, and its blockers can cause serious cardiotoxic effects. Thus, screening for hERG channel blockers is a crucial step in the drug development process. Many in silico models have been developed to predict hERG blockers, saving time and resources. However, previous methods have struggled both to achieve high performance and to yield interpretable predictions. To overcome these challenges, we propose hERGAT, a graph neural network model with an attention mechanism that considers compound interactions at the atomic and molecular levels. In the atom-level interaction analysis, we applied a graph attention mechanism (GAT) that integrates information from neighboring nodes and their extended connections. hERGAT employs a gated recurrent unit (GRU) with the GAT to learn information between more distant atoms. To confirm this, we performed clustering analysis and visualized a correlation heatmap, verifying that interactions between distant atoms were considered during the training process. In the molecule-level interaction analysis, the attention mechanism enables the target node to focus on the most relevant information, highlighting the molecular substructures that play crucial roles in predicting hERG blockers. Through a literature review, we confirmed that the highlighted substructures have a significant role in determining the chemical and biological characteristics related to hERG activity. Furthermore, we integrated physicochemical properties into our hERGAT model to improve performance. Our model achieved an area under the receiver operating characteristic curve of 0.907 and an area under the precision-recall curve of 0.904, demonstrating its effectiveness in modeling hERG activity and offering a reliable framework for optimizing drug safety in early development stages.
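
To make the atom-level attention step more concrete, here is a minimal sketch of how a single graph attention head scores an atom's neighbors, following the generic GAT formulation (a shared linear map, a LeakyReLU-scored concatenation, and a softmax over neighbors). It is not hERGAT's code: the weights, features, and three-atom graph are made up, and the GRU and molecule-level readout are omitted.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def gat_attention(features, neighbors, W, a):
    """Attention coefficients alpha[i][j] for one attention head.

    features:  node index -> input feature vector
    neighbors: node index -> neighbor indices (including the node itself)
    W:         shared linear transform (rows = output dimensions)
    a:         attention vector of length 2 * output dimensions
    """
    z = {i: matvec(W, h) for i, h in features.items()}
    alpha = {}
    for i, nbrs in neighbors.items():
        # Raw score for each neighbor j: LeakyReLU(a . [z_i || z_j]).
        scores = [leaky_relu(sum(ak * v for ak, v in zip(a, z[i] + z[j])))
                  for j in nbrs]
        # Numerically stable softmax over the neighborhood.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alpha[i] = {j: e / total for j, e in zip(nbrs, exps)}
    return alpha


# Toy three-atom chain 0-1-2 with made-up features and weights.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.5, -0.5, 1.0, 0.0]

alpha = gat_attention(features, neighbors, W, a)
for i in sorted(alpha):
    print(i, {j: round(w, 3) for j, w in alpha[i].items()})
```

The coefficients for each atom sum to one, so they can be read as the relative weight each neighbor contributes when that atom's representation is updated. This is the kind of quantity that attention-based interpretability analyses inspect.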

Scientific contribution:

hERGAT is a deep learning model for predicting hERG blockers by combining GAT and GRU, enabling it to capture complex interactions at atomic and molecular levels. We improve the model's interpretability by analyzing the highlighted molecular substructures, providing valuable insights into their roles in determining hERG activity. The model achieves high predictive performance, confirming its potential as a preliminary tool for early cardiotoxicity assessment and enhancing the reliability of the results.

Citations: 0
The algebraic extended atom-type graph-based model for precise ligand–receptor binding affinity prediction
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-01-22 | DOI: 10.1186/s13321-025-00955-z
Farjana Tasnim Mukta, Md Masud Rana, Avery Meyer, Sally Ellingson, Duc D. Nguyen

Accurate prediction of ligand-receptor binding affinity is crucial in structure-based drug design, significantly impacting the development of effective drugs. Recent advances in machine learning (ML)–based scoring functions have improved these predictions, yet challenges remain in modeling complex molecular interactions. This study introduces the AGL-EAT-Score, a scoring function that integrates extended atom-type multiscale weighted colored subgraphs with algebraic graph theory. This approach leverages the eigenvalues and eigenvectors of graph Laplacian and adjacency matrices to capture high-level details of specific atom pairwise interactions. Evaluated against benchmark datasets such as CASF-2016, CASF-2013, and the Cathepsin S dataset, the AGL-EAT-Score demonstrates notable accuracy, outperforming existing traditional and ML-based methods. The model’s strength lies in its comprehensive similarity analysis, examining protein sequence, ligand structure, and binding site similarities, thus ensuring minimal bias and over-representation in the training sets. The use of extended atom types in graph coloring enhances the model’s capability to capture the intricacies of protein-ligand interactions. The AGL-EAT-Score marks a significant advancement in drug design, offering a tool that could potentially refine and accelerate the drug discovery process.

Scientific Contribution

The AGL-EAT-Score presents an algebraic graph-based framework that predicts ligand-receptor binding affinity by constructing multiscale weighted colored subgraphs from the 3D structure of protein-ligand complexes. It improves prediction accuracy by modeling interactions between extended atom types, addressing challenges like dataset bias and over-representation. Benchmark evaluations demonstrate that AGL-EAT-Score outperforms existing methods, offering a robust and systematic tool for structure-based drug design.
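
The algebraic graph idea described above can be sketched with standard weighted-graph definitions. This is an illustration only, not the paper's exact formulation: the kernel \Phi, its parameters \eta and \kappa, the bipartite protein-ligand construction, and the particular eigenvalue statistics are all assumptions introduced here.

```latex
% Vertices: atoms i with positions r_i and extended atom types t_i.
% Edge weights for one chosen type pair (T, T'); \Phi is an assumed decaying kernel.
W_{ij} = W_{ji} =
\begin{cases}
\Phi\!\left(\lVert \mathbf{r}_i - \mathbf{r}_j \rVert;\ \eta\right), &
  i \in \text{protein},\ t_i = T,\ \ j \in \text{ligand},\ t_j = T',\\
0, & \text{otherwise (including } i = j\text{)},
\end{cases}
\qquad
\Phi(r;\eta) = e^{-(r/\eta)^{\kappa}} \quad \text{(one possible choice)}.

% Weighted adjacency and Laplacian matrices of the colored subgraph.
A = W, \qquad L = D - W, \qquad D_{ii} = \sum_{j} W_{ij}.

% Spectral descriptors: eigenpairs of L (or A), summarised into a few statistics.
L\,\mathbf{v}_k = \lambda_k \mathbf{v}_k, \qquad
\text{features}(T, T', \eta) =
\Bigl\{ \textstyle\sum_k \lambda_k,\ \min_{\lambda_k > 0} \lambda_k,\ \max_k \lambda_k,\ \ldots \Bigr\}.
```

Under these assumptions, repeating the construction over several type pairs (T, T') and kernel scales \eta would give a "multiscale" feature vector for a regression model. The abstract does not specify which statistics or learner the authors use.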

Citations: 0
StreamChol: a web-based application for predicting cholestasis
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-01-21 DOI: 10.1186/s13321-024-00943-9
Pablo Rodríguez-Belenguer, Emilio Soria-Olivas, Manuel Pastor

This article introduces StreamChol, a software tool for developing and applying mechanistic models to predict cholestasis. StreamChol is a Streamlit application, usable as a desktop application or as web-accessible software when installed on a server using a Docker container.

StreamChol allows a seamless integration of pharmacokinetic analyses with Machine Learning models. This integration not only enables cholestasis prediction but also opens avenues for predicting other toxicological endpoints requiring similar integrations. StreamChol's Docker containerization also streamlines deployment across diverse environments, addressing potential compatibility issues. StreamChol is distributed as open-source under GNU GPL v3, reflecting our commitment to open science. Through StreamChol, researchers are offered a potent tool for predictive modelling in toxicology, harnessing its strengths within an intuitive and user-friendly interface, without the need for any programming knowledge.

Scientific contribution

This work offers a user-friendly web-based tool for cholestasis prediction and a complete workflow for creating web platforms that require the combination of both programming languages, R and Python.

Citations: 0
Matched pairs demonstrate robustness against inter-assay variability
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-01-20 DOI: 10.1186/s13321-025-00956-y
Jochem Nelen, Horacio Pérez-Sánchez, Hans De Winter, Dries Van Rompaey

Machine learning models for chemistry require large datasets, often compiled by combining data from multiple assays. However, combining data without careful curation can introduce significant noise. While absolute values from different assays are rarely comparable, trends or differences between compounds are often assumed to be consistent. This study evaluates that assumption by analyzing potency differences between matched compound pairs across assays and assessing the impact of assay metadata curation on error reduction. We find that potency differences between matched pairs exhibit less variability than individual compound measurements, suggesting systematic assay differences may partially cancel out in paired data. Metadata curation further improves inter-assay agreement, albeit at the cost of dataset size. For minimally curated compound pairs, agreement within 0.3 pChEMBL units was found to be 44–46% for Ki and IC50 values, respectively, which improved to 66–79% after curation. Similarly, the percentage of pairs with differences exceeding 1 pChEMBL unit dropped from 12–15% to 6–8% with extensive curation. These results establish a benchmark for expected noise in matched molecular pair data from the ChEMBL database, offering practical metrics for data quality assessment.
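
To make the cancellation argument concrete, the following minimal Python sketch compares agreement of individual measurements with agreement of pairwise potency differences between two assays. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the dictionary layout, and the toy pChEMBL values are all invented here, and the second assay is constructed with a systematic offset of roughly 0.5 units.

```python
from itertools import combinations


def single_agreement(assay_a, assay_b, tol=0.3):
    """Fraction of shared compounds whose values agree within tol across assays."""
    shared = sorted(set(assay_a) & set(assay_b))
    if not shared:
        return float("nan")
    hits = sum(abs(assay_a[c] - assay_b[c]) <= tol for c in shared)
    return hits / len(shared)


def pair_agreement(assay_a, assay_b, tol=0.3):
    """Fraction of compound pairs whose potency difference agrees within tol."""
    shared = sorted(set(assay_a) & set(assay_b))
    pairs = list(combinations(shared, 2))
    if not pairs:
        return float("nan")
    hits = 0
    for c1, c2 in pairs:
        delta_a = assay_a[c1] - assay_a[c2]  # potency difference in assay A
        delta_b = assay_b[c1] - assay_b[c2]  # same pair measured in assay B
        if abs(delta_a - delta_b) <= tol:
            hits += 1
    return hits / len(pairs)


# Hypothetical pChEMBL values; assay B reads about 0.5 units higher than assay A.
assay_a = {"c1": 6.0, "c2": 7.2, "c3": 5.5, "c4": 8.1}
assay_b = {"c1": 6.5, "c2": 7.8, "c3": 6.0, "c4": 8.5}

print(single_agreement(assay_a, assay_b))  # 0.0: no single value agrees within 0.3
print(pair_agreement(assay_a, assay_b))    # 1.0: every pairwise difference agrees
```

In this toy case the constant offset ruins agreement for every individual compound but drops out of each difference, which is the effect the abstract attributes to paired data. Real assay differences are not a pure offset, so the cancellation is only partial.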

Citations: 0
Chemical space as a unifying theme for chemistry
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-01-16 DOI: 10.1186/s13321-025-00954-0
Jean-Louis Reymond

Chemistry has diversified from a basic understanding of the elements to studying millions of highly diverse molecules and materials, which together are conceptualized as the chemical space. A map of this chemical space where distances represent similarities between compounds can represent the mutual relationships between different subfields of chemistry and help the discipline to be viewed and understood globally.

Citations: 0
One size does not fit all: revising traditional paradigms for assessing accuracy of QSAR models used for virtual screening
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-01-16 DOI: 10.1186/s13321-025-00948-y
James Wellnitz, Sankalp Jain, Joshua E. Hochuli, Travis Maxfield, Eugene N. Muratov, Alexander Tropsha, Alexey V. Zakharov

Traditional best practices for quantitative structure–activity relationship (QSAR) modeling recommend dataset balancing and balanced accuracy (BA) as the key desired objective of model development. This study explores the value of the conventional norms in the context of using QSAR models for virtual screening of modern large and ultra-large chemical libraries. For this increasingly common task, we now recommend the use of models with the highest positive predictive value (PPV) built on imbalanced training sets as preferred virtual screening tools. This recommendation stems from practical considerations of how the results of virtual screening are used in experimental laboratories where only a small fraction of virtually screened molecules can be tested using standard well plates. As a proof of concept, we have developed QSAR models for five expansive datasets with different ratios of active and inactive molecules and compared model performance in virtual screening using BA, PPV, and other metrics. We show that training on imbalanced datasets achieves a hit rate at least 30% higher than using balanced datasets, and that the PPV metric captured this difference of performance with no parameter tuning. Importantly, hit rates were estimated for top-scoring compounds organized in plate-sized batches (for instance, 128 molecules), as used in experimental high-throughput screening. Based on the results of our studies, we posit that QSAR models trained on imbalanced datasets with the highest PPV should be relied upon to identify and test hit compounds in early drug discovery studies.
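
The metrics contrasted above can be written out in a few lines of Python. This is a minimal sketch using the standard definitions of balanced accuracy and PPV plus a simple top-N hit rate; the function names, the 0.5 score threshold, and the toy labels and scores are assumptions made for illustration and do not come from the study.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 = active and 0 = inactive."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return (sensitivity + specificity) / 2


def ppv(y_true, y_pred):
    """Positive predictive value: the share of predicted actives that are true actives."""
    tp, _, fp, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if tp + fp else 0.0


def hit_rate_top_n(y_true, scores, n=128):
    """Fraction of true actives among the n highest-scoring compounds (one 'plate')."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top = ranked[:n]
    return sum(label for _, label in top) / len(top) if top else 0.0


# Hypothetical screening results: 3 actives among 10 compounds.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.2, 0.7, 0.1, 0.3, 0.4, 0.6, 0.05, 0.5]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold

print(balanced_accuracy(y_true, y_pred))    # about 0.857
print(ppv(y_true, y_pred))                  # 0.6
print(hit_rate_top_n(y_true, scores, n=3))  # about 0.667
```

BA averages performance over both classes across the whole library, whereas PPV and the top-N hit rate describe only the compounds that would actually be selected for testing. That is why the latter two track what a laboratory sees on a plate.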

Citations: 0