Pub Date : 2025-09-29 DOI: 10.1186/s13321-025-01091-4
Lucas Morin, Gerhard Ingmar Meijer, Valéry Weber, Luc Van Gool, Peter W. J. Staar
Automatic extraction of molecules from scientific literature plays a crucial role in accelerating research across fields ranging from drug discovery to materials science. Patent documents, in particular, contain molecular information in visual form, which is often inaccessible through traditional text-based searches. In this work, we introduce SubGrapher, a method for the visual fingerprinting of molecule and Markush structure images. Unlike conventional Optical Chemical Structure Recognition (OCSR) models that attempt to reconstruct full molecular graphs, SubGrapher focuses on extracting fingerprints directly from images. Using learning-based instance segmentation, SubGrapher identifies functional groups and carbon backbones, constructing a substructure-based fingerprint that enables the retrieval of molecules and Markush structures. Our approach is evaluated against state-of-the-art OCSR and fingerprinting methods, demonstrating superior retrieval performance and robustness across diverse molecule and Markush structure depictions. The benchmark datasets, models, and inference code are publicly available.
Title: SubGrapher: visual fingerprinting of chemical structures. Journal: Journal of Cheminformatics, volume 17, issue 1.
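The idea of retrieving structures from a substructure-based fingerprint can be illustrated with a minimal sketch. It assumes a hypothetical detector that returns a list of substructure labels per image; the label names, the count-based fingerprint, and the Tanimoto ranking are illustrative assumptions and not the fingerprint defined by the SubGrapher authors.

```python
from collections import Counter


def substructure_fingerprint(detections):
    """Count how often each substructure label was detected.

    `detections` is a list of label strings from a hypothetical
    segmentation model (assumption, not SubGrapher's output format).
    """
    return Counter(detections)


def tanimoto(fp_a, fp_b):
    """Count-based Tanimoto similarity: sum of minima over sum of maxima."""
    keys = set(fp_a) | set(fp_b)
    shared = sum(min(fp_a[k], fp_b[k]) for k in keys)
    total = sum(max(fp_a[k], fp_b[k]) for k in keys)
    return shared / total if total else 0.0


def retrieve(query_fp, library):
    """Return library names sorted from most to least similar to the query."""
    return sorted(library, key=lambda name: tanimoto(query_fp, library[name]), reverse=True)


# Toy library with hypothetical detections.
library = {
    "benzoic_acid": substructure_fingerprint(["benzene", "carboxylic_acid"]),
    "toluene": substructure_fingerprint(["benzene", "methyl"]),
    "ethanol": substructure_fingerprint(["hydroxyl", "ethyl"]),
}
query = substructure_fingerprint(["benzene", "carboxylic_acid"])
ranking = retrieve(query, library)
```

Because the comparison works on detected substructures rather than a reconstructed graph, a query can still be matched when parts of the drawing are not recognised, which is the motivation the abstract gives for fingerprinting directly from images.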
Pub Date : 2025-09-29 DOI: 10.1186/s13321-025-01076-3
Friedrich Hastedt, Klaus Hellgardt, Sophia Yaliraki, Dongda Zhang, Antonio del Rio Chanona
Machine learning approaches for conceptualizing and designing in silico compounds have attracted significant attention. However, the applicability of these compounds is often challenged by synthetic viability and cost-effectiveness. Researchers have introduced proxy scores, known as synthetic accessibility scores, to quantify the ease of synthesis for virtual molecules. Despite their utility, existing synthetic accessibility tools have notable limitations: they overlook compound purchasability, lack physical interpretability, and often rely on imperfect computer-aided synthesis planning algorithms. We introduce MolPrice, an accurate and fast model for molecular price prediction. Utilizing self-supervised contrastive learning, MolPrice autonomously generates price labels for synthetically complex molecules, enabling the model to generalize to molecules beyond the training distribution. Our results show that MolPrice reliably assigns higher prices to synthetically complex molecules than to readily purchasable ones, effectively distinguishing different levels of synthetic accessibility. Furthermore, MolPrice achieves competitive performance on literature benchmarks for synthetic accessibility. To demonstrate its practical utility, we conduct a virtual screening case study, illustrating how MolPrice successfully identifies purchasable molecules from a large candidate library. MolPrice bridges the gap between generative molecular design and real-world feasibility by integrating cost-awareness into synthetic accessibility assessment, making it a powerful model to accelerate molecular discovery.
Title: MolPrice: assessing synthetic accessibility of molecules based on market value. Journal: Journal of Cheminformatics, volume 17, issue 1.
Pub Date : 2025-09-26 DOI: 10.1186/s13321-025-01087-0
Tong Luo, Zheng Zhang, Xian-gan Chen, Zhi Li
Compared with monotherapy, drug combinations exhibit stronger efficacy, fewer side effects, and lower drug resistance in cancer treatment. However, traditional wet-lab methods for screening synergistic drug combinations are both costly and inefficient. Recently, the emergence of multiple drug synergy databases has promoted the development of various drug synergy prediction methods. Many of these methods use multimodal data and achieve good results. However, if the modalities are given equal consideration without taking into account the differences in features between them, multi-modal learning may be less effective. We propose a multi-modal contrastive learning method for drug synergy prediction, named MCDSP. Specifically, MCDSP extracts entity embedding features of drugs and cell lines from heterogeneous graphs, while leveraging molecular fingerprints and gene expression features as biomolecular features for drugs and cell lines. These two types of features serve as two modalities. Under the guidance of single-modality prediction tasks, we evaluate the relevant information of each modality. Through contrastive learning, the prediction bias between the two modalities is reduced, which improves the quality of the multi-modal features. Experiments show that MCDSP outperforms baseline methods on large datasets and performs well in handling unknown drug combinations and cell lines. MCDSP has demonstrated significant effectiveness in predicting drug synergy.
Title: Multi-modal contrastive drug synergy prediction model guided by single modality. Journal: Journal of Cheminformatics, volume 17, issue 1.
Pub Date : 2025-09-26 DOI: 10.1186/s13321-025-01066-5
Srijit Seal, Maria-Anna Trapotsi, Manas Mahale, Vigneshwari Subramanian, Nigel Greene, Ola Spjuth, Andreas Bender
Drug exposure, a key determinant of drug safety and efficacy, is governed by pharmacokinetic (PK) parameters such as volume of distribution (VDss), clearance (CL), half-life (t½), fraction unbound in plasma (fu), and mean residence time (MRT). In this study, we developed machine learning models to predict human PK parameters for 1,283 unique compounds using molecular structure, physicochemical properties, and predicted animal PK data. Our approach involved a two-stage modeling pipeline. First, we trained models to predict rat, dog, and monkey PK parameters (VDss, CL, fu) from chemical structure and properties for 371 compounds. These models were used to predict animal PK values for 1,283 unique compounds with human PK data. These animal PK predictions were then integrated with molecular descriptors and fingerprints to build Random Forest models for human PK parameters. The models demonstrated consistent performance across nested cross-validation and external validation sets, with predictive accuracy for VDss comparable to proprietary models developed by AstraZeneca. Notably, human VDss and CL predictions achieved external R² values of 0.39 and 0.46, respectively. To support broad accessibility and integration into early drug discovery workflows such as Design-Make-Test-Analyze (DMTA), we developed PKSmart (https://broad.io/PKSmart), a freely available web application. All code and models are also open source, enabling local deployment. To our knowledge, this represents the first public suite of PK prediction models with performance on par with industry standard models.
This study introduces the first publicly available pharmacokinetic (PK) models that match industry-standard predictions, utilizing molecular structural fingerprints, physicochemical properties, and predicted animal PK data to model human pharmacokinetics. Our approach is validated through repeated nested cross-validation and an external test set, including comparing predictions to an industry standard model. The models are released via a web-hosted application (https://broad.io/PKSmart) for wider accessibility and utility in drug development processes.
Title: PKSmart: an open-source computational model to predict intravenous pharmacokinetics of small molecules. Journal: Journal of Cheminformatics, volume 17, issue 1.
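The external R² values quoted in the abstract are easier to interpret with the metric written out. The sketch below computes the standard coefficient of determination in plain Python; the abstract does not state which exact R² variant or value transformation the authors used, so treat this as the textbook definition rather than the PKSmart evaluation code.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

Under this definition a model that always predicts the mean scores 0 and a perfect model scores 1, so an external R² of 0.39 or 0.46 means the model explains roughly 40 to 45 percent of the variance in the held-out measurements.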
Pub Date : 2025-09-25 DOI: 10.1186/s13321-025-01067-4
Jordi Gómez Borrego, Marc Torrent Burgas
Advances in docking protocols have significantly enhanced the field of protein–protein interaction (PPI) modulation, with AlphaFold2 (AF2) and molecular dynamics (MD) refinements playing pivotal roles. This study evaluates the performance of AF2 models against experimentally solved structures in docking protocols targeting PPIs. Using a dataset of 16 interactions with validated modulators, we benchmarked eight docking protocols, revealing similar performance between native and AF2 models. Local docking strategies outperformed blind docking, with TankBind_local and Glide providing the best results across the structural types tested. MD simulations and other ensemble generation algorithms, such as AlphaFlow, refined both native and AF2 models, improving docking outcomes but showing significant variability across conformations. These results suggest that, while structural refinement can enhance docking in some cases, overall performance appears to be constrained by limitations in scoring functions and docking methodologies. Although protein ensembles can improve virtual screening, predicting the most effective conformations for docking remains a challenge. These findings support the use of AF2-generated structures in docking protocols targeting PPIs and highlight the need to improve current scoring methodologies.
This study provides a systematic benchmark of docking protocols applied to protein–protein interactions (PPIs) using both experimentally solved structures and AlphaFold2 models. By integrating molecular dynamics ensembles and AlphaFlow-generated conformations, we show that structural refinement improves docking outcomes in selected cases, but overall performance remains constrained by docking scoring function limitations. Our analysis shows that AlphaFold2 models perform comparably to native structures in PPI docking, validating their use when experimental data are unavailable. These results establish a reference framework for future PPI-focused virtual screening and underscore the need for improved scoring functions and ensemble-based approaches to better exploit emerging structural prediction tools.
Title: Evaluating ligand docking methods for drugging protein–protein interfaces: insights from AlphaFold2 and molecular dynamics refinement. Journal: Journal of Cheminformatics, volume 17, issue 1.
Pub Date : 2025-09-25 DOI: 10.1186/s13321-025-01084-3
Fabian Liessmann, Paul Eisenhuth, Alexander Fürll, Oanh Vu, Rocco Moretti, Jens Meiler
In this study, we present a pipeline for identifying novel ligands targeting the Tryptophan-Aspartate-Repeat domain 40 (WDR40) of Leucine-Rich Repeat Kinase 2 (LRRK2), a protein associated with Parkinson’s disease, as part of the first Critical Assessment of Computational Hit-finding Experiments (CACHE) challenge, a blind benchmark experiment for drug discovery. Mutations in this protein are the most common genetic cause of familial Parkinson’s disease, yet this target remains understudied. We conducted an ultra-large library screening (ULLS) of the Enamine REAL space using a newly developed evolutionary algorithm, RosettaEvolutionaryLigand (REvoLd), which allows for efficient screening of combinatorial compound libraries. The protocol involved refining the target structure with molecular dynamics simulations, identifying a binding site via blind docking, and optimizing compounds through REvoLd, culminating in a manual selection among the top-scoring REvoLd hits. A single binder molecule was identified, derived from the combination of two Enamine building blocks. In the second round, derivatives of the hit compound were used as input for REvoLd to further sample within the Enamine REAL space. Ultimately, a total of five molecules were identified, of which three showed a measurable dissociation constant (K_D) better than 150 μM, showcasing the effectiveness of this approach. However, it also highlighted shortcomings, such as the preference for nitrogen-rich rings in the RosettaLigand scoring function.
We introduce the first real-world application for REvoLd, an evolutionary docking algorithm enabling efficient ultra-large library screening for flexible protein targets. Our approach identified novel binders for the WDR40 domain of LRRK2 within the CACHE challenge #1, representing the first prospective validation of REvoLd. Here, we present a preparation pipeline to allow exploration of a large protein pocket with unspecific binding areas, and unlike prior brute-force docking efforts, our method integrates receptor flexibility and combinatorial chemistry optimization.
Title: CACHE: Utilizing ultra-large library screening in Rosetta to identify novel binders of the WD-repeat domain of Leucine-Rich Repeat Kinase 2. Journal: Journal of Cheminformatics, volume 17, issue 1.
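The general idea of searching a combinatorial library with an evolutionary algorithm, instead of docking every product, can be sketched in a few lines. This is a toy illustration only: molecules are pairs of building-block indices, `toy_score` is a hypothetical stand-in for a docking score (lower is better, an assumption), and the selection, crossover, and mutation settings are arbitrary choices that do not reproduce the REvoLd protocol.

```python
import random


def evolve(n_a, n_b, score, generations=30, pop_size=20, n_keep=5, seed=0):
    """Toy evolutionary search over (block_a, block_b) index pairs."""
    rng = random.Random(seed)
    population = [(rng.randrange(n_a), rng.randrange(n_b)) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the best-scoring unique molecules (elitism).
        parents = sorted(set(population), key=lambda m: (score(m), m))[:n_keep]
        children = list(parents)
        while len(children) < pop_size:
            a, b = rng.choice(parents)
            if rng.random() < 0.5:
                # Crossover: combine block A of one parent with block B of another.
                child = (a, rng.choice(parents)[1])
            elif rng.random() < 0.5:
                # Mutation: replace building block A with a random one.
                child = (rng.randrange(n_a), b)
            else:
                # Mutation: replace building block B with a random one.
                child = (a, rng.randrange(n_b))
            children.append(child)
        population = children
    return min(population, key=lambda m: (score(m), m))


def toy_score(molecule):
    """Hypothetical score with its optimum at building blocks (7, 3)."""
    a, b = molecule
    return abs(a - 7) + abs(b - 3)


start_best = evolve(50, 50, toy_score, generations=0)
final_best = evolve(50, 50, toy_score, generations=30)
```

With these settings the search scores at most a few hundred of the 2,500 possible pairs; the same principle is what makes evolutionary sampling attractive for libraries far too large to enumerate.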
Pub Date : 2025-09-23 DOI: 10.1186/s13321-025-01100-6
Alec Lamens, Jürgen Bajorath
The concept of contrastive explanations originating from human reasoning is used in explainable artificial intelligence. In machine learning, contrastive explanations relate alternative prediction outcomes to each other involving the identification of features leading to opposing model decisions. We introduce a methodological framework for deriving contrastive explanations for machine learning models in chemistry to systematically generate intuitive explanations of predictions in high-dimensional feature spaces. The molecular contrastive explanations (MolCE) methodology explores alternative model decisions by generating virtual analogues of test compounds through replacements of molecular building blocks and quantifies the degree of “contrastive shifts” resulting from changes in model probability distributions. In a proof-of-concept study, MolCE was applied to explain selectivity predictions of ligands of D2-like dopamine receptor isoforms.
Title: Contrastive explanations for machine learning predictions in chemistry. Journal: Journal of Cheminformatics, volume 17, issue 1.
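One way to picture a "contrastive shift" is as the change in predicted class probabilities when a test compound is replaced by a virtual analogue. The sketch below uses a hypothetical definition (half the gain in the alternative class probability plus half the loss in the originally predicted class probability); the abstract does not give the MolCE formula, so the function, its scaling, and the probability values are illustrative assumptions only.

```python
def contrastive_shift(p_original, p_analogue, fact, foil):
    """Hypothetical shift score in [-1, 1].

    p_original / p_analogue: dicts mapping class name -> predicted probability.
    fact: class predicted for the original compound.
    foil: alternative class the explanation is contrasted against.
    """
    gain_foil = p_analogue[foil] - p_original[foil]
    loss_fact = p_original[fact] - p_analogue[fact]
    return 0.5 * (gain_foil + loss_fact)


# Made-up probabilities over the D2-like dopamine receptor isoforms.
original = {"D2": 0.80, "D3": 0.15, "D4": 0.05}
analogue = {"D2": 0.30, "D3": 0.60, "D4": 0.10}
shift = contrastive_shift(original, analogue, fact="D2", foil="D3")
```

A value near 1 would indicate that the building-block replacement moved the prediction almost entirely from the fact class to the foil class, while a value near 0 would indicate that the replacement had little effect on the model decision.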
Pub Date : 2025-09-23 DOI: 10.1186/s13321-025-01094-1
Kohulan Rajan, Venkata Chandrasekhar, Nisha Sharma, Sri Ram Sagar Kanakam, Felix Baensch, Christoph Steinbeck
The widespread adoption of open-source cheminformatics toolkits remains constrained by technical implementation barriers, including complex installation procedures, dependency management, and integration challenges. Here, we present Cheminformatics Microservice V3, a significant update to the existing platform that provides unified programmatic access to cheminformatics libraries, including RDKit, Chemistry Development Kit (CDK), and Open Babel through a RESTful API framework. This latest version features a newly developed, interactive web-based frontend built with React, providing users with an intuitive graphical interface for manipulating and analysing chemical structures. The frontend supports essential cheminformatics operations, including structure editing, PubChem database integration, batch molecular processing, and standardised InChI/RInChI identifier generation. The microservice V3 addresses critical accessibility barriers in computational chemistry by providing researchers with immediate access to analytical tools, eliminating the need for specialised technical expertise or complex software installations. This approach facilitates reproducible research workflows and broadens the utilisation of cheminformatics methodologies across interdisciplinary research communities. The platform is publicly accessible at https://app.naturalproducts.net, and the complete source code and documentation are available on GitHub.
{"title":"Cheminformatics Microservice V3: a web portal for chemical structure manipulation and analysis","authors":"Kohulan Rajan, Venkata Chandrasekhar, Nisha Sharma, Sri Ram Sagar Kanakam, Felix Baensch, Christoph Steinbeck","doi":"10.1186/s13321-025-01094-1","journal":{"name":"Journal of Cheminformatics","volume":"17 1"},"PeriodicalIF":5.7,"publicationDate":"2025-09-23","openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01094-1"}
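The Cheminformatics Microservice exposes its toolkits through a RESTful API, so a client typically passes a structure as a URL query parameter. The sketch below shows only the generic step of building such a request URL safely. The base URL, the resource path, and the parameter names are placeholders invented for illustration and are not taken from the service's documentation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder base URL: hypothetical, NOT the real service address.
BASE_URL = "https://example.org/api"


def build_request_url(path: str, **params: str) -> str:
    """Join the base URL, a resource path, and URL-encoded query parameters."""
    url = f"{BASE_URL}/{path.strip('/')}"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


# SMILES strings contain characters such as '=', '(' and ')' that must be
# percent-encoded before they are placed in a query string.
# The path and parameter names here are hypothetical.
url = build_request_url("chem/descriptors", smiles="CC(=O)O", toolkit="rdkit")

# Decoding the query recovers the original SMILES unchanged.
decoded = parse_qs(urlsplit(url).query)
```

Encoding the SMILES string rather than concatenating it by hand matters because unescaped characters such as `#` (triple bond) would otherwise be read as a URL fragment.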
Pub Date : 2025-09-19 DOI: 10.1186/s13321-025-01093-2
Nour H. Marzouk, Sahar Selim, Mustafa Elattar, Mai S. Mabrouk, Mohamed Mysara
In drug development, managing interactions such as drug–drug, drug–disease, and drug–nutrient is critical for ensuring the safety and efficacy of pharmacological treatments. These interactions often overlap, forming a complex, interconnected landscape that necessitates accurate prediction to improve patient outcomes and support evidence-based care. Recent advances in artificial intelligence (AI), powered by large-scale datasets (e.g., DrugBank, TWOSIDES, SIDER), have significantly enhanced interaction prediction. Machine learning, deep learning, and graph-based models show great promise, but challenges persist, including data imbalance, noisy sources, limited explainability, and underrepresentation of certain types of interactions. This systematic review of 147 studies (2018–2024) is the first to comprehensively map AI applications across major interaction types. We present a detailed taxonomy of models and datasets, emphasizing the growing roles of large language models and knowledge graphs in overcoming key limitations. Their integration, alongside explainable AI tools, enhances transparency, paving the way for AI-driven systems that proactively mitigate adverse interactions. By identifying the most promising approaches and critical research gaps, this review lays the groundwork for advancing more robust, interpretable, and personalized models for drug interaction prediction.
{"title":"A comprehensive landscape of AI applications in broad-spectrum drug interaction prediction: a systematic review","authors":"Nour H. Marzouk, Sahar Selim, Mustafa Elattar, Mai S. Mabrouk, Mohamed Mysara","doi":"10.1186/s13321-025-01093-2","journal":{"name":"Journal of Cheminformatics","volume":"17 1"},"PeriodicalIF":5.7,"publicationDate":"2025-09-19","openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01093-2"}
Pub Date : 2025-09-03 DOI: 10.1186/s13321-025-01089-y
Jun Hyeong Park, Ri Han, Junbo Jang, Jisan Kim, Joonki Paik, Jaesung Heo, Yoonji Lee
The metabolic stability of a drug is a crucial determinant of its pharmacokinetic properties, including clearance, half-life, and oral bioavailability. Accurate predictions of metabolic stability can significantly streamline the drug discovery process. In this study, we present MetaboGNN, an advanced model for predicting liver metabolic stability based on Graph Neural Networks (GNNs) and Graph Contrastive Learning (GCL). Using a high-quality dataset from the 2023 South Korea Data Challenge for Drug Discovery, which comprises 3,498 training molecules and 483 test molecules, we represented molecular structures as graphs to capture the intricate structural relationships that influence metabolic stability. A GCL-driven pretraining step was employed to enhance model generalizability by learning robust, transferable graph-level representations. Notably, incorporating interspecies differences between human liver microsomes (HLM) and mouse liver microsomes (MLM) further improved predictive accuracy, achieving Root Mean Square Error (RMSE) values of 27.91 (HLM) and 27.86 (MLM), both expressed in units of the percentage of parent compound remaining after a 30-min incubation. Compared to traditional approaches, MetaboGNN demonstrates superior predictive performance and highlights the importance of considering interspecies enzymatic variations. In addition, attention-based analysis identified key molecular fragments associated with metabolic stability, highlighting chemically meaningful structural determinants. These findings establish MetaboGNN as a powerful tool for metabolic stability prediction, supporting more efficient lead optimization processes in drug discovery.
{"title":"MetaboGNN: predicting liver metabolic stability with graph neural networks and cross-species data","authors":"Jun Hyeong Park, Ri Han, Junbo Jang, Jisan Kim, Joonki Paik, Jaesung Heo, Yoonji Lee","doi":"10.1186/s13321-025-01089-y","journal":{"name":"Journal of Cheminformatics","volume":"17 1"},"PeriodicalIF":5.7,"publicationDate":"2025-09-03","openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01089-y"}
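The MetaboGNN results are reported as RMSE in units of percent parent compound remaining after incubation. For readers unfamiliar with the metric, the sketch below computes RMSE on a handful of values. The numbers are made up for illustration and are not from the challenge dataset, and the function name is an assumption.

```python
import math
from typing import Sequence


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error between observed and predicted values."""
    if len(observed) != len(predicted):
        raise ValueError("inputs must have the same length")
    if not observed:
        raise ValueError("inputs must not be empty")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical % parent compound remaining after a 30-min incubation.
observed = [80.0, 50.0, 20.0]
predicted = [70.0, 60.0, 20.0]

# Errors are 10, -10 and 0, so the result is sqrt(200 / 3), about 8.16.
error = rmse(observed, predicted)
```

Because the errors are squared before averaging, RMSE carries the same unit as the measurement itself, which is why the reported values can be read directly as percentage points of compound remaining.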