Pub Date : 2026-02-05 DOI: 10.1186/s13321-026-01158-w
Suwan Mao, Wenjie Tang, Li Li, Mang Jing, Yun Liu, Junjie Wang
Drug combination therapy is a well-established strategy for treating complex diseases. However, the vast combinatorial space makes exhaustive experimental screening impractical and costly. Recent studies have shown that deep learning can effectively prioritize synergistic drug combinations through nonlinear modeling and automatic feature extraction. Meanwhile, Large Language Models (LLMs) show great promise in drug discovery. In this paper, we propose CoSynLLM, an LLM-assisted framework for predicting drug combination synergy. We use the latent knowledge embedded in LLMs to generate semantic-level chemical information, complement it with drug fingerprints that carry explicit structural details, and represent the cellular context with cell line gene expression profiles. To merge drug and cell line representations, a hierarchical feature fusion strategy progressively integrates the features through multiple stages. Extensive experiments on two benchmark datasets, NCI-ALMANAC and O'Neil, show that CoSynLLM achieves competitive performance. In summary, CoSynLLM offers a robust and practical computational framework for identifying synergistic drug combinations.
Title: CoSynLLM: predicting drug combination synergy with LLM-generated descriptions. Journal: Journal of Cheminformatics.
Pub Date : 2026-02-03 DOI: 10.1186/s13321-026-01159-9
Darlene Nabila Zetta, Tarapong Srisongkram
Limited experimental data remains a key challenge in applying machine learning to drug discovery, particularly for cancer-related targets. In this study, we present a data-efficient active meta-deep learning framework to predict mitogen-activated protein kinase 1 (MAPK1) inhibitors, which are promising candidates for cancer-related therapies. Our approach integrates active learning (AL) with a meta-model that combines four deep architectures (a convolutional neural network, an attention network, a graph convolutional network, and a graph neural network-attention model) trained on molecular descriptors and graph-based representations. These models generate four probability-based features that feed into an attention-based meta-learner, improving predictive performance by 5.12% in the area under the precision-recall curve (AUPRC) and 5.48% in the Matthews correlation coefficient (MCC) using only 10% of the training data. Among the AL sampling strategies evaluated, entropy sampling showed competitive performance in selecting informative molecules for model improvement. Overall, our framework achieves an AUPRC of 0.835 ± 0.017 and an MCC of 0.817 ± 0.017, on par with a traditional training method despite using only 26.7% of the training data. Compared to a conventional random forest model trained by brute force on the full (100%) training set, our approach shows a 10.6% improvement in AUPRC and modest gains in MCC, confirming the effectiveness of the proposed framework. Under severe class imbalance, balanced accuracy steadily increased across AL iterations, reaching values greater than 0.85 at the final iteration for all uncertainty-driven strategies. Molecular docking confirmed successful prioritization of the top four predicted compounds. Evaluation on an external MAPK1 data set demonstrated generalizability, with our approach achieving an AUPRC of 0.818 and an MCC of 0.403, comparable to the independent test set. These results highlight the potential of combining intelligent data selection with deep learning architectures through the meta-model to improve predictive performance in data-scarce drug discovery. Scientific contribution: This study contributes a novel, data-efficient active meta-deep learning framework for predicting MAPK1 inhibitors, addressing the challenge of limited experimental data for a cancer-specific target. By integrating AL with a meta-model composed of four deep architectures, the approach significantly enhances predictive performance using only a fraction of the training data. The framework achieves superior metrics compared to traditional training methods, highlighting its potential to accelerate drug discovery in data-scarce settings.
Title: Data-efficient learning for accurate identification of MAPK1 inhibitors using an active meta-deep learning framework. Journal: Journal of Cheminformatics.
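The entropy sampling strategy mentioned in the MAPK1 abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the molecule IDs, and the predicted probabilities are hypothetical, and a binary active/inactive classifier is assumed. The idea is to pick the unlabeled molecules whose predicted class distribution has the highest Shannon entropy, that is, the ones the current model is least certain about.

```python
import math


def predictive_entropy(p_active: float) -> float:
    """Shannon entropy (in bits) of a binary prediction with P(active) = p_active."""
    entropy = 0.0
    for p in (p_active, 1.0 - p_active):
        if p > 0.0:  # the term 0 * log(0) is treated as 0
            entropy -= p * math.log2(p)
    return entropy


def entropy_sampling(pool_probs: dict[str, float], batch_size: int) -> list[str]:
    """Return the IDs of the batch_size pool molecules with the highest entropy."""
    ranked = sorted(
        pool_probs,
        key=lambda mol_id: predictive_entropy(pool_probs[mol_id]),
        reverse=True,
    )
    return ranked[:batch_size]


# Hypothetical model outputs for four unlabeled molecules (illustrative values).
pool = {"mol_a": 0.97, "mol_b": 0.52, "mol_c": 0.10, "mol_d": 0.45}
selected = entropy_sampling(pool, batch_size=2)
print(selected)  # the two molecules closest to P(active) = 0.5
```

In an active learning loop, the selected molecules would be labeled, added to the training set, and the model retrained before the next round of selection.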
Pub Date : 2026-02-03 DOI: 10.1186/s13321-026-01163-z
Meiling Zhan, Xiang Li, Le Xiong, Wenxiang Song, Jiaojiao Fang, Guixia Liu, Yun Tang, Weihua Li
UDP-glucuronosyltransferases (UGTs) play a critical role in drug metabolism by catalyzing the glucuronidation of structurally diverse compounds. However, accurately predicting UGT-mediated sites of metabolism (SOMs) remains a challenge due to the limited availability of annotated data. In this study, we introduce UGTformer, a unified graph transformer-based framework that simultaneously performs UGT substrate classification and SOM prediction. UGTformer employs a hierarchical architecture integrating multi-hop message propagation with hop-aware and node-level transformer encoders. The model was pretrained on large-scale molecular graphs via chemically informed self-supervised tasks, and fine-tuned on a manually curated UGT metabolism data set covering four major metabolic reaction categories. In five-fold cross-validation, UGTformer achieved an AUC of 0.833 for substrate classification and 0.884 for SOM identification, outperforming multiple GNN baselines. On an independent external validation set, it maintained robust performance, demonstrating strong generalization to previously unseen molecules. By integrating chemically meaningful structural encodings and a joint learning paradigm, UGTformer delivers interpretable and biologically consistent predictions, offering a reliable and scalable approach for UGT-related metabolism prediction. The UGTformer model is freely accessible at https://lmmd.ecust.edu.cn/UGTformer/.
Title: UGTformer: a pretrained graph transformer model for predicting UDP-glucuronosyltransferase-mediated drug metabolism. Journal: Journal of Cheminformatics.
Pub Date : 2026-02-02 DOI: 10.1186/s13321-025-01148-4
Mohamed Iheb Hergli, Emna Harigua-Souiai
Discovering novel drug candidates remains a considerable challenge in pharmaceutical research. Generative AI models such as Generative Adversarial Networks (GANs) have shown considerable promise in de novo molecular generation and in drug discovery applications, yet they often face challenges such as limited chemical coverage and mode collapse. In the present study, we developed MolGAN-QRL, a hybrid quantum-classical framework that introduces quantum-enhanced reinforcement learning within the MolGAN architecture to address these limitations. The proposed framework leverages a hybrid reward mechanism to further optimize the chemical validity, uniqueness, and drug-likeness of the generated molecules. Experimental results demonstrated that MolGAN-QRL consistently achieved better generative performance than classical MolGAN, with up to a 16-fold increase in the count of unique and valid generated compounds under certain conditions. These gains reflect the effectiveness of quantum-guided exploration and highlight the known trade-off between uniqueness and validity in generative chemistry. Overall, our findings underline the value of quantum-enhanced reward modeling in mitigating mode collapse and advancing molecular generation, and support the potential of hybrid quantum-classical methods to advance generative chemistry for drug discovery applications. SCIENTIFIC CONTRIBUTION: MolGAN-QRL introduces the first variant of the MolGAN framework that is augmented with a variational quantum circuit (VQC) within the reinforcement-learning reward module, rather than the generator, the discriminator or the noise function. It leverages a hybrid reward mechanism that trains a quantum-classical function, which leads to better mitigation of mode collapse through higher uniqueness scores and 16-fold more novel, valid and unique molecules generated.
Title: MolGAN-QRL: a hybrid framework for molecule generation using quantum-enhanced reinforcement learning. Journal: Journal of Cheminformatics.
Pub Date : 2026-02-02 DOI: 10.1186/s13321-026-01156-y
Elier E Abreu-Martínez, Karina Martinez-Mayorga, Gabriel Merino
We developed SPARFlow, an open-source KNIME workflow for structure-activity or structure-property relationship (SAR/SPR) analyses. The workflow integrates data preprocessing, chemical structure curation, similarity network construction, maximum common substructure detection, R-group decomposition, activity cliff identification, and database modelability assessment. It implements established indices, including SALI, SARI, MODI*, and RMODI, to characterize SAR landscapes and assess dataset suitability for predictive modeling. SPARFlow was validated using four datasets with distinct chemical and endpoint characteristics: cruzain inhibitors, biased μ-opioid receptor agonists, pesticides, and carbonyl compounds with hydration constants. Scientific Contribution: This work introduces SPARFlow, a KNIME-integrated workflow that combines data curation, activity-cliff detection, and modelability assessment for SAR and SPR studies. The workflow provides a unified implementation of key SAR analyses within a single KNIME pipeline. It updates implementations of established metrics, including MODI* and RMODI, together with complementary indices such as SARI and SALI. It ensures consistent data flow across all modules.
Title: SPARFlow: a KNIME workflow for integrated structure-activity or structure-property relationship analysis. Journal: Journal of Cheminformatics.
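SALI, one of the indices listed for SPARFlow, has a simple closed form that a short sketch can make concrete. The code below is not taken from the workflow (which runs in KNIME). It assumes the usual definition of the structure-activity landscape index, SALI(i, j) = |A_i - A_j| / (1 - sim(i, j)), with Tanimoto similarity over fingerprint bit sets. The fingerprints and activity values are made up for illustration.

```python
def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def sali(activity_a: float, activity_b: float, similarity: float) -> float:
    """Structure-activity landscape index for one compound pair.

    Large values flag activity cliffs: similar structures, very different activity.
    Identical fingerprints (similarity = 1) give an infinite index.
    """
    if similarity >= 1.0:
        return float("inf")
    return abs(activity_a - activity_b) / (1.0 - similarity)


# Hypothetical pair: two on-bits shared out of four in the union, activities 3 log units apart.
fp_a, fp_b = {1, 2, 3}, {1, 2, 4}
similarity = tanimoto(fp_a, fp_b)
cliff_score = sali(8.0, 5.0, similarity)
print(similarity, cliff_score)
```

Ranking all compound pairs by this score and inspecting the top of the list is a common way to locate activity cliffs in a dataset.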
Pub Date : 2026-01-30 DOI: 10.1186/s13321-026-01155-z
Takuto Koyama, Hayato Tsumura, Ryunosuke Okita, Kimihiro Yamazaki, Aki Hasegawa, Keiko Imamura, Takashi Kato, Hiroaki Iwata, Ryosuke Kojima, Haruhisa Inoue, Shigeyuki Matsumoto, Yasushi Okuno
Accurate prediction of compound-protein interactions (CPIs) is crucial for chemical biology and drug discovery. Despite recent advancements, existing deep learning (DL)-based CPI models often struggle to simultaneously achieve high generalization performance, quantify prediction confidence, and ensure explainability. Here, we propose ChemGLaM, a chemical genomics language model designed to address these three crucial challenges, thereby enabling reliable and explainable CPI predictions. ChemGLaM integrates independently pre-trained chemical and protein language models through an interaction block with a cross-attention mechanism, achieving near state-of-the-art performance in predicting novel CPIs at a low computational cost. Incorporating uncertainty estimation and attention visualization enables ChemGLaM to enhance the success rate of virtual screening and to provide molecular insights into CPIs. To demonstrate the practical impact of ChemGLaM, we constructed a publicly available database containing large-scale CPI predictions for every possible pairing between all 20,434 human proteins and all 11,455 drugs and validated its practical applicability in a case study on amyotrophic lateral sclerosis. ChemGLaM marks an important step forward in addressing the challenges of AI-driven CPI exploration and drug discovery. Scientific Contribution: This study established a unified CPI prediction framework that simultaneously achieves high generalization performance, confidence quantification, and explainability. We leveraged this framework to create a community resource by constructing a comprehensive CPI database and demonstrated its practical utility by successfully prioritizing hit compounds and deconvoluting their targets in a phenotypic screening for amyotrophic lateral sclerosis.
Title: Chemical genomics language model toward reliable and explainable compound-protein interaction exploration. Journal: Journal of Cheminformatics.
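The cross-attention mechanism named in the ChemGLaM abstract can be sketched generically. This is a plain single-head scaled dot-product cross-attention, not the ChemGLaM interaction block: the learned query, key and value projections are omitted, and the token embeddings are invented two-dimensional vectors. Compound tokens act as queries and protein tokens as keys and values. The returned weight matrix is the kind of quantity an attention visualization would display.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def cross_attention(queries, keys, values):
    """Scaled dot-product attention of each query over all keys.

    Returns (outputs, weights): one output vector and one weight row per query.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        row = softmax([dot(q, k) / scale for k in keys])
        weights.append(row)
        outputs.append(
            [sum(w * v[d] for w, v in zip(row, values)) for d in range(len(values[0]))]
        )
    return outputs, weights


# Hypothetical embeddings: 2 compound tokens and 3 protein tokens, 2 dimensions each.
compound_tokens = [[1.0, 0.0], [0.0, 1.0]]
protein_tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
attended, attn_weights = cross_attention(compound_tokens, protein_tokens, protein_tokens)
```

Each row of `attn_weights` sums to one and shows how strongly a compound token attends to each protein token.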
Pub Date : 2026-01-30 DOI: 10.1186/s13321-026-01157-x
Ning-Ning Wang, Yuan-Hang He, Xin-Liang Li, Shao-Hua Shi, You-Chao Deng, Shao Liu, Dong-Sheng Cao
As the important role of substructural alerts (SAs) in drug development and toxicity evaluation has become clear, many automatic substructure extraction tools based on different theoretical foundations have been reported in recent years. To compare the emphasis of the various substructure extraction methods and the reliability of their results, we conducted a comprehensive analysis of seven representative tools to find the best one. In this paper, we present an evaluation of seven popular tools (Bioalerts, KRFP, MoSS, PySmash_circular, PySmash_group, PySmash_path, and SARpy) based on 43 toxicity datasets, consisting of four components: comparison of substructures derived by different methods, comparison of predictive models based on substructural rules, comparison of the efficiency of extracting toxic substructures, and the effect of SAs on quantitative structure-activity relationship (QSAR) predictive models. The results demonstrated that PySmash_circular performed best overall, with satisfactory results in both the information carried by its substructures and the rule-based predictive models. PySmash_path and Bioalerts are also recommended for performing similarly to PySmash_circular, but their main drawback is that they took too much time and generated too many substructures. Specifically, Bioalerts and PySmash_circular obtained substructures carrying richer information, while SARpy had the best rule-based predictive models, although it considers only the precision (PR) value when evaluating individual SAs. Moreover, the substructures obtained by all 7 methods can enhance the ability of QSAR models to recognize toxic compounds and make the models interpretable. Finally, we have made a baseline substructure set for 43 toxicity endpoints publicly available to support faster and more accurate drug research and environmental safety assessment. Scientific Contribution: Based on 43 toxicity datasets, we conducted a comprehensive evaluation of 7 representative substructure extraction tools from the perspective of both individual substructures and substructure-based models. This work not only enables users to make more informed choices of the optimal substructure extraction tool, but also provides the public with a benchmark substructure set for 43 toxicity endpoints, promoting the further development of computational toxicology.
Title: A comprehensive evaluation of advanced methods for identifying structural alerts using extensive toxicity data. Journal: Journal of Cheminformatics.
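The precision (PR) value used to score an individual structural alert can be illustrated with a small sketch. It assumes the standard definition of precision (toxic compounds containing the alert divided by all compounds containing the alert) and uses precomputed, hypothetical match flags in place of a real substructure search. None of the evaluated tools is called here.

```python
def alert_precision(contains_alert: list[bool], is_toxic: list[bool]) -> float:
    """Precision of one structural alert over a labeled dataset.

    contains_alert[i] is True if compound i matches the substructure;
    is_toxic[i] is the experimental label of compound i.
    """
    true_pos = sum(1 for has, tox in zip(contains_alert, is_toxic) if has and tox)
    false_pos = sum(1 for has, tox in zip(contains_alert, is_toxic) if has and not tox)
    matched = true_pos + false_pos
    return true_pos / matched if matched else 0.0


# Hypothetical dataset of six compounds: four match the alert, three of those are toxic.
matches = [True, True, True, False, False, True]
labels = [True, True, False, True, False, True]
precision = alert_precision(matches, labels)
print(precision)
```

Precision alone ignores how many toxic compounds the alert misses (the fourth compound here), which is why judging alerts on PR only gives a partial picture of their usefulness.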
Pub Date : 2026-01-28 DOI: 10.1186/s13321-026-01151-3
Seul Lee, Jooyeon Lee, Unghwi Yoon, Jahyun Koo, Young Wook Yoon, Yoonjae Cho, Seung-Ryul Hwang, Keunhong Jeong
The rapid proliferation of chemical substances presents significant challenges in assessing their safety-critical physicochemical properties. This study presents an integrated approach using Graph Neural Networks (GNNs) to predict three crucial properties for chemical safety assessment: Heat of Combustion (HoC), Vapor Pressure (VP), and Flashpoint. Leveraging comprehensive datasets of 4780, 3573, and 14,696 compounds, respectively, we developed a unified prediction model that outperforms existing approaches. Our model achieves mean absolute errors of 126 J/mol (R² = 0.993) for HoC, 0.617 log units (R² = 0.898) for VP, and 14.42 °C (R² = 0.839) for Flashpoint, representing notable improvements over conventional methods. Through detailed analysis, we identified and addressed a specific challenge in predicting HoC for cyclic compounds by implementing a hybrid approach combining DFT calculations and Random Forest modeling. This specialized treatment expanded our cyclic compound dataset from 12 to 55 compounds and achieved an R² of 0.918 for these traditionally challenging structures. The model was integrated into a real-time prediction system using Flask, allowing users to input chemical structures through SMILES notation or direct drawing. The system includes features for comparing predictions with experimental data and benchmarking against common industrial chemicals (acetone, n-hexane, and n-decane), enhancing its practical utility in emergency response scenarios. Our approach provides a robust, unified solution for predicting multiple safety-critical properties simultaneously, addressing a crucial need in chemical safety assessment and emergency response planning. SCIENTIFIC CONTRIBUTION: Overall, this study provides an integrated framework that deploys three GNN-based prediction models within a common architecture and a real-time prediction system. 
For cyclic compounds, which exhibit systematic prediction challenges under the GNN framework, we incorporate a targeted alternative modeling strategy to improve predictive reliability, thereby enhancing the practical applicability of machine-learning approaches to chemical safety assessment.
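The abstract above reports each model's quality as a mean absolute error together with a coefficient of determination (R²). As a reminder of how these two figures are conventionally computed, here is a minimal sketch in plain Python. The function names and the toy numbers are illustrative assumptions; this is not the authors' evaluation code, and the abstract does not state which implementation they used.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between measured and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all measured values are identical")
    return 1.0 - ss_res / ss_tot


# Toy data (hypothetical, not taken from the paper).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
mae = mean_absolute_error(measured, predicted)  # 0.25
r2 = r_squared(measured, predicted)             # 0.8
```

The two metrics answer different questions: the mean absolute error is in the units of the property itself, while R² is unitless and depends on the spread of the measured values, which is why the abstract quotes both for each property.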
{"title":"Advancing chemical safety prediction: an integrated GNN framework with DFT-augmented cyclic compound solution.","authors":"Seul Lee,Jooyeon Lee,Unghwi Yoon,Jahyun Koo,Young Wook Yoon,Yoonjae Cho,Seung-Ryul Hwang,Keunhong Jeong","doi":"10.1186/s13321-026-01151-3","DOIUrl":"https://doi.org/10.1186/s13321-026-01151-3","url":null,"PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"88 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146070068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-27 DOI: 10.1186/s13321-026-01150-4
Ahmed Karam, Asmaa Ramzy, Taghreed Khaled Abdelmoneim, Maha Mokhtar, Nada A Youssef, Aya Osama, Nabila Sabar, Sameh Magdeldin
The expansion of untargeted metabolomics has made publicly accessible spectral libraries indispensable for metabolite annotation and machine learning applications. Enhancing the quality and consistency of these libraries is crucial for improving the accuracy of metabolite identification and training machine learning models. However, public spectral libraries often suffer from variability in user submissions, unintentional errors, and a lack of standardization. Existing metadata cleaning and normalization tools typically exclude spectra with incorrect or unsupported metadata rather than attempting to correct them, resulting in the loss of valuable spectral data and associated metabolite details. This study introduces STRIKER (SpecTRal lIbrary maKER), a repair tool specifically designed to address adduct metadata deficiencies using a distance-based metric and a deep learning model. STRIKER leverages advanced similarity-based approaches to predict adducts in spectra lacking adduct metadata. It corrects adduct-related errors and standardizes adduct formatting using a deep learning model based on the multi-layer perceptron (MLP) algorithm. STRIKER achieved 95-99% correct adduct matching and 98% adduct correction accuracy. These corrections substantially reduce the number of missing or unusable spectra and metabolites, thereby enhancing the accuracy of metabolite identification and improving data quality for machine learning applications. The tool also facilitates convenient construction of the Human Metabolome Database (HMDB) spectral library by integrating data files from the HMDB website. Furthermore, it enables users to extract customized sub-libraries from larger libraries, supporting tailored analyses for specific research objectives with a precise search space. STRIKER is an open-source, user-friendly Python graphical interface designed to be accessible to researchers with minimal bioinformatics expertise. It is available under an MIT license at https://striker-gui.sourceforge.io.
Scientific contribution: The software is designed to preserve the maximum number of valid spectra in open mass spectral libraries, thereby supporting more comprehensive metabolite annotation in untargeted metabolomics. Its graphical user interface further facilitates the engagement of researchers without programming expertise, enabling them to enhance the quality and usability of spectral libraries.
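The abstract does not say which distance STRIKER uses to assign a missing adduct. To make the general idea concrete, the sketch below picks the adduct whose theoretical m/z lies closest to the observed precursor m/z, given a known neutral monoisotopic mass. Everything here is an assumption for illustration only: the function `closest_adduct`, the small adduct table, and the 0.01 Da tolerance are hypothetical and are not taken from STRIKER.

```python
from typing import Optional

# Mass shifts (Da) added to the neutral monoisotopic mass for a few
# common singly charged adducts. This short table is illustrative only.
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,
    "[M+NH4]+": 18.033823,
    "[M+Na]+": 22.989218,
    "[M+K]+": 38.963158,
    "[M-H]-": -1.007276,
}


def closest_adduct(
    precursor_mz: float,
    neutral_mass: float,
    tolerance_da: float = 0.01,  # hypothetical cutoff, not a STRIKER setting
) -> Optional[str]:
    """Return the adduct label with the smallest absolute m/z error.

    Returns None when even the best candidate is outside the tolerance,
    so an implausible assignment is never forced.
    """
    best_label: Optional[str] = None
    best_error = float("inf")
    for label, shift in ADDUCT_SHIFTS.items():
        error = abs(precursor_mz - (neutral_mass + shift))
        if error < best_error:
            best_label, best_error = label, error
    return best_label if best_error <= tolerance_da else None


# Glucose (monoisotopic mass 180.06339 Da) observed at m/z 203.0526.
label = closest_adduct(203.0526, 180.06339)  # "[M+Na]+"
```

Returning `None` outside the tolerance reflects the concern raised in the abstract: a spectrum with unrecognized metadata is flagged for further handling instead of being silently mislabeled or discarded.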
{"title":"STRIKER: a spectral metadata repairing tool for expanding the comprehensiveness of spectral libraries.","authors":"Ahmed Karam,Asmaa Ramzy,Taghreed Khaled Abdelmoneim,Maha Mokhtar,Nada A Youssef,Aya Osama,Nabila Sabar,Sameh Magdeldin","doi":"10.1186/s13321-026-01150-4","DOIUrl":"https://doi.org/10.1186/s13321-026-01150-4","url":null,"PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"8 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2026-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146056729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-20 DOI: 10.1186/s13321-025-01133-x
Rajarshi Guha
{"title":"Paths to cheminformatics: Q&A with Rajarshi Guha","authors":"Rajarshi Guha","doi":"10.1186/s13321-025-01133-x","DOIUrl":"10.1186/s13321-025-01133-x","url":null,"abstract":"","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"18 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1186/s13321-025-01133-x.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146005040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}