Pub Date: 2025-11-12; DOI: 10.1186/s13321-025-01110-4
Minsu Park, Yewon Shin, Hyunho Kim, Hojung Nam
The evaluation of potential drug toxicity is a crucial step in early drug development. In vivo toxicity assessment represents a key challenge that must be addressed before advancing to clinical trials. However, traditional in vivo experiments primarily rely on animal models, raising concerns regarding cost, time efficiency, and ethical considerations. To address these challenges, various computational approaches have been developed to support in vivo toxicity evaluations, though these methods often demonstrate limited generalizability due to data scarcity. In this study, we propose MT-Tox, a knowledge transfer-based multi-task learning model specifically designed for in vivo toxicity prediction that overcomes data scarcity. Our model implements a sequential knowledge transfer strategy across three stages: general chemical knowledge pretraining, in vitro toxicological auxiliary training, and in vivo toxicity fine-tuning. This hierarchical approach significantly improves model performance by systematically leveraging information from both chemical structure and toxicity data sources. MT-Tox outperforms baseline models across three in vivo toxicity endpoints: carcinogenicity, drug-induced liver injury (DILI), and genotoxicity. Through ablation studies and attention analyses, we demonstrate that each knowledge transfer technique makes meaningful contributions to the prediction process. Finally, we demonstrate the real-world application of our model as a prediction tool for early-stage drug discovery through comprehensive DrugBank database screening.
Scientific contribution: We propose a knowledge transfer framework that integrates chemical and in vitro toxicological information to enhance in vivo toxicity prediction in low-data regimes. Our model provides dual-level interpretability across chemical and biological domains through an attention mechanism. Moreover, we demonstrate our model’s applicability by screening the DrugBank database, simulating practical toxicity screening scenarios in drug development.
Title: Enhancing multi-task in vivo toxicity prediction via integrated knowledge transfer of chemical knowledge and in vitro toxicity information
Journal: Journal of Cheminformatics, vol. 17, no. 1 (published 2025-11-12)
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01110-4
Pub Date: 2025-11-11; DOI: 10.1186/s13321-025-01109-x
Roy Aerts, Joris Tavernier, Alan Kerstjens, Mazen Ahmad, Jose Carlos Gómez-Tamayo, Gary Tresadern, Hans De Winter
Peptide drug development is currently receiving due attention as a modality between small and large molecules. Therapeutic peptides offer an opportunity to achieve high potency and selectivity and to reach intracellular targets. A new era in the development of therapeutic peptides emerged with the arrival of cyclic peptides, which avoid the limitations of parenteral administration by achieving sufficient oral bioavailability. However, improving the membrane permeability of cyclic peptides remains one of the principal bottlenecks. Here, we introduce a deep learning regression model of cyclic peptide membrane permeability based on publicly available data. The model starts from a chemical structure and goes beyond limited-vocabulary language models, generalizing to monomers outside the training dataset. Moreover, we introduce an efficient estimator2generative wrapper that enables the model to be used for direct molecular optimization of membrane permeability via chemical modification. We name our application C2PO (Cyclic Peptide Permeability Optimizer). Lastly, we demonstrate how a molecule correction tool can be used to limit the presence of unfamiliar chemistry in the generated molecules.
Scientific contribution: We provide an ML-driven optimizer application, named C2PO, that returns structurally modified cyclic peptides with improved membrane permeability; optimizing permeability is one of the pivotal tasks in drug discovery and development. C2PO is a first-in-class application for improving cyclic peptide permeability, in that it converts an ML model into a generative optimizer of chemical structures. Additionally, we demonstrate and encourage the use of an automated post-correction tool with a chemistry reference library to correct strange chemistry in C2PO outputs, a known issue for ML-generated chemical structures.
Title: C2PO: an ML-powered optimizer of the membrane permeability of cyclic peptides through chemical modification
Journal: Journal of Cheminformatics, vol. 17, no. 1 (published 2025-11-11)
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01109-x
Cyclic peptides with backbone N-methylated amino acids (BNMeAAs) and D-amino acids (D-AAs) have gained increasing attention for their stability, membrane permeability, and other therapeutic potential. Currently, Rosetta simple_cycpep_predict (SCP) can predict their structures through energy-based calculations, but this approach is computationally intensive and time-consuming. Moreover, the available crystal structures of such cyclic peptides remain highly limited, hindering the development of data-driven structure prediction models. To address these challenges, we propose HighFold-MeD, a deep learning-based framework that distills knowledge from Rosetta SCP by fine-tuning the AlphaFold model. First, a cyclic peptide structure dataset is constructed using Rosetta SCP by sampling a large number of conformations for cyclic peptides with BNMeAAs and D-AAs and evaluating their energy scores. The AlphaFold model is then fine-tuned to incorporate an extended set of 56 BNMeAAs and D-AAs. In addition, a relative position cyclic matrix is introduced to explicitly model head-to-tail cyclization. Finally, a force field is employed to minimize steric clashes in the predicted structures. Experiments demonstrate that HighFold-MeD achieves accuracy comparable to that of Rosetta on datasets sampled with the Rosetta SCP module (key parameters: nstruct = 500 and cyclic_peptide:genkic_closure_attempts = 1000), while accelerating structure prediction 50-fold, thereby significantly expediting the development of cyclic peptide-based therapeutics.
Scientific contribution
We propose HighFold-MeD, which provides a rapid and relatively accurate approach for predicting the structures of cyclic peptides containing backbone N-methylated amino acids and D-amino acids—key building blocks in peptide drug design. By distilling the knowledge of Rosetta SCP under specific parameters into a fine-tuned AlphaFold framework, our method achieves a 50-fold acceleration while maintaining relatively high accuracy, thereby enabling large-scale cyclic peptide drug design.
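The abstract does not spell out how the relative position cyclic matrix is built. As a minimal sketch, assuming it stores the signed shortest offset between two residues around the ring (the function name `cyclic_offset_matrix` and this exact definition are illustrative assumptions, not taken from the paper), the contrast with a linear offset matrix looks like this:

```python
def cyclic_offset_matrix(n: int) -> list[list[int]]:
    """Signed shortest offsets between residues on a ring of n residues.

    Assumption for illustration: entry [i][j] is the number of steps from
    residue i to residue j, taking whichever direction around the ring is
    shorter (negative values mean stepping backwards).
    """
    matrix = []
    for i in range(n):
        row = []
        for j in range(n):
            d = (j - i) % n      # forward steps from i to j around the ring
            if d > n // 2:       # the backward path is shorter
                d -= n
            row.append(d)
        matrix.append(row)
    return matrix


# Linear peptide: the last residue is 5 positions away from the first.
linear = [[j - i for j in range(6)] for i in range(6)]
# Head-to-tail cyclic peptide: the last residue is a direct neighbour (-1).
cyclic = cyclic_offset_matrix(6)

print(linear[0])  # [0, 1, 2, 3, 4, 5]
print(cyclic[0])  # [0, 1, 2, 3, -2, -1]
```

In the cyclic version the first and last residues are encoded as neighbours, which is the information a linear positional encoding would miss for a head-to-tail cyclized peptide.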
Title: HighFold-MeD: a Rosetta distillation model to accelerate structure prediction of cyclic peptides with backbone N-methylation and d-amino acids
Authors: Zhigang Cao, Sen Cao, Linghong Wang, Zhiguo Wang, Qingyi Mao, Jingjing Guo, Hongliang Duan
DOI: 10.1186/s13321-025-01111-3
Journal: Journal of Cheminformatics, vol. 17, no. 1 (published 2025-11-10)
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01111-3
Pub Date: 2025-11-03; DOI: 10.1186/s13321-025-01102-4
Huynh Anh Duy, Tarapong Srisongkram
Antioxidant peptides (AOPs) have emerged as promising peptide agents due to their efficacy in counteracting oxidative stress-related diseases and their applicability in the functional food and cosmetic industries. In this study, we developed a comprehensive quantitative structure-activity relationship (QSAR) model using a multimodal deep learning framework that integrates six sequence-based structure representations with stacking ensemble neural architectures (convolutional neural networks, bidirectional long short-term memory, and Transformer) to enhance predictive accuracy. Additionally, we employed a generative model to design novel AOP candidates, which were subsequently evaluated using the best-performing QSAR model. The stacking models using one-hot encoding achieved accuracy, AUROC, and AUPRC values above 0.90 and an MCC above 0.80, indicating a highly accurate and robust QSAR model. SHAP analysis highlighted that proline, leucine, alanine, tyrosine, and glycine are the top five residues that positively influence antioxidant activity, whereas methionine, cysteine, tryptophan, asparagine, and threonine negatively impact antioxidant activity. Finally, 604 high-confidence AOPs were computationally identified. This study demonstrates that the multimodal framework improves the accuracy, robustness, and interpretability of AOP activity prediction. It also enables the efficient discovery of high-potential AOPs, thereby offering a powerful pipeline for accelerating peptide discovery in pharmaceutical and functional applications.
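To make the reported thresholds concrete, the sketch below computes accuracy and the Matthews correlation coefficient (MCC) from a binary confusion matrix using the standard formulas. The counts are made up for illustration and are not results from the paper; the function name `classification_metrics` is likewise hypothetical.

```python
import math


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> tuple[float, float]:
    """Return (accuracy, MCC) for a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when a row or column of the matrix is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Illustrative counts only (balanced classes, 10% errors on each side).
acc, mcc = classification_metrics(tp=90, fp=10, tn=90, fn=10)
print(acc, mcc)
```

With these balanced toy counts, an accuracy of 0.90 corresponds to an MCC of 0.80. On imbalanced data the two metrics diverge, which is one reason to report both.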
Title: Accurate structure-activity relationship prediction of antioxidant peptides using a multimodal deep learning framework
Journal: Journal of Cheminformatics, vol. 17, no. 1 (published 2025-11-03)
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01102-4
Pub Date: 2025-10-30; DOI: 10.1186/s13321-025-01079-0
Veronika Ganeeva, Kuzma Khrabrov, Artur Kadurin, Elena Tutubalina
The recent integration of natural language processing into chemistry has advanced drug discovery. Molecule representations in language models (LMs) are crucial to enhancing chemical understanding. We explored the ability of models to match the same chemical structures despite their different representations. Recognizing the same substance in different representations is an important component of emulating an understanding of how chemistry works. We propose Augmented Molecular Retrieval (AMORE), a flexible zero-shot framework for assessing chemistry LMs of different natures. The framework is based on SMILES augmentations that preserve the underlying chemical structure. The proposed method assesses the similarity between the embedding representations of a molecule, its SMILES variation, and another molecule. Experiments indicate that the tested ChemLLMs are still not robust to different SMILES representations. We evaluated the models on various tasks, including molecular captioning on the ChEBI-20 benchmark and classification and regression tasks from the MoleculeNet benchmark. We show that the changes in results after SMILES string variations align with the proposed AMORE framework.
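The retrieval idea can be sketched as follows: embed each molecule and an augmented SMILES of the same molecule, then check whether the augmented embedding is closest to its own original. This is a minimal illustration, not the authors' implementation; the toy 2-D vectors stand in for real model embeddings, and cosine similarity and top-1 accuracy are assumed choices of similarity and metric.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top1_retrieval_accuracy(original: list[list[float]],
                            augmented: list[list[float]]) -> float:
    """Fraction of augmented embeddings whose nearest original is their own source.

    augmented[i] is assumed to be the embedding of a SMILES variation of the
    molecule whose embedding is original[i].
    """
    hits = 0
    for idx, query in enumerate(augmented):
        best = max(range(len(original)), key=lambda k: cosine(query, original[k]))
        hits += best == idx
    return hits / len(augmented)


# Toy embeddings (hypothetical): the third augmented vector drifts towards molecule 0.
original = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
augmented = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]

accuracy = top1_retrieval_accuracy(original, augmented)
print(round(accuracy, 3))
```

A model that is robust to SMILES rewriting would keep this accuracy close to 1; here the third molecule is retrieved incorrectly, so the toy score is 2/3.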
Title: Measuring Chemical LLM robustness to molecular representations: a SMILES variation-based framework
Journal: Journal of Cheminformatics, vol. 17, no. 1 (published 2025-10-30)
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01079-0
Pub Date: 2025-10-30; DOI: 10.1186/s13321-025-01107-z
Felipe Victoria-Muñoz, Janosch Menke, Norberto Sanchez-Cruz, Oliver Koch
Machine learning models using protein-ligand interaction fingerprints show promise as target-specific scoring functions in drug discovery, but their performance critically depends on the underlying decoy selection strategies. Given this dependence, we analyzed various decoy selection strategies to enhance machine learning models based on the Protein per Atom Score Contributions Derived Interaction Fingerprint (PADIF). We explored three distinct workflows for decoy selection: (1) random selection from extensive databases like ZINC15, (2) leveraging recurrent non-binders from high-throughput screening (HTS) assays stored as dark chemical matter, and (3) data augmentation by utilizing diverse conformations from docking results. Active molecules from ChEMBL, combined with these decoy approaches, were used to train and test different machine learning models based on PADIF. The final validation was performed on experimentally determined inactive compounds from the LIT-PCBA dataset. Our findings reveal that models trained with random selections from ZINC15 and compounds from dark chemical matter closely mimic the performance of those trained with actual non-binders, presenting viable alternatives for creating accurate models in the absence of specific inactivity data. Furthermore, all models showed an enhanced ability to explore new chemical spaces for their specific target and improved the selection of top active compounds over classical scoring functions, thereby boosting the screening power in molecular docking. These findings demonstrate that appropriate decoy selection strategies can maintain model accuracy while expanding applicability to targets that lack extensive experimental data.
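The first workflow, random decoy selection from a large database, can be sketched in a few lines. This is a hedged illustration only: the compound identifiers, the `ratio` parameter, and the function name `sample_random_decoys` are hypothetical, and the paper may apply additional filters that the abstract does not describe.

```python
import random


def sample_random_decoys(pool: list[str], actives: list[str],
                         ratio: int, seed: int = 0) -> list[str]:
    """Draw `ratio` decoys per active at random, never reusing a known active.

    `pool` stands in for a large compound database; sorting the candidates
    first makes the draw reproducible for a given seed.
    """
    candidates = sorted(set(pool) - set(actives))
    n = min(len(candidates), ratio * len(actives))
    return random.Random(seed).sample(candidates, n)


# Hypothetical identifiers; a real pool would hold database compound IDs.
pool = [f"cmpd_{i:02d}" for i in range(1, 11)]
actives = ["cmpd_01", "cmpd_02"]

decoys = sample_random_decoys(pool, actives, ratio=2, seed=42)
print(len(decoys), decoys)
```

Fixing the seed keeps training sets reproducible, which matters when models trained on different decoy strategies are compared against each other.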
Title: Efficient decoy selection to improve virtual screening using machine learning models
Journal: Journal of Cheminformatics, vol. 17, no. 1 (published 2025-10-30)
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01107-z
Data heterogeneity and distributional misalignments pose critical challenges for machine learning models, often compromising predictive accuracy. These challenges are exemplified in preclinical safety modeling, a crucial step in early-stage drug discovery where limited data and experimental constraints exacerbate integration issues. Analyzing public ADME datasets, we uncovered significant misalignments as well as inconsistent property annotations between gold-standard and popular benchmark sources, such as Therapeutic Data Commons. These dataset discrepancies, which can arise from differences in various factors, including experimental conditions in data collection as well as chemical space coverage, can introduce noise and ultimately degrade model performance. Data standardization, despite harmonizing discrepancies and increasing the training set size, may not always lead to an improvement in predictive performance. This highlights the importance of rigorous data consistency assessment (DCA) prior to modeling. To facilitate a systematic DCA across diverse datasets, we developed AssayInspector, a model-agnostic package that leverages statistics, visualizations, and diagnostic summaries to identify outliers, batch effects, and discrepancies. Beyond preclinical safety, DCA can play a crucial role in federated learning scenarios, enabling effective transfer learning across heterogeneous data sources and supporting reliable integration across diverse scientific domains.
Scientific contribution: By systematically analyzing public ADME datasets, we found substantial distributional misalignments and annotation discrepancies between benchmark and gold-standard sources. These issues were shown to undermine predictive modeling, as naive integration or standardization often degrades performance. To address them, we present AssayInspector, a tool that enables consistency assessment and informed data integration, providing a foundation for more reliable predictive modeling in drug discovery.
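One simple check in the spirit of a data consistency assessment is to compare the annotations that two sources give for the same molecule and flag large disagreements before merging. The sketch below is a generic illustration and does not use the AssayInspector API; the dictionaries, the tolerance, and the function name `find_discrepancies` are all assumptions.

```python
def find_discrepancies(source_a: dict[str, float], source_b: dict[str, float],
                       tolerance: float) -> list[tuple[str, float, float]]:
    """List molecules annotated in both sources whose values differ by more than `tolerance`."""
    shared = sorted(source_a.keys() & source_b.keys())
    return [
        (mol, source_a[mol], source_b[mol])
        for mol in shared
        if abs(source_a[mol] - source_b[mol]) > tolerance
    ]


# Hypothetical property values keyed by a molecule identifier.
gold_standard = {"mol_1": 1.20, "mol_2": 3.10, "mol_3": 0.75}
benchmark = {"mol_1": 1.25, "mol_2": 0.40, "mol_4": 2.00}

conflicts = find_discrepancies(gold_standard, benchmark, tolerance=0.5)
print(conflicts)  # [('mol_2', 3.1, 0.4)]
```

Flagged molecules can then be inspected, for example for differing experimental conditions, before deciding whether the two datasets should be integrated at all.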
Title: Enhancing molecular property prediction through data integration and consistency assessment
Authors: Raquel Parrondo-Pizarro, Luca Menestrina, Ricard Garcia-Serna, Adrià Fernández-Torras, Jordi Mestres
DOI: 10.1186/s13321-025-01103-3
Journal: Journal of Cheminformatics, vol. 17, no. 1 (published 2025-10-29)
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01103-3
Pub Date : 2025-10-28 DOI: 10.1186/s13321-025-01098-x
Brenda de Souza Ferrari, Ronaldo Giro, Mathias B. Steiner
Artificial Intelligence (AI) techniques are transforming the computational discovery and design of polymers. The key enablers for polymer informatics are machine-readable molecular string representations of the building blocks of a polymer, i.e., the monomers. In monomer strings, such as SMILES, symbols at the head and tail atoms indicate the locations of bond formation during polymerization. Since the linking of monomers determines a polymer’s properties, the performance of AI prediction models will, ultimately, be limited by the accuracy of the head and tail assignments in the monomer SMILES. Considering the large number of polymer precursors available in chemical databases, reliable methods for the automated assignment of head and tail atoms are needed. Here, we report a method for assigning head and tail atoms in monomer SMILES by analyzing the reactivity of their functional groups based on the atomic index of nucleophilicity. In a reference data set containing 206 polymer precursors, the HeadTailAssign (HTA) algorithm correctly predicted the polymer class of 204 monomer SMILES, achieving an accuracy of 99%. The head and tail atoms were correctly assigned to 187 monomer SMILES, representing an accuracy of 91%. The HTA code is available for validation and reuse at https://github.com/IBM/HeadTailAssign.
The algorithm was successfully applied to data pre-processing by tagging the linkage bonds in monomers for defining the repeat units in polymerization reactions.
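The two accuracy figures quoted above follow directly from the reported counts. The short Python check below reproduces them; the function and variable names are illustrative only and are not taken from the HTA code base.

```python
# Counts reported in the abstract above.
TOTAL_MONOMERS = 206
CORRECT_POLYMER_CLASS = 204
CORRECT_HEAD_TAIL = 187


def accuracy_percent(correct: int, total: int) -> float:
    """Return the share of correct predictions as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


class_accuracy = accuracy_percent(CORRECT_POLYMER_CLASS, TOTAL_MONOMERS)
head_tail_accuracy = accuracy_percent(CORRECT_HEAD_TAIL, TOTAL_MONOMERS)

print(f"polymer class: {class_accuracy:.1f}%")      # 99.0%
print(f"head/tail:     {head_tail_accuracy:.1f}%")  # 90.8%
```

Rounded to whole percentages, these values give the 99% and 91% stated in the abstract.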
Title: HTA - An open-source software for assigning head and tail positions to monomer SMILES in polymerization reactions. Journal of Cheminformatics, volume 17, issue 1. Published 2025-10-28. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01098-x
Pub Date : 2025-10-24 DOI: 10.1186/s13321-025-01104-2
Jiří Hostaš, Mohammad S. Ghaemi, Hang Hu, Junan Lin, Anguang Hu, Hsu K. Ooi
Generative Artificial Intelligence is transforming molecular discovery by enabling exploration of the vast, largely unexplored chemical space. However, current methods, including normalizing flows, struggle to balance the optimization of complex objectives and sampling speed, particularly when generating specific compound classes and more intricate scaffolds, such as aromatic rings. This work developed a generative model that efficiently samples novel molecules while optimizing their drug-likeness, ease of synthesis, or chemical reactivity. To achieve this, we employed normalizing flows combined with variational autoencoders to generate samples, which were evaluated for the Quantitative Estimate of Drug-likeness, the Synthetic Accessibility scores and, in the case of organofluorine-phosphates, the electronic density on the central phosphorus atom, approximated by Hirshfeld charges calculated with density functional theory. Our framework efficiently generated a diverse range of organofluorine-phosphates, demonstrating that combining normalizing flows directly with SELFIES or group-SELFIES can address key limitations in inverse molecular design, particularly when variational autoencoders cannot be applied due to a lack of available training data. Normalizing flows capture the chemical structures in a holistic way, which paves the way towards targeted therapies that enable the optimization of complex molecular objectives.
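For readers unfamiliar with normalizing flows, the sketch below shows one generic building block, an affine coupling layer, in plain Python. It is not the VNFlow architecture, which the abstract does not specify. The toy conditioner `_scale_and_shift` and all function names are assumptions made for illustration. The example shows the two properties a flow relies on: the transform is exactly invertible, and its log-determinant is cheap to compute, so the log-likelihood follows from the change-of-variables formula.

```python
import math
from typing import List, Tuple


def _scale_and_shift(x_a: float) -> Tuple[float, float]:
    """Toy conditioner (assumption): a real model would use a neural network."""
    log_scale = math.tanh(x_a)  # bounded log-scale keeps the map well behaved
    shift = 0.5 * x_a
    return log_scale, shift


def coupling_forward(x: List[float]) -> Tuple[List[float], float]:
    """Map a 2-D point x to z and return z with log|det J| of the map."""
    x_a, x_b = x
    log_scale, shift = _scale_and_shift(x_a)
    z_b = x_b * math.exp(log_scale) + shift
    # The Jacobian is triangular with diagonal (1, exp(log_scale)).
    return [x_a, z_b], log_scale


def coupling_inverse(z: List[float]) -> List[float]:
    """Invert coupling_forward exactly; the first coordinate is unchanged."""
    z_a, z_b = z
    log_scale, shift = _scale_and_shift(z_a)
    x_b = (z_b - shift) * math.exp(-log_scale)
    return [z_a, x_b]


def standard_normal_logpdf(z: List[float]) -> float:
    """Log-density of an isotropic standard normal base distribution."""
    return sum(-0.5 * (v * v + math.log(2.0 * math.pi)) for v in z)


def log_likelihood(x: List[float]) -> float:
    """Change of variables: log p(x) = log p_base(f(x)) + log|det J_f(x)|."""
    z, log_det = coupling_forward(x)
    return standard_normal_logpdf(z) + log_det


x = [0.3, -1.2]
z, log_det = coupling_forward(x)
x_reconstructed = coupling_inverse(z)
print(z, log_det, x_reconstructed, log_likelihood(x))
```

In a molecular setting the input would be a learned or encoded representation of a SELFIES string rather than a two-dimensional point, and several such layers would be stacked with the roles of the coordinates alternating.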
Title: VNFlow: integration of variational autoencoders and normalizing flows for novel molecular design. Journal of Cheminformatics, volume 17, issue 1. Published 2025-10-24. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01104-2
Pub Date : 2025-10-23 DOI: 10.1186/s13321-025-01082-5
Adrián Rošinec, Terézia Slanináková, Tomáš Pavlík, Róbert Randiak, Tomáš Svoboda, Tomáš Raček, Gabriela Bučeková, Ondřej Schindler, Matej Antol, Aleš Křenek, Karel Berka, Radka Svobodová
The volume of molecular dynamics (MD) simulation data shared via public repositories is rapidly increasing; however, fragmentation across multiple independent repositories, each employing distinct dataset identifiers and metadata schemas, hinders the efficient exploration and reuse of these data. In this study, we present GROMACS MetaDump, a tool for automatic annotation of output files from GROMACS MD simulations producing human- and machine-readable metadata leveraging the native GROMACS metadata (gmx dump) output. The tool takes the run input simulation file (.tpr) as the basis for the metadata output, optionally extending it with annotations from topology and structure files (.top, .gro). The tool is available as a web application (https://gmd.ceitec.cz/), API service, and a command-line utility. By automating the metadata extraction process, GROMACS MetaDump aims to simplify and standardise the extraction of rich, structured metadata from GROMACS MD simulations, making it easier to share, discover, and reuse simulation data within the research community.
This work introduces GROMACS MetaDump, a software tool for the automatic extraction of metadata from molecular dynamics (MD) simulations performed with GROMACS. GROMACS MetaDump captures all extractable simulation parameters, such as software version, force field, water model, box geometry, temperature, etc., and returns them in a structured JSON or YAML file. As a result, GROMACS MetaDump supports the creation of unified metadata annotations of MD simulations, making datasets indexable and findable in line with the FAIR principles.
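The core idea, turning the text produced by gmx dump into structured metadata, can be illustrated with a few lines of Python. This is a minimal sketch and not the GROMACS MetaDump implementation. The sample text is a hand-written excerpt in the `key = value` style of gmx dump output rather than verbatim output, and the function names are hypothetical.

```python
import json
from typing import Dict, Union

Value = Union[int, float, str]

# Illustrative excerpt only (assumption), not copied from a real run.
SAMPLE_DUMP = """\
inputrec:
   integrator                     = md
   dt                             = 0.002
   nsteps                         = 50000
   tcoupl                         = V-rescale
"""


def _coerce(raw: str) -> Value:
    """Convert a raw value to int or float where possible, else keep text."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def parse_dump(text: str) -> Dict[str, Value]:
    """Collect every 'key = value' line into a flat dictionary."""
    params: Dict[str, Value] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # section headers such as "inputrec:" carry no value
        params[key.strip()] = _coerce(value.strip())
    return params


metadata = parse_dump(SAMPLE_DUMP)
print(json.dumps(metadata, indent=2))
```

Real input could be produced with `gmx dump -s topol.tpr`, where `topol.tpr` is a placeholder file name. A full tool would also need to handle nested sections, array-valued fields, and the topology and structure files mentioned above, which this sketch ignores.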
Title: Gromacs MetaDump: a tool for extracting GROMACS simulation metadata. Journal of Cheminformatics, volume 17, issue 1. Published 2025-10-23. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01082-5