Cyclic peptides containing backbone N-methylated amino acids (BNMeAAs) and D-amino acids (D-AAs) have gained increasing attention for their stability, membrane permeability, and broader therapeutic potential. Currently, Rosetta simple_cycpep_predict (SCP) can predict their structures through energy-based calculations, but this approach is computationally intensive and time-consuming. Moreover, few crystal structures of such cyclic peptides are available, which hinders the development of data-driven structure prediction models. To address these challenges, we propose HighFold-MeD, a deep learning framework that distills knowledge from Rosetta SCP by fine-tuning the AlphaFold model. First, a cyclic peptide structure dataset is constructed with Rosetta SCP by sampling a large number of conformations for cyclic peptides with BNMeAAs and D-AAs and evaluating their energy scores. The AlphaFold model is then fine-tuned to handle an extended set of 56 BNMeAAs and D-AAs. In addition, a relative position cyclic matrix is introduced to explicitly model head-to-tail cyclization. Finally, a force field is employed to minimize steric clashes in the predicted structures. On datasets sampled with the Rosetta SCP module (key parameters: nstruct = 500 and cyclic_peptide:genkic_closure_attempts = 1000), HighFold-MeD achieves accuracy comparable to that of Rosetta while accelerating structure prediction 50-fold, which could significantly expedite the development of cyclic peptide-based therapeutics.
Scientific contribution
We propose HighFold-MeD, which provides a rapid and relatively accurate approach for predicting the structures of cyclic peptides containing backbone N-methylated amino acids and D-amino acids—key building blocks in peptide drug design. By distilling the knowledge of Rosetta SCP under specific parameters into a fine-tuned AlphaFold framework, our method achieves a 50-fold acceleration while maintaining relatively high accuracy, thereby enabling large-scale cyclic peptide drug design.
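The idea of a relative position matrix for head-to-tail cyclization can be illustrated with a small sketch. The abstract does not give HighFold-MeD's exact definition, so the function names and the signed shortest-path convention below are assumptions made for illustration only: in a ring of n residues, the offset between residues i and j is measured along the shorter way around the ring, so the first and last residues become neighbours.

```python
def cyclic_offset(i: int, j: int, n: int) -> int:
    """Signed shortest offset from residue i to residue j on a ring of n residues.

    Hypothetical convention (an assumption, not taken from the paper):
    positive values step forward along the sequence, negative values step
    backward through the head-to-tail bond.
    """
    d = (j - i) % n      # forward steps from i to j, in [0, n)
    if d > n // 2:       # walking backward around the ring is shorter
        d -= n
    return d


def cyclic_offset_matrix(n: int) -> list[list[int]]:
    """Full n x n matrix of cyclic offsets."""
    return [[cyclic_offset(i, j, n) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # In a linear 6-mer, residues 0 and 5 are 5 positions apart;
    # in the cyclic 6-mer they are adjacent.
    for row in cyclic_offset_matrix(6):
        print(row)
```

Compared with the linear offset j - i, this matrix encodes ring closure directly, which is the property a cyclic positional encoding needs.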
HighFold-MeD: a Rosetta distillation model to accelerate structure prediction of cyclic peptides with backbone N-methylation and d-amino acids. Zhigang Cao, Sen Cao, Linghong Wang, Zhiguo Wang, Qingyi Mao, Jingjing Guo, Hongliang Duan. Journal of Cheminformatics 17(1), published 2025-11-10. DOI: 10.1186/s13321-025-01111-3. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01111-3
Pub Date: 2025-11-03; DOI: 10.1186/s13321-025-01102-4
Huynh Anh Duy, Tarapong Srisongkram
Antioxidant peptides (AOPs) have emerged as promising peptide agents due to their efficacy in counteracting oxidative stress-related diseases and their applicability in the functional food and cosmetic industries. In this study, we developed a comprehensive quantitative structure-activity relationship (QSAR) model using a multimodal deep learning framework that integrates six sequence-based structure representations with stacking ensemble neural architectures (convolutional neural networks, bidirectional long short-term memory, and Transformer) to enhance predictive accuracy. Additionally, we employed a generative model to design novel AOP candidates, which were subsequently evaluated using the best-performing QSAR model. The stacking models using one-hot encoding achieved accuracy, AUROC, and AUPRC values above 0.90 and an MCC above 0.80, indicating a highly accurate and robust QSAR model. SHAP analysis highlighted proline, leucine, alanine, tyrosine, and glycine as the top five residues that positively influence antioxidant activity, whereas methionine, cysteine, tryptophan, asparagine, and threonine negatively impact it. Finally, 604 high-confidence AOPs were computationally identified. This study demonstrates that the multimodal framework improves the accuracy, robustness, and interpretability of AOP prediction. It also enables the efficient discovery of high-potential AOPs, thereby offering a powerful pipeline for accelerating peptide discovery in pharmaceutical and functional applications.
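Two ingredients named in this abstract, one-hot encoding of peptide sequences and the Matthews correlation coefficient (MCC), are simple enough to sketch. The following is a minimal illustration, not the authors' pipeline: the 20-letter alphabet ordering, the zero-padding scheme, and the function names are assumptions.

```python
import math

# The 20 standard amino acids; this ordering is an arbitrary assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq: str, max_len: int) -> list[list[int]]:
    """Encode a peptide as a max_len x 20 binary matrix, zero-padded at the end."""
    rows = []
    for pos in range(max_len):
        row = [0] * len(AMINO_ACIDS)
        if pos < len(seq):
            row[AA_INDEX[seq[pos]]] = 1
        rows.append(row)
    return rows


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    encoded = one_hot("PLAYG", max_len=8)
    print(len(encoded), len(encoded[0]))   # matrix shape: 8 rows, 20 columns
    print(mcc(tp=45, tn=45, fp=5, fn=5))   # balanced classifier at 90% accuracy
```

On a balanced test set, 90% accuracy with symmetric errors corresponds to an MCC of 0.8, which gives a feel for how the two reported thresholds relate.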
Accurate structure-activity relationship prediction of antioxidant peptides using a multimodal deep learning framework. Journal of Cheminformatics 17(1), published 2025-11-03. DOI: 10.1186/s13321-025-01102-4. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01102-4
Pub Date: 2025-10-30; DOI: 10.1186/s13321-025-01079-0
Veronika Ganeeva, Kuzma Khrabrov, Artur Kadurin, Elena Tutubalina
The recent integration of natural language processing into chemistry has advanced drug discovery. Molecule representations in language models (LMs) are crucial to enhancing chemical understanding. We explored the ability of models to match the same chemical structures despite their different representations. Recognizing the same substance in different representations is an important component of emulating an understanding of how chemistry works. We propose Augmented Molecular Retrieval (AMORE), a flexible zero-shot framework for assessing chemistry LMs of different kinds. The framework is based on SMILES augmentations that preserve the underlying chemical structure. The proposed method compares the similarity between the embedding of a molecule, that of its SMILES variation, and that of another molecule. Experiments indicate that the tested ChemLLMs are still not robust to different SMILES representations. We evaluated the models on various tasks, including molecular captioning on the ChEBI-20 benchmark and classification and regression tasks from the MoleculeNet benchmark. We show that the changes in results after SMILES string variations align with the proposed AMORE framework.
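The retrieval idea described above can be sketched with toy vectors. This is an illustrative assumption about how such a check might be scored, not AMORE's actual metric or code: each molecule's embedding is used as a query, and a hit is counted when its nearest neighbour among the augmented embeddings is its own SMILES variation. The function names and the top-1 accuracy score are hypothetical.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top1_retrieval_accuracy(original: list[list[float]],
                            augmented: list[list[float]]) -> float:
    """Fraction of molecules whose own augmentation is the nearest neighbour.

    original[i] and augmented[i] are assumed to be embeddings of the same
    molecule written as two different SMILES strings.
    """
    hits = 0
    for i, query in enumerate(original):
        best = max(range(len(augmented)),
                   key=lambda j: cosine(query, augmented[j]))
        hits += int(best == i)
    return hits / len(original)


if __name__ == "__main__":
    # Toy embeddings standing in for model outputs (made-up numbers).
    original = [[1.0, 0.0], [0.0, 1.0]]
    augmented = [[0.9, 0.1], [0.1, 0.9]]
    print(top1_retrieval_accuracy(original, augmented))
```

A model that is robust to SMILES variation would keep this score close to 1; a drop indicates that the embedding depends on the string form rather than on the molecule.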
Measuring Chemical LLM robustness to molecular representations: a SMILES variation-based framework. Journal of Cheminformatics 17(1), published 2025-10-30. DOI: 10.1186/s13321-025-01079-0. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01079-0
Pub Date: 2025-10-30; DOI: 10.1186/s13321-025-01107-z
Felipe Victoria-Muñoz, Janosch Menke, Norberto Sanchez-Cruz, Oliver Koch
Machine learning models using protein-ligand interaction fingerprints show promise as target-specific scoring functions in drug discovery, but their performance critically depends on the underlying decoy selection strategies. Recognizing this critical role in model performance, various decoy selection strategies were analyzed to enhance machine learning models based on the Protein per Atom Score Contributions Derived Interaction Fingerprint (PADIF). We explored three distinct workflows for decoy selection: (1) random selection from extensive databases like ZINC15, (2) leveraging recurrent non-binders from high-throughput screening (HTS) assays stored as dark chemical matter, and (3) data augmentation by utilizing diverse conformations from docking results. Active molecules from ChEMBL, combined with these decoy approaches, were used to train and test different machine learning models based on PADIF. The final validation was done by confirming experimentally determined inactive compounds from the LIT-PCBA dataset. Our findings reveal that models trained with random selections from ZINC15 and compounds from dark chemical matter closely mimic the performance of those trained with actual non-binders, presenting viable alternatives for creating accurate models in the absence of specific inactivity data. Furthermore, all models showed an enhanced ability to explore new chemical spaces for their specific target and enhanced the top active compound selection over classical scoring functions, thereby boosting the screening power in molecular docking. These findings demonstrate that appropriate decoy selection strategies can maintain model accuracy while expanding applicability to targets even when lacking extensive experimental data.
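The first workflow, random decoy selection from a large database, could look roughly like the sketch below. The decoy-to-active ratio, the fixed seed, and the function name are assumptions for illustration; the abstract does not state how many decoys were drawn per active or how duplicates were handled.

```python
import random


def sample_random_decoys(actives: list[str], pool: list[str],
                         ratio: int, seed: int = 0) -> list[str]:
    """Draw up to ratio * len(actives) decoys at random from a compound pool.

    Known actives are excluded from the candidates so that no molecule is
    labelled both active and decoy. The ratio is a hypothetical parameter.
    """
    candidates = sorted(set(pool) - set(actives))   # sorted for reproducibility
    n_decoys = min(len(candidates), ratio * len(actives))
    rng = random.Random(seed)
    return rng.sample(candidates, n_decoys)


if __name__ == "__main__":
    actives = ["CCO"]                               # toy SMILES strings
    pool = ["CCO", "CCN", "CCC", "CCCl", "CCBr"]    # stand-in for a ZINC15 subset
    print(sample_random_decoys(actives, pool, ratio=2))
```

Fixing the seed keeps the training set reproducible, which matters when comparing decoy strategies against each other.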
Efficient decoy selection to improve virtual screening using machine learning models. Journal of Cheminformatics 17(1), published 2025-10-30. DOI: 10.1186/s13321-025-01107-z. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01107-z
Data heterogeneity and distributional misalignments pose critical challenges for machine learning models, often compromising predictive accuracy. These challenges are exemplified in preclinical safety modeling, a crucial step in early-stage drug discovery where limited data and experimental constraints exacerbate integration issues. Analyzing public ADME datasets, we uncovered significant misalignments as well as inconsistent property annotations between gold-standard and popular benchmark sources, such as Therapeutic Data Commons. These dataset discrepancies, which can arise from differences in various factors, including experimental conditions in data collection as well as chemical space coverage, can introduce noise and ultimately degrade model performance. Data standardization, despite harmonizing discrepancies and increasing the training set size, may not always lead to an improvement in predictive performance. This highlights the importance of rigorous data consistency assessment (DCA) prior to modeling. To facilitate a systematic DCA across diverse datasets, we developed AssayInspector, a model-agnostic package that leverages statistics, visualizations, and diagnostic summaries to identify outliers, batch effects, and discrepancies. Beyond preclinical safety, DCA can play a crucial role in federated learning scenarios, enabling effective transfer learning across heterogeneous data sources and supporting reliable integration across diverse scientific domains.
By systematically analyzing public ADME datasets, we found substantial distributional misalignments and annotation discrepancies between benchmark and gold-standard sources. These challenges were shown to undermine predictive modeling, because naive integration or standardization often degrades performance. To address these issues, we present AssayInspector, a tool that enables consistency assessment and informed data integration, providing a foundation for more reliable predictive modeling in drug discovery.
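Two of the checks a data consistency assessment might include can be sketched in a few lines. This is not AssayInspector's API, which the abstract does not describe; the function names, the choice of the two-sample Kolmogorov-Smirnov statistic, and the tolerance parameter are assumptions made for illustration.

```python
from bisect import bisect_right


def ks_statistic(a: list[float], b: list[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of two property-value distributions (0 = identical)."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def conflicting_annotations(source_a: dict[str, float],
                            source_b: dict[str, float],
                            tolerance: float) -> dict[str, tuple[float, float]]:
    """Molecules present in both sources whose values differ by more than tolerance."""
    shared = source_a.keys() & source_b.keys()
    return {
        mol: (source_a[mol], source_b[mol])
        for mol in sorted(shared)
        if abs(source_a[mol] - source_b[mol]) > tolerance
    }


if __name__ == "__main__":
    gold = {"mol1": 1.0, "mol2": 2.0, "mol3": 2.5}        # made-up values
    benchmark = {"mol2": 3.5, "mol3": 2.6, "mol4": 0.1}
    print(ks_statistic(list(gold.values()), list(benchmark.values())))
    print(conflicting_annotations(gold, benchmark, tolerance=1.0))
```

A large distribution gap or a long list of conflicting annotations would be a signal to inspect the sources before merging them, in line with the abstract's point that naive integration can hurt performance.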
Enhancing molecular property prediction through data integration and consistency assessment. Raquel Parrondo-Pizarro, Luca Menestrina, Ricard Garcia-Serna, Adrià Fernández-Torras, Jordi Mestres. Journal of Cheminformatics 17(1), published 2025-10-29. DOI: 10.1186/s13321-025-01103-3. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01103-3
Pub Date: 2025-10-28; DOI: 10.1186/s13321-025-01098-x
Brenda de Souza Ferrari, Ronaldo Giro, Mathias B. Steiner
Artificial Intelligence (AI) techniques are transforming the computational discovery and design of polymers. The key enablers for polymer informatics are machine-readable molecular string representations of the building blocks of a polymer, i.e., the monomers. In monomer strings, such as SMILES, symbols at the head and tail atoms indicate the locations of bond formation during polymerization. Since the linking of monomers determines a polymer's properties, the performance of AI prediction models will ultimately be limited by the accuracy of the head and tail assignments in the monomer SMILES. Considering the large number of polymer precursors available in chemical databases, reliable methods for the automated assignment of head and tail atoms are needed. Here, we report a method for assigning head and tail atoms in monomer SMILES by analyzing the reactivity of their functional groups based on the atomic index of nucleophilicity. In a reference data set containing 206 polymer precursors, the HeadTailAssign (HTA) algorithm correctly predicted the polymer class of 204 monomer SMILES, achieving an accuracy of 99%. The head and tail atoms were correctly assigned to 187 monomer SMILES, representing an accuracy of 91%. The HTA code is available for validation and reuse at https://github.com/IBM/HeadTailAssign.
The algorithm was successfully applied to data pre-processing by tagging the linkage bonds in monomers for defining the repeat units in polymerization reactions.
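The two reported accuracies follow directly from the counts given for the 206-precursor reference set; the percentages in the text are these ratios rounded to the nearest whole percent:

```latex
\[
\text{polymer class accuracy} = \frac{204}{206} \approx 0.990 \;(99\%),
\qquad
\text{head/tail accuracy} = \frac{187}{206} \approx 0.908 \;(91\%).
\]
```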
HTA - An open-source software for assigning head and tail positions to monomer SMILES in polymerization reactions. Journal of Cheminformatics 17(1), published 2025-10-28. DOI: 10.1186/s13321-025-01098-x. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01098-x
Pub Date: 2025-10-24; DOI: 10.1186/s13321-025-01104-2
Jiří Hostaš, Mohammad S. Ghaemi, Hang Hu, Junan Lin, Anguang Hu, Hsu K. Ooi
Generative Artificial Intelligence is transforming molecular discovery by enabling exploration of the vast, largely unexplored chemical space. However, current methods, including normalizing flows, struggle to balance the optimization of complex objectives with sampling speed, particularly when generating specific compound classes and more intricate scaffolds, such as aromatic rings. This work developed a generative model that efficiently samples novel molecules while optimizing their drug-likeness, ease of synthesis, or chemical reactivity. To achieve this, we employed normalizing flows combined with variational autoencoders to generate samples, which were evaluated for the Quantitative Estimate of Drug-likeness, the Synthetic Accessibility score and, in the case of organofluorine-phosphates, the electron density on the central phosphorus atom, approximated by Hirshfeld charges calculated with density functional theory. Our framework efficiently generated a diverse range of organofluorine-phosphates, demonstrating that combining normalizing flows directly with SELFIES or group-SELFIES can address key limitations in inverse molecular design, particularly when variational autoencoders cannot be applied due to a lack of available training data. Normalizing flows capture chemical structures holistically, which paves the way towards targeted therapies through the optimization of complex molecular objectives.
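The core mechanism of a normalizing flow is the change-of-variables rule: an invertible map carries samples from a simple base density to the data space, and the log-density is corrected by the log-determinant of the Jacobian. The sketch below shows this for the simplest possible case, a one-dimensional affine map with a standard normal base. It is a generic textbook illustration, not the VNFlow architecture; the class and parameter names are assumptions.

```python
import math


def standard_normal_logpdf(z: float) -> float:
    """Log-density of the N(0, 1) base distribution."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


class AffineFlow:
    """One-dimensional flow x = scale * z + shift with z ~ N(0, 1)."""

    def __init__(self, scale: float, shift: float) -> None:
        if scale == 0:
            raise ValueError("scale must be non-zero for the map to be invertible")
        self.scale = scale
        self.shift = shift

    def forward(self, z: float) -> float:
        """Sampling direction: base variable to data space."""
        return self.scale * z + self.shift

    def inverse(self, x: float) -> float:
        """Density direction: data space back to the base variable."""
        return (x - self.shift) / self.scale

    def log_prob(self, x: float) -> float:
        """log p_x(x) = log p_z(f^{-1}(x)) - log|scale| (change of variables)."""
        return standard_normal_logpdf(self.inverse(x)) - math.log(abs(self.scale))


if __name__ == "__main__":
    flow = AffineFlow(scale=2.0, shift=1.0)
    x = flow.forward(0.5)            # push a base sample into data space
    print(x, flow.log_prob(x))       # exact log-density of that sample
```

Because both directions are cheap and exact, sampling and likelihood evaluation are fast; practical flows stack many learned invertible layers, and each contributes its own log-determinant term in the same way.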
VNFlow: integration of variational autoencoders and normalizing flows for novel molecular design. Journal of Cheminformatics 17(1), published 2025-10-24. DOI: 10.1186/s13321-025-01104-2. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01104-2
Pub Date: 2025-10-23; DOI: 10.1186/s13321-025-01082-5
Adrián Rošinec, Terézia Slanináková, Tomáš Pavlík, Róbert Randiak, Tomáš Svoboda, Tomáš Raček, Gabriela Bučeková, Ondřej Schindler, Matej Antol, Aleš Křenek, Karel Berka, Radka Svobodová
The volume of molecular dynamics (MD) simulation data shared via public repositories is rapidly increasing; however, fragmentation across multiple independent repositories, each employing distinct dataset identifiers and metadata schemas, hinders the efficient exploration and reuse of these data. In this study, we present GROMACS MetaDump, a tool for automatic annotation of output files from GROMACS MD simulations producing human- and machine-readable metadata leveraging the native GROMACS metadata (gmx dump) output. The tool takes the run input simulation file (.tpr) as the basis for the metadata output, optionally extending it with annotations from topology and structure files (.top, .gro). The tool is available as a web application (https://gmd.ceitec.cz/), API service, and a command-line utility. By automating the metadata extraction process, GROMACS MetaDump aims to simplify and standardise the extraction of rich, structured metadata from GROMACS MD simulations, making it easier to share, discover, and reuse simulation data within the research community.
This work introduces GROMACS MetaDump, a software tool for the automatic extraction of metadata from molecular dynamics (MD) simulations performed with GROMACS. GROMACS MetaDump captures all extractable simulation parameters, such as software version, force field, water model, box geometry, temperature, etc., and returns them in a structured JSON or YAML file. As a result, GROMACS MetaDump supports the creation of unified metadata annotations of MD simulations, making datasets indexable and findable in line with the FAIR principles.
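To make the extraction step concrete, the sketch below turns a few "key = value" lines of the kind printed by gmx dump into a JSON document. It is not the GROMACS MetaDump implementation: the sample text is a hand-written, heavily simplified stand-in for real gmx dump output (which is nested and far longer), and the function names are hypothetical.

```python
import json

# Hand-written excerpt imitating flat "key = value" lines; an assumption,
# not captured output from a real .tpr file.
SAMPLE_DUMP = """\
inputrec:
   integrator                     = md
   dt                             = 0.002
   nsteps                         = 50000
   tcoupl                         = V-rescale
"""


def parse_flat_parameters(dump_text: str) -> dict[str, str]:
    """Collect 'key = value' pairs; section headers and other lines are skipped."""
    params = {}
    for line in dump_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip():
            params[key.strip()] = value.strip()
    return params


def to_json(params: dict[str, str]) -> str:
    """Serialize the extracted parameters as machine-readable JSON."""
    return json.dumps(params, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(to_json(parse_flat_parameters(SAMPLE_DUMP)))
```

Values are kept as strings here; a fuller extractor would also need to handle nested sections, typed values, and the topology and structure files mentioned in the abstract.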
Gromacs MetaDump: a tool for extracting GROMACS simulation metadata. Journal of Cheminformatics 17(1), published 2025-10-23. DOI: 10.1186/s13321-025-01082-5. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01082-5
Pub Date: 2025-10-21 DOI: 10.1186/s13321-025-01113-1
Hongtao Zhao, Eva Nittinger, Melissa A. Yu, Symon Gathiaka, W. Patrick Walters, Christian Tyrchan
Title: Correction: Enhanced Thompson sampling by roulette wheel selection for screening ultralarge combinatorial libraries. Journal of Cheminformatics 17(1).
Pub Date: 2025-10-15 DOI: 10.1186/s13321-025-01097-y
Ludovica Bono, Filippo Lunghini, Emanuela Sabato, Akash Deep Biswas, Angelica Mazzolari, Alessandro Pedretti, Andrea R. Beccari, Giulio Vistoli, Serena Vittorio
The prediction of drug metabolism by in silico techniques is attracting growing interest because large datasets can be processed, allowing the stability and safety of new drug candidates to be evaluated during the early stages of drug discovery. To date, in silico models for metabolism prediction mainly exploit the ligand-based (LB) properties of the training molecules to predict the occurrence of a given metabolic reaction and/or the reactive site involved in the biotransformation. However, recent reports have highlighted that structure-based (SB) modeling can be conveniently integrated with LB methods for drug metabolism prediction, with the advantage of predicting whether a given molecule can fit the enzyme active site and which moiety approaches the catalytic residues. Herein, we developed machine learning models for UDP-glucuronosyltransferase (UGT)-mediated metabolism using both LB and SB methods. In particular, this study focused on the UGT2B7 and UGT2B15 isoforms, which are involved in the clearance of many drugs as well as in clinically relevant drug-drug interactions. First, molecular dynamics (MD) and docking simulations were combined to explore the binding mechanism of cofactor and substrate within the catalytic pocket of the studied UGT isoforms, exploiting their AlphaFold structures. The analysis of the MD trajectories allowed an appropriate conformation of both UGT isoforms to be identified for the development of binary classification models. For this purpose, the Random Forest algorithm and the metabolic data extracted from the MetaQSAR database were used. SB models were trained on a set of scoring functions and protein–ligand interaction fingerprints derived from docking, while the LB models were built on a set of physicochemical and constitutional descriptors. When the single models were evaluated, the LB classifiers outperformed the SB models. However, applying a consensus strategy improved prediction accuracy compared with the individual models, highlighting that LB and SB approaches convey complementary information whose aggregation yields better predictions than either single model.
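A consensus of two binary classifiers can be sketched as follows. The abstract does not state which consensus rule the authors used, so the mean-probability rule, the function name `consensus_predict`, the threshold, and the example scores are all assumptions made for illustration.

```python
def consensus_predict(p_lb, p_sb, threshold=0.5):
    """Average the class-1 probabilities of two models and threshold the mean.

    p_lb, p_sb: per-molecule probabilities from a ligand-based and a
    structure-based classifier (hypothetical inputs, same molecule order).
    """
    if len(p_lb) != len(p_sb):
        raise ValueError("both models must score the same molecules")
    return [int((a + b) / 2 >= threshold) for a, b in zip(p_lb, p_sb)]


# Invented probabilities for three molecules, not data from the study.
p_lb = [0.9, 0.4, 0.2]
p_sb = [0.7, 0.8, 0.1]
labels = consensus_predict(p_lb, p_sb)
print(labels)  # prints [1, 1, 0]
```

In the second molecule the two models disagree (0.4 vs 0.8) and the averaged score decides the label, which is the sense in which complementary information is aggregated in this sketch.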
Title: Prediction of UGT-mediated phase II metabolism via ligand- and structure-based predictive models. Journal of Cheminformatics 17(1).