Journal of Cheminformatics: Latest Articles (Page 6)

Enhancing multi-task in vivo toxicity prediction via integrated knowledge transfer of chemical knowledge and in vitro toxicity information

IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2025-11-12 DOI: 10.1186/s13321-025-01110-4

Minsu Park, Yewon Shin, Hyunho Kim, Hojung Nam

The evaluation of potential drug toxicity is a crucial step in early drug development. In vivo toxicity assessment represents a key challenge that must be addressed before advancing to clinical trials. However, traditional in vivo experiments primarily rely on animal models, raising concerns regarding cost, time efficiency, and ethical considerations. To address these challenges, various computational approaches have been developed to support in vivo toxicity evaluations, though these methods often demonstrate limited generalizability due to data scarcity. In this study, we propose MT-Tox, a knowledge transfer-based multi-task learning model specifically designed for in vivo toxicity prediction that overcomes data scarcity. Our model implements a sequential knowledge transfer strategy across three stages: general chemical knowledge pretraining, in vitro toxicological auxiliary training, and in vivo toxicity fine-tuning. This hierarchical approach significantly improves model performance by systematically leveraging information from both chemical structure and toxicity data sources. MT-Tox outperforms baseline models across three in vivo toxicity endpoints: carcinogenicity, drug-induced liver injury (DILI), and genotoxicity. Through ablation studies and attention analyses, we demonstrate that each knowledge transfer technique makes meaningful contributions to the prediction process. Finally, we demonstrate the real-world application of our model as a prediction tool for early-stage drug discovery through comprehensive DrugBank database screening.

Scientific contribution: We propose a knowledge transfer framework that integrates chemical and in vitro toxicological information to enhance in vivo toxicity prediction in low-data regimes. Our model provides dual-level interpretability across chemical and biological domains through an attention mechanism. Moreover, we demonstrate our model’s applicability by screening the DrugBank database, simulating practical toxicity screening scenarios in drug development.

Citations: 0

C2PO: an ML-powered optimizer of the membrane permeability of cyclic peptides through chemical modification

IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2025-11-11 DOI: 10.1186/s13321-025-01109-x

Roy Aerts, Joris Tavernier, Alan Kerstjens, Mazen Ahmad, Jose Carlos Gómez-Tamayo, Gary Tresadern, Hans De Winter

Peptide drug development is currently receiving due attention as a modality between small and large molecules. Therapeutic peptides offer an opportunity to achieve high potency and selectivity and to reach intracellular targets. A new era in the development of therapeutic peptides emerged with the arrival of cyclic peptides, which avoid the limitations of parenteral administration by achieving sufficient oral bioavailability. However, improving the membrane permeability of cyclic peptides remains one of the principal bottlenecks. Here, we introduce a deep learning regression model of cyclic peptide membrane permeability based on publicly available data. The model starts with a chemical structure and goes beyond limited-vocabulary language models to generalize to monomers outside the training dataset. Moreover, we introduce an efficient estimator2generative wrapper that enables the model to be used for direct molecular optimization of membrane permeability via chemical modification. We name our application C2PO (Cyclic Peptide Permeability Optimizer). Lastly, we demonstrate how a molecule correction tool can be used to limit the presence of unfamiliar chemistry in the generated molecules.

Scientific contribution: We provide an ML-driven optimizer application, named C2PO, that returns structurally modified cyclic peptides with improved membrane permeability, a pivotal task in drug discovery and development. C2PO is a first-in-class application for improving cyclic peptide permeability, in that it converts an ML model into a generative optimizer of chemical structures. Additionally, we demonstrate and encourage the use of an automated post-correction tool with a chemistry reference library to correct unusual chemistry in C2PO outputs, a known issue for ML-generated chemical structures.
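The idea of wrapping a property estimator so that it drives structure optimization can be sketched as a greedy search. This is only an illustration under stated assumptions, not the C2PO algorithm: the abstract does not describe how the estimator2generative wrapper works, so the function `optimize`, the toy scorer `toy_predict`, and the toy modification generator `toy_propose` are all hypothetical, and peptides are reduced to strings of one-letter monomer codes.

```python
from typing import Callable, Iterable


def optimize(start: str,
             predict: Callable[[str], float],
             propose: Callable[[str], Iterable[str]],
             max_steps: int = 10) -> str:
    """Greedy hill climbing: accept the best-scoring modification while it improves.

    `predict` stands in for a trained permeability estimator and `propose`
    for a generator of chemically modified variants (both hypothetical).
    """
    current, current_score = start, predict(start)
    for _ in range(max_steps):
        candidates = list(propose(current))
        if not candidates:
            break
        best = max(candidates, key=predict)
        best_score = predict(best)
        if best_score <= current_score:
            break  # no candidate improves the predicted property
        current, current_score = best, best_score
    return current


def toy_predict(peptide: str) -> float:
    """Placeholder score (counts 'L' monomers); NOT a real permeability model."""
    return float(peptide.count("L"))


def toy_propose(peptide: str) -> list[str]:
    """All single-position substitutions drawn from a two-letter toy alphabet."""
    variants = []
    for i, old in enumerate(peptide):
        for new in "AL":
            if new != old:
                variants.append(peptide[:i] + new + peptide[i + 1:])
    return variants


if __name__ == "__main__":
    print(optimize("AAA", toy_predict, toy_propose))
```

A real optimizer would replace the toy functions with a trained regression model and a chemistry-aware modification step, and would probably need a less myopic search than greedy acceptance.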

Citations: 0

HighFold-MeD: a Rosetta distillation model to accelerate structure prediction of cyclic peptides with backbone N-methylation and d-amino acids

IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2025-11-10 DOI: 10.1186/s13321-025-01111-3

Zhigang Cao, Sen Cao, Linghong Wang, Zhiguo Wang, Qingyi Mao, Jingjing Guo, Hongliang Duan

Abstract

Cyclic peptides with backbone N-methylated amino acids (BNMeAAs) and D-amino acids (D-AAs) have gained increasing attention for their stability, membrane permeability, and other therapeutic potential. Currently, Rosetta simple_cycpep_predict (SCP) can predict their structures through energy-based calculations, but this approach is computationally intensive and time-consuming. Moreover, the available crystal structures of such cyclic peptides remain highly limited, hindering the development of data-driven structure prediction models. To address these challenges, we propose HighFold-MeD, a deep learning-based framework that distills knowledge from Rosetta SCP by fine-tuning the AlphaFold model. First, a cyclic peptide structure dataset is constructed using Rosetta SCP by sampling large numbers of conformations for cyclic peptides with BNMeAAs and D-AAs and evaluating their energy scores. The AlphaFold model is then fine-tuned to incorporate an extended set of 56 BNMeAAs and D-AAs. In addition, a relative position cyclic matrix is introduced to explicitly model head-to-tail cyclization. Finally, a force field is employed to minimize steric clashes in the predicted structures. Empirical experiments demonstrate that HighFold-MeD achieves accuracy comparable to that of Rosetta on datasets sampled with Rosetta's SCP module, using the key parameters nstruct = 500 and cyclic_peptide: genkic_closure_attempts = 1000, while accelerating structure prediction 50-fold, thereby significantly expediting the development of cyclic peptide-based therapeutics.

Scientific contribution

We propose HighFold-MeD, which provides a rapid and relatively accurate approach for predicting the structures of cyclic peptides containing backbone N-methylated amino acids and D-amino acids—key building blocks in peptide drug design. By distilling the knowledge of Rosetta SCP under specific parameters into a fine-tuned AlphaFold framework, our method achieves a 50-fold acceleration while maintaining relatively high accuracy, thereby enabling large-scale cyclic peptide drug design.
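The relative position cyclic matrix mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the exact definition used in HighFold-MeD, so the version below is an assumption: for a head-to-tail cyclic peptide of length n, entry (i, j) holds the signed shortest offset from residue i to residue j around the ring. The function name `cyclic_offset_matrix` is hypothetical.

```python
def cyclic_offset_matrix(n: int) -> list[list[int]]:
    """Signed shortest ring offset from residue i to residue j (assumed definition)."""
    matrix = []
    for i in range(n):
        row = []
        for j in range(n):
            d = (j - i) % n      # forward steps around the ring, in 0..n-1
            if d > n // 2:       # stepping backwards is shorter
                d -= n
            row.append(d)
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    for row in cyclic_offset_matrix(6):
        print(row)
```

In a linear encoding the first and last residues of a hexapeptide would be five positions apart; here they are direct neighbours (offsets of -1 and +1), which is the property a head-to-tail cyclization encoding needs to capture.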

Citations: 0

Accurate structure-activity relationship prediction of antioxidant peptides using a multimodal deep learning framework

IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2025-11-03 DOI: 10.1186/s13321-025-01102-4

Huynh Anh Duy, Tarapong Srisongkram

Antioxidant peptides (AOPs) have emerged as promising peptide agents due to their efficacy in counteracting oxidative stress-related diseases and their applicability in the functional food and cosmetic industries. In this study, we developed a comprehensive quantitative structure-activity relationship (QSAR) model using a multimodal deep learning framework that integrates six sequence-based structure representations with stacking ensemble neural architectures (convolutional neural networks, bidirectional long short-term memory, and Transformer) to enhance predictive accuracy. Additionally, we employed a generative model to design novel AOP candidates, which were subsequently evaluated using the best-performing QSAR model. Remarkably, the stacking models using one-hot encoding achieved outstanding predictive metrics, with accuracy, AUROC, and AUPRC values surpassing 0.90 and an MCC above 0.80, demonstrating a highly accurate and robust QSAR model. SHAP analysis highlighted that proline, leucine, alanine, tyrosine, and glycine are the top five residues that positively influence antioxidant activity, whereas methionine, cysteine, tryptophan, asparagine, and threonine negatively impact antioxidant activity. Finally, 604 high-confidence AOPs were computationally identified. This study demonstrates that the multimodal framework improves the accuracy, robustness, and interpretability of AOP prediction. It also enables the efficient discovery of high-potential AOPs, thereby offering a powerful pipeline for accelerating peptide discovery in pharmaceutical and functional applications.

Scientific contribution: This study introduces a novel multimodal QSAR framework that combines diverse sequence representations with stacked neural networks to improve antioxidant peptide prediction. By integrating SHAP analysis and generative modeling, it enables interpretable feature attribution and the discovery of 604 high-potential peptides. The framework achieves strong predictive performance and offers a reproducible, efficient pipeline for accelerating antioxidant peptide discovery in pharmaceutical and functional food applications.
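One-hot encoding, the sequence representation the abstract reports as performing best, can be sketched as follows. The padding length, the handling of non-standard residues, and the names `AMINO_ACIDS`, `INDEX`, and `one_hot_encode` are assumptions made for illustration; the abstract does not specify these details.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence: str, max_len: int) -> list[list[int]]:
    """Encode a peptide as a max_len x 20 binary matrix, zero-padded at the end.

    Residues outside the 20-letter alphabet become all-zero rows (an assumption).
    """
    rows = []
    for aa in sequence[:max_len].upper():
        row = [0] * len(AMINO_ACIDS)
        if aa in INDEX:
            row[INDEX[aa]] = 1
        rows.append(row)
    while len(rows) < max_len:
        rows.append([0] * len(AMINO_ACIDS))  # padding row
    return rows


if __name__ == "__main__":
    encoded = one_hot_encode("PLAY", max_len=6)
    print(len(encoded), len(encoded[0]))
```

The resulting fixed-size matrix is the kind of input a convolutional or recurrent network can consume directly.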

Citations: 0

Measuring Chemical LLM robustness to molecular representations: a SMILES variation-based framework

IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2025-10-30 DOI: 10.1186/s13321-025-01079-0

Veronika Ganeeva, Kuzma Khrabrov, Artur Kadurin, Elena Tutubalina

The recent integration of natural language processing into chemistry has advanced drug discovery. Molecule representations in language models (LMs) are crucial to enhancing chemical understanding. We explored the ability of models to match the same chemical structures despite their different representations. Recognizing the same substance in different representations is an important component of emulating an understanding of how chemistry works. We propose Augmented Molecular Retrieval (AMORE), a flexible zero-shot framework for assessing chemistry LMs of different natures. The framework is based on SMILES augmentations that preserve the underlying chemical structure. The proposed method compares the embedding representation of a molecule with that of its SMILES variation and with that of another molecule. Experiments indicate that the tested ChemLLMs are still not robust to different SMILES representations. We evaluated the models on various tasks, including molecular captioning on the ChEBI-20 benchmark and the classification and regression tasks of the MoleculeNet benchmark. We show that the changes in results after SMILES string variations align with the proposed AMORE framework.

Scientific contribution: We propose AMORE, an evaluation framework based on the internal embedding space of chemical language models (ChemLMs). AMORE uses augmentation techniques that reformulate the SMILES strings of molecular structures in order to assess chemical representations. The proposed framework allows ChemLMs to be evaluated without costly manually annotated data.
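The retrieval idea behind such an evaluation can be sketched with plain vectors. This is a minimal illustration, not the AMORE implementation: it assumes embeddings have already been computed by some model, uses cosine similarity and top-1 accuracy as the (assumed) metric, and the names `cosine` and `top1_retrieval_accuracy` are hypothetical. The vectors in the demo are made up.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top1_retrieval_accuracy(originals: list[list[float]],
                            augmented: list[list[float]]) -> float:
    """Fraction of augmented embeddings whose nearest original is their own molecule.

    augmented[i] is assumed to embed a SMILES variation of the molecule
    embedded in originals[i].
    """
    if not augmented or len(originals) != len(augmented):
        raise ValueError("need two non-empty lists of equal length")
    hits = 0
    for i, query in enumerate(augmented):
        best = max(range(len(originals)), key=lambda j: cosine(query, originals[j]))
        hits += int(best == i)
    return hits / len(augmented)


if __name__ == "__main__":
    originals = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # made-up embeddings
    augmented = [[0.9, 0.1], [0.1, 0.9], [0.2, 1.0]]   # third one drifts away
    print(top1_retrieval_accuracy(originals, augmented))
```

A model that is robust to SMILES rewriting would score close to 1.0; in the demo, the third augmented embedding lands nearer to a different molecule, so the score drops.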

Citations: 0

Efficient decoy selection to improve virtual screening using machine learning models

IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2025-10-30 DOI: 10.1186/s13321-025-01107-z

Felipe Victoria-Muñoz, Janosch Menke, Norberto Sanchez-Cruz, Oliver Koch

Machine learning models using protein-ligand interaction fingerprints show promise as target-specific scoring functions in drug discovery, but their performance critically depends on the underlying decoy selection strategies. Recognizing this critical role in model performance, we analyzed various decoy selection strategies to enhance machine learning models based on the Protein per Atom Score Contributions Derived Interaction Fingerprint (PADIF). We explored three distinct workflows for decoy selection: (1) random selection from extensive databases such as ZINC15, (2) leveraging recurrent non-binders from high-throughput screening (HTS) assays stored as dark chemical matter, and (3) data augmentation using diverse conformations from docking results. Active molecules from ChEMBL, combined with these decoy approaches, were used to train and test different machine learning models based on PADIF. The final validation was performed on experimentally determined inactive compounds from the LIT-PCBA dataset. Our findings reveal that models trained with random selections from ZINC15 and with compounds from dark chemical matter closely mimic the performance of those trained with actual non-binders, presenting viable alternatives for creating accurate models in the absence of specific inactivity data. Furthermore, all models showed an enhanced ability to explore new chemical spaces for their specific target and improved the selection of top-ranked active compounds relative to classical scoring functions, thereby boosting the screening power of molecular docking. These findings demonstrate that appropriate decoy selection strategies can maintain model accuracy while extending applicability to targets that lack extensive experimental data.

Scientific contribution: This study expands the decoy options available for building new machine learning classification models of bioactivity from public datasets. Using a molecular docking protocol and the PADIF representation, we identify the benefits of this approach for improving binder selection in molecular docking.
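The first workflow, random decoy selection from a large compound pool, can be sketched as below. This is a simplified illustration under stated assumptions: molecules are represented by identifier strings that are taken to be already canonicalized (so string equality means the same compound), the decoy-to-active ratio is an illustrative parameter not taken from the abstract, and the function name `select_random_decoys` is hypothetical.

```python
import random


def select_random_decoys(actives: list[str], pool: list[str],
                         ratio: int = 4, seed: int = 0) -> list[str]:
    """Draw up to ratio * len(actives) decoys at random from the pool.

    Known actives are excluded; the pool is de-duplicated and sorted first so
    that a fixed seed gives a reproducible selection.
    """
    active_set = set(actives)
    candidates = sorted(set(pool) - active_set)
    n = min(len(candidates), ratio * len(active_set))
    return random.Random(seed).sample(candidates, n)


if __name__ == "__main__":
    actives = ["CCO", "c1ccccc1"]
    pool = ["CCO", "CCN", "CCC", "CCCl", "CCBr", "CO", "CN", "CC", "C", "N", "O"]
    print(select_random_decoys(actives, pool, ratio=2, seed=1))
```

In practice the pool would be a database export, and the selected decoys would then be docked and converted into interaction fingerprints before model training.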

Citations: 0

Enhancing molecular property prediction through data integration and consistency assessment

IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2025-10-29 DOI: 10.1186/s13321-025-01103-3

Raquel Parrondo-Pizarro, Luca Menestrina, Ricard Garcia-Serna, Adrià Fernández-Torras, Jordi Mestres

Data heterogeneity and distributional misalignments pose critical challenges for machine learning models, often compromising predictive accuracy. These challenges are exemplified in preclinical safety modeling, a crucial step in early-stage drug discovery where limited data and experimental constraints exacerbate integration issues. Analyzing public ADME datasets, we uncovered significant misalignments as well as inconsistent property annotations between gold-standard and popular benchmark sources, such as Therapeutic Data Commons. These dataset discrepancies, which can arise from differences in various factors, including experimental conditions in data collection as well as chemical space coverage, can introduce noise and ultimately degrade model performance. Data standardization, despite harmonizing discrepancies and increasing the training set size, may not always lead to an improvement in predictive performance. This highlights the importance of rigorous data consistency assessment (DCA) prior to modeling. To facilitate a systematic DCA across diverse datasets, we developed AssayInspector, a model-agnostic package that leverages statistics, visualizations, and diagnostic summaries to identify outliers, batch effects, and discrepancies. Beyond preclinical safety, DCA can play a crucial role in federated learning scenarios, enabling effective transfer learning across heterogeneous data sources and supporting reliable integration across diverse scientific domains.

By systematically analyzing public ADME datasets, we found substantial distributional misalignments and annotation discrepancies between benchmark and gold-standard sources. These challenges are shown to undermine predictive modeling, as naive integration or standardization often degrades performance. To address these issues, we present AssayInspector, a tool that enables consistency assessment and informed data integration, providing a foundation for more reliable predictive modeling in drug discovery.
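As a rough illustration of what a data consistency assessment might check before two sources are merged, the sketch below compares two hypothetical datasets keyed by molecule identifier. It flags shared molecules whose annotations differ by more than a tolerance and computes a two-sample Kolmogorov-Smirnov statistic as a simple measure of distributional misalignment. This is not AssayInspector's API: the function names, the tolerance, and the toy values are assumptions made for the example.

```python
from bisect import bisect_right


def conflicting_annotations(source_a, source_b, tolerance=0.5):
    """Return {molecule_id: (value_a, value_b)} for shared molecules
    whose annotated values differ by more than `tolerance`."""
    shared = source_a.keys() & source_b.keys()
    return {
        mol_id: (source_a[mol_id], source_b[mol_id])
        for mol_id in sorted(shared)
        if abs(source_a[mol_id] - source_b[mol_id]) > tolerance
    }


def ks_statistic(values_a, values_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the two empirical cumulative distribution functions."""
    a, b = sorted(values_a), sorted(values_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Toy data: hypothetical identifiers and property values.
gold_standard = {"mol1": 1.2, "mol2": 0.4, "mol3": 2.9, "mol4": 1.8}
benchmark = {"mol2": 0.5, "mol3": 4.1, "mol5": 3.7, "mol6": 4.4}

conflicts = conflicting_annotations(gold_standard, benchmark)
shift = ks_statistic(gold_standard.values(), benchmark.values())
print(conflicts)  # shared molecules with inconsistent annotations
print(shift)      # 0 means identical distributions, 1 means disjoint
```

A large statistic or a non-empty conflict table would be a signal to inspect the sources before pooling them, which is the kind of decision the abstract argues should precede modeling.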

Citations: 0

HTA - An open-source software for assigning head and tail positions to monomer SMILES in polymerization reactions

IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2025-10-28 DOI: 10.1186/s13321-025-01098-x

Brenda de Souza Ferrari, Ronaldo Giro, Mathias B. Steiner

Artificial Intelligence (AI) techniques are transforming the computational discovery and design of polymers. The key enablers for polymer informatics are machine-readable molecular string representations of the building blocks of a polymer, i.e., the monomers. In monomer strings, such as SMILES, symbols at the head and tail atoms indicate the locations of bond formation during polymerization. Since the linking of monomers determines a polymer’s properties, the performance of AI prediction models will, ultimately, be limited by the accuracy of the head and tail assignments in the monomer SMILES. Considering the large number of polymer precursors available in chemical databases, reliable methods for the automated assignment of head and tail atoms are needed. Here, we report a method for assigning head and tail atoms in monomer SMILES by analyzing the reactivity of their functional groups based on the atomic index of nucleophilicity. In a reference data set containing 206 polymer precursors, the HeadTailAssign (HTA) algorithm correctly predicted the polymer class of 204 monomer SMILES, achieving an accuracy of 99%. The head and tail atoms were correctly assigned to 187 monomer SMILES, representing an accuracy of 91%. The HTA code is available for validation and reuse at https://github.com/IBM/HeadTailAssign.

Scientific contribution: The algorithm was successfully applied to data pre-processing by tagging the linkage bonds in monomers for defining the repeat units in polymerization reactions.
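The two accuracy figures reported for the 206-precursor reference set can be checked with a few lines of arithmetic. The snippet below uses only the counts stated in the abstract; the helper name is an assumption made for illustration.

```python
def accuracy_percent(correct, total):
    """Share of correctly handled items, as a percentage."""
    return 100.0 * correct / total


TOTAL_PRECURSORS = 206  # size of the reference data set

class_accuracy = accuracy_percent(204, TOTAL_PRECURSORS)      # polymer class
head_tail_accuracy = accuracy_percent(187, TOTAL_PRECURSORS)  # head/tail atoms

print(f"{class_accuracy:.1f}")      # 99.0
print(f"{head_tail_accuracy:.1f}")  # 90.8
```

The head and tail figure is therefore 90.8% before rounding, which the abstract reports as 91%.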

Citations: 0

VNFlow: integration of variational autoencoders and normalizing flows for novel molecular design

IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2025-10-24 DOI: 10.1186/s13321-025-01104-2

Jiří Hostaš, Mohammad S. Ghaemi, Hang Hu, Junan Lin, Anguang Hu, Hsu K. Ooi

Generative Artificial Intelligence is transforming molecular discovery by enabling exploration of the vast, largely unexplored chemical space. However, current methods, including normalizing flows, struggle to balance the optimization of complex objectives and sampling speed, particularly when generating specific compound classes and more intricate scaffolds, such as aromatic rings. This work developed a generative model that efficiently samples novel molecules while optimizing their drug-likeness, ease of synthesis or chemical reactivity. To achieve this, we employed normalizing flows combined with variational autoencoders to generate samples, which were evaluated for the Quantitative Estimate of Drug-likeness, the Synthetic Accessibility scores and, in the case of organofluorine-phosphates, electronic density on the central phosphorus atom, approximated by Hirshfeld charges calculated with density functional theory. Our framework efficiently generated a diverse range of organofluorine-phosphates, demonstrating that combining normalizing flows directly with SELFIES or group-SELFIES can address key limitations in inverse molecular design, particularly when variational autoencoders cannot be applied due to a lack of available training data. Normalizing flows capture the chemical structures in a holistic way, which paves the way towards targeted therapies that enable the optimization of complex molecular objectives.
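For readers unfamiliar with normalizing flows, the change-of-variables rule they rely on can be shown in one dimension. The sketch below is a generic textbook illustration with a single affine transform, not the VNFlow architecture; the class name and parameter values are assumptions chosen for the example.

```python
import math
import random


class AffineFlow1D:
    """Invertible map x = scale * z + shift over a standard normal base."""

    def __init__(self, scale, shift):
        if scale <= 0:
            raise ValueError("scale must be positive so the map is invertible")
        self.scale = scale
        self.shift = shift

    def inverse(self, x):
        """Map a data point back to the base (latent) variable z."""
        return (x - self.shift) / self.scale

    def log_prob(self, x):
        """Exact log-density: log p_X(x) = log p_Z(z) + log |dz/dx|."""
        z = self.inverse(x)
        log_base = -0.5 * (z * z + math.log(2.0 * math.pi))
        return log_base - math.log(self.scale)  # dz/dx = 1 / scale

    def sample(self, rng):
        """Draw z from the base density and push it through the map."""
        return self.scale * rng.gauss(0.0, 1.0) + self.shift


flow = AffineFlow1D(scale=2.0, shift=1.0)
rng = random.Random(0)
samples = [flow.sample(rng) for _ in range(5)]
print(samples)
print(flow.log_prob(1.0))
```

Sampling costs a single forward pass and the likelihood is exact, which is why flows are attractive when sampling speed matters; practical models stack many learned invertible layers in place of this one fixed transform.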

Citations: 0

Gromacs MetaDump: a tool for extracting GROMACS simulation metadata

IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2025-10-23 DOI: 10.1186/s13321-025-01082-5

Adrián Rošinec, Terézia Slanináková, Tomáš Pavlík, Róbert Randiak, Tomáš Svoboda, Tomáš Raček, Gabriela Bučeková, Ondřej Schindler, Matej Antol, Aleš Křenek, Karel Berka, Radka Svobodová

The volume of molecular dynamics (MD) simulation data shared via public repositories is rapidly increasing; however, fragmentation across multiple independent repositories, each employing distinct dataset identifiers and metadata schemas, hinders the efficient exploration and reuse of these data. In this study, we present GROMACS MetaDump, a tool for automatic annotation of output files from GROMACS MD simulations, producing human- and machine-readable metadata by leveraging the native GROMACS metadata (gmx dump) output. The tool takes the run input simulation file (.tpr) as the basis for the metadata output, optionally extending it with annotations from topology and structure files (.top, .gro). The tool is available as a web application (https://gmd.ceitec.cz/), API service, and a command-line utility. By automating the metadata extraction process, GROMACS MetaDump aims to simplify and standardise the extraction of rich, structured metadata from GROMACS MD simulations, making it easier to share, discover, and reuse simulation data within the research community.

Scientific contribution: This work introduces GROMACS MetaDump, a software tool for the automatic extraction of metadata from molecular dynamics (MD) simulations performed with GROMACS. GROMACS MetaDump captures all extractable simulation parameters, such as software version, force field, water model, box geometry, temperature, etc., and returns them in a structured JSON or YAML file. As a result, GROMACS MetaDump supports the creation of unified metadata annotations of MD simulations, making datasets indexable and findable in line with the FAIR principles.
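The general idea of turning `gmx dump` output into structured metadata can be sketched as follows. This is not the GROMACS MetaDump implementation: the parser only handles flat `key = value` lines, the helper names are hypothetical, and the sample text is a hand-written excerpt in the style of `gmx dump` output rather than real program output.

```python
import json
import subprocess


def dump_tpr(tpr_path):
    """Run `gmx dump -s <file.tpr>` and return its text output.
    Requires a GROMACS installation with `gmx` on PATH."""
    result = subprocess.run(
        ["gmx", "dump", "-s", tpr_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_key_values(dump_text, wanted_keys):
    """Collect the first occurrence of each wanted `key = value` line."""
    metadata = {}
    for line in dump_text.splitlines():
        key, sep, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if sep and key in wanted_keys and key not in metadata:
            metadata[key] = value
    return metadata


# Hand-written sample (assumption); real output is far longer.
SAMPLE_DUMP = """\
inputrec:
   integrator                     = md
   dt                             = 0.002
   nsteps                         = 500000
   tcoupl                         = V-rescale
"""

metadata = parse_key_values(SAMPLE_DUMP, {"integrator", "dt", "nsteps", "tcoupl"})
print(json.dumps(metadata, indent=2, sort_keys=True))
```

With GROMACS installed, `parse_key_values(dump_tpr("topol.tpr"), ...)` would apply the same parsing to a real run input file, where `topol.tpr` is a placeholder path. Values are kept as strings here; a fuller tool would also need to handle nested sections and typed values.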

Citations: 0