Journal of Cheminformatics: Latest Articles

HighFold-MeD: a Rosetta distillation model to accelerate structure prediction of cyclic peptides with backbone N-methylation and d-amino acids
IF 5.7; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2025-11-10; DOI: 10.1186/s13321-025-01111-3
Zhigang Cao, Sen Cao, Linghong Wang, Zhiguo Wang, Qingyi Mao, Jingjing Guo, Hongliang Duan

Abstract

Cyclic peptides with backbone N-methylated amino acids (BNMeAAs) and D-amino acids (D-AAs) have gained increasing attention for their stability, membrane permeability, and broader therapeutic potential. Currently, Rosetta simple_cycpep_predict (SCP) can predict their structures through energy-based calculations, but this approach is computationally intensive and time-consuming. Moreover, the available crystal structures of such cyclic peptides remain highly limited, hindering the development of data-driven structure prediction models. To address these challenges, we propose HighFold-MeD, a deep learning-based framework that distills knowledge from Rosetta SCP by fine-tuning the AlphaFold model. First, a cyclic peptide structure dataset is constructed using Rosetta SCP by sampling a large number of conformations for cyclic peptides with BNMeAAs and D-AAs and evaluating their energy scores. The AlphaFold model is then fine-tuned to incorporate an extended set of 56 BNMeAAs and D-AAs. In addition, a relative position cyclic matrix is introduced to explicitly model head-to-tail cyclization. Finally, a force field is employed to minimize steric clashes in the predicted structures. Experiments show that HighFold-MeD achieves accuracy comparable to that of Rosetta on datasets sampled with the Rosetta SCP module (key parameters: nstruct = 500 and cyclic_peptide:genkic_closure_attempts = 1000) while accelerating structure prediction 50-fold, thereby significantly expediting the development of cyclic peptide-based therapeutics.

Scientific contribution

We propose HighFold-MeD, which provides a rapid and relatively accurate approach for predicting the structures of cyclic peptides containing backbone N-methylated amino acids and D-amino acids—key building blocks in peptide drug design. By distilling the knowledge of Rosetta SCP under specific parameters into a fine-tuned AlphaFold framework, our method achieves a 50-fold acceleration while maintaining relatively high accuracy, thereby enabling large-scale cyclic peptide drug design.
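
The abstract does not spell out how the relative position cyclic matrix is built. A minimal sketch of one common way to encode head-to-tail cyclization, assuming signed residue offsets that wrap around a ring of n residues, might look as follows. The function name `cyclic_offset_matrix` and the wrapping convention are illustrative assumptions, not the authors' implementation.

```python
def cyclic_offset_matrix(n):
    """Signed shortest-path residue offsets on a ring of n residues.

    Assumption for illustration: entry [i][j] is the offset from residue i
    to residue j, measured along the shorter direction around the ring.
    """
    matrix = []
    for i in range(n):
        row = []
        for j in range(n):
            d = (j - i) % n      # forward distance around the ring, 0..n-1
            if d > n // 2:       # walking backwards is shorter
                d -= n
            row.append(d)
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    for row in cyclic_offset_matrix(6):
        print(row)
```

With this convention the first and last residues of a six-residue ring are one step apart, whereas a linear relative-position encoding would place them five steps apart.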

Citations: 0
Accurate structure-activity relationship prediction of antioxidant peptides using a multimodal deep learning framework
IF 5.7; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2025-11-03; DOI: 10.1186/s13321-025-01102-4
Huynh Anh Duy, Tarapong Srisongkram

Antioxidant peptides (AOPs) have emerged as promising peptide agents due to their efficacy in counteracting oxidative stress-related diseases and their applicability in the functional food and cosmetic industries. In this study, we developed a comprehensive quantitative structure-activity relationship (QSAR) model using a multimodal deep learning framework that integrates 6 sequence-based structure representations with stacking ensemble neural architectures (convolutional neural networks, bidirectional long short-term memory, and Transformer) to enhance predictive accuracy. Additionally, we employed a generative model to design novel AOP candidates, which were subsequently evaluated using the best-performing QSAR model. The stacking models using one-hot encoding achieved accuracy, AUROC, and AUPRC values above 0.90 and an MCC above 0.80, demonstrating a highly accurate and robust QSAR model. SHAP analysis highlighted proline, leucine, alanine, tyrosine, and glycine as the top five residues that positively influence antioxidant activity, whereas methionine, cysteine, tryptophan, asparagine, and threonine negatively impact it. Finally, 604 high-confidence AOPs were computationally identified. This study demonstrates that the multimodal framework improves the accuracy, robustness, and interpretability of AOP prediction. It also enables the efficient discovery of high-potential AOPs, thereby offering a powerful pipeline for accelerating peptide discovery in pharmaceutical and functional applications.

Scientific contribution

This study introduces a novel multimodal QSAR framework that combines diverse sequence representations with stacked neural networks to improve the prediction of antioxidant peptides. By integrating SHAP analysis and generative modeling, it enables interpretable feature attribution and the discovery of 604 high-potential peptides. The framework achieves strong predictive performance and provides a reproducible, efficient pipeline for accelerating antioxidant peptide discovery in pharmaceutical and functional food applications.
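
The abstract reports that the best stacking models used one-hot encoding of the peptide sequence. As a rough illustration of what such an encoding looks like, the sketch below maps each residue to a 20-dimensional indicator vector and zero-pads to a fixed length. The alphabet ordering, the padding scheme, the function name `one_hot_encode`, and the example peptide are assumptions for illustration and are not taken from the paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes


def one_hot_encode(sequence, max_len):
    """Return a max_len x 20 matrix of 0/1 values; rows past the sequence end stay all-zero."""
    index = {aa: k for k, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for pos in range(max_len):
        row = [0] * len(AMINO_ACIDS)
        if pos < len(sequence):
            row[index[sequence[pos]]] = 1  # raises KeyError for non-standard residues
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    # "PLAYG" is a hypothetical peptide, chosen only to exercise the function.
    for row in one_hot_encode("PLAYG", max_len=8):
        print(row)
```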
Citations: 0
Measuring Chemical LLM robustness to molecular representations: a SMILES variation-based framework
IF 5.7; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2025-10-30; DOI: 10.1186/s13321-025-01079-0
Veronika Ganeeva, Kuzma Khrabrov, Artur Kadurin, Elena Tutubalina

The recent integration of natural language processing into chemistry has advanced drug discovery. Molecule representations in language models (LMs) are crucial to enhancing chemical understanding. We explored the ability of models to match the same chemical structures despite their different representations. Recognizing the same substance in different representations is an important component of emulating an understanding of how chemistry works. We propose Augmented Molecular Retrieval (AMORE), a flexible zero-shot framework for the assessment of chemistry LMs of different natures. The framework is based on SMILES augmentations that preserve the underlying chemical structure. The proposed method measures the similarity between the embedding representations of a molecule, its SMILES variation, and another molecule. Experiments indicate that the tested ChemLLMs are still not robust to different SMILES representations. We evaluated the models on various tasks, including molecular captioning on the ChEBI-20 benchmark and the classification and regression tasks of the MoleculeNet benchmark. We show that the changes in results after SMILES string variations align with the proposed AMORE framework.

Scientific contribution

We propose AMORE, an evaluation framework based on the internal embedding space of chemical language models (ChemLMs). AMORE uses augmentation techniques that reformulate the SMILES strings of molecular structures to assess chemical representations. The proposed framework allows ChemLMs to be evaluated without costly manually annotated data.
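
The abstract does not give the exact metric AMORE uses. One plausible way to score this kind of retrieval, assuming that embeddings for the original and the augmented SMILES strings are already available as plain vectors, is top-1 retrieval accuracy under cosine similarity. The function names and the toy vectors below are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top1_retrieval_accuracy(original, augmented):
    """Fraction of augmented embeddings whose nearest original embedding has the same index."""
    hits = 0
    for i, query in enumerate(augmented):
        best = max(range(len(original)), key=lambda j: cosine(query, original[j]))
        if best == i:
            hits += 1
    return hits / len(augmented)


if __name__ == "__main__":
    # Toy embeddings: the third augmented vector drifts onto the first molecule.
    original = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    augmented = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]]
    print(top1_retrieval_accuracy(original, augmented))
```

A model that is robust to SMILES variation would keep each augmented embedding closest to its own original, giving a score near 1.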
Citations: 0
Efficient decoy selection to improve virtual screening using machine learning models
IF 5.7; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2025-10-30; DOI: 10.1186/s13321-025-01107-z
Felipe Victoria-Muñoz, Janosch Menke, Norberto Sanchez-Cruz, Oliver Koch

Machine learning models using protein-ligand interaction fingerprints show promise as target-specific scoring functions in drug discovery, but their performance critically depends on the underlying decoy selection strategies. Given this critical role, we analyzed various decoy selection strategies to enhance machine learning models based on the Protein per Atom Score Contributions Derived Interaction Fingerprint (PADIF). We explored three distinct workflows for decoy selection: (1) random selection from extensive databases like ZINC15, (2) leveraging recurrent non-binders from high-throughput screening (HTS) assays stored as dark chemical matter, and (3) data augmentation by utilizing diverse conformations from docking results. Active molecules from ChEMBL, combined with these decoy approaches, were used to train and test different machine learning models based on PADIF. The final validation was performed on experimentally determined inactive compounds from the LIT-PCBA dataset. Our findings reveal that models trained with random selections from ZINC15 and with compounds from dark chemical matter closely mimic the performance of those trained with actual non-binders, presenting viable alternatives for creating accurate models in the absence of specific inactivity data. Furthermore, all models showed an enhanced ability to explore new chemical space for their specific target and improved the selection of top-ranked active compounds relative to classical scoring functions, thereby boosting the screening power of molecular docking. These findings demonstrate that appropriate decoy selection strategies can maintain model accuracy while extending applicability to targets that lack extensive experimental data.

Scientific contribution

This study expands the options available for selecting decoys from public datasets to build new machine learning classification models of bioactivity. Using a molecular docking protocol and the PADIF representation, we establish the benefits of this approach for improving the selection of binders in molecular docking.
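
The first workflow, random selection of decoys from a large database, can be sketched in a few lines. The decoy-to-active ratio, the exclusion of known actives from the pool, the fixed seed, and the function name `select_random_decoys` are all assumptions for illustration. The abstract does not state how the authors configured this step.

```python
import random


def select_random_decoys(pool, actives, ratio, seed=0):
    """Draw up to ratio * len(actives) decoys at random from pool, skipping known actives."""
    candidates = sorted(set(pool) - set(actives))  # sorted so the seed gives a stable draw
    k = min(len(candidates), ratio * len(actives))
    return random.Random(seed).sample(candidates, k)


if __name__ == "__main__":
    # Hypothetical molecule identifiers standing in for database entries.
    pool = [f"mol{i}" for i in range(100)]
    actives = ["mol0", "mol1"]
    print(select_random_decoys(pool, actives, ratio=5))
```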
Citations: 0
Enhancing molecular property prediction through data integration and consistency assessment
IF 5.7; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2025-10-29; DOI: 10.1186/s13321-025-01103-3
Raquel Parrondo-Pizarro, Luca Menestrina, Ricard Garcia-Serna, Adrià Fernández-Torras, Jordi Mestres

Data heterogeneity and distributional misalignments pose critical challenges for machine learning models, often compromising predictive accuracy. These challenges are exemplified in preclinical safety modeling, a crucial step in early-stage drug discovery where limited data and experimental constraints exacerbate integration issues. Analyzing public ADME datasets, we uncovered significant misalignments as well as inconsistent property annotations between gold-standard and popular benchmark sources, such as Therapeutic Data Commons. These dataset discrepancies, which can arise from differences in various factors, including experimental conditions in data collection as well as chemical space coverage, can introduce noise and ultimately degrade model performance. Data standardization, despite harmonizing discrepancies and increasing the training set size, may not always lead to an improvement in predictive performance. This highlights the importance of rigorous data consistency assessment (DCA) prior to modeling. To facilitate a systematic DCA across diverse datasets, we developed AssayInspector, a model-agnostic package that leverages statistics, visualizations, and diagnostic summaries to identify outliers, batch effects, and discrepancies. Beyond preclinical safety, DCA can play a crucial role in federated learning scenarios, enabling effective transfer learning across heterogeneous data sources and supporting reliable integration across diverse scientific domains.

Scientific contribution

By systematically analyzing public ADME datasets, we uncovered substantial distributional misalignments and annotation discrepancies between benchmark and gold-standard sources. These issues are shown to undermine predictive modeling, since naive integration or standardization often degrades performance. To address them, we present AssayInspector, a tool that enables consistency assessment and informed data integration, providing a foundation for more reliable predictive modeling in drug discovery.
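
To make the idea of a data consistency check concrete, the sketch below compares the property values that two sources report for the same molecules and flags those that disagree by more than a tolerance. It is not AssayInspector's API. The function name `find_discrepancies`, the dictionary layout, the tolerance, and the example values are hypothetical.

```python
def find_discrepancies(source_a, source_b, tolerance):
    """Return {molecule: (value_a, value_b)} for shared molecules whose values differ by more than tolerance."""
    shared = source_a.keys() & source_b.keys()
    return {
        mol: (source_a[mol], source_b[mol])
        for mol in sorted(shared)
        if abs(source_a[mol] - source_b[mol]) > tolerance
    }


if __name__ == "__main__":
    # Made-up property values keyed by SMILES, for illustration only.
    gold_standard = {"CCO": 1.2, "c1ccccc1": 2.0, "CC(=O)O": -0.2}
    benchmark = {"CCO": 1.25, "c1ccccc1": 3.1, "CN": 0.5}
    print(find_discrepancies(gold_standard, benchmark, tolerance=0.5))
```

Molecules that appear in only one source are ignored here. A fuller assessment would also compare the distributions and the chemical space covered by each source, as the abstract describes.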
Citations: 0
HTA - An open-source software for assigning head and tail positions to monomer SMILES in polymerization reactions
IF 5.7; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2025-10-28; DOI: 10.1186/s13321-025-01098-x
Brenda de Souza Ferrari, Ronaldo Giro, Mathias B. Steiner

Artificial Intelligence (AI) techniques are transforming the computational discovery and design of polymers. The key enablers for polymer informatics are machine-readable molecular string representations of the building blocks of a polymer, i.e., the monomers. In monomer strings, such as SMILES, symbols at the head and tail atoms indicate the locations of bond formation during polymerization. Since the linking of monomers determines a polymer's properties, the performance of AI prediction models will ultimately be limited by the accuracy of the head and tail assignments in the monomer SMILES. Considering the large number of polymer precursors available in chemical databases, reliable methods for the automated assignment of head and tail atoms are needed. Here, we report a method for assigning head and tail atoms in monomer SMILES by analyzing the reactivity of their functional groups based on the atomic index of nucleophilicity. In a reference data set containing 206 polymer precursors, the HeadTailAssign (HTA) algorithm correctly predicted the polymer class of 204 monomer SMILES, achieving an accuracy of 99%. The head and tail atoms were correctly assigned in 187 monomer SMILES, representing an accuracy of 91%. The HTA code is available for validation and reuse at https://github.com/IBM/HeadTailAssign.

The algorithm was successfully applied to data pre-processing by tagging the linkage bonds in monomers for defining the repeat units in polymerization reactions.
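
The two reported accuracies follow directly from the counts given in the abstract (204 and 187 correct out of 206 precursors). A short check of the arithmetic, with variable names chosen only for this illustration:

```python
TOTAL_PRECURSORS = 206


def accuracy(correct, total):
    """Fraction of correctly handled monomers."""
    return correct / total


class_accuracy = accuracy(204, TOTAL_PRECURSORS)      # polymer class prediction
head_tail_accuracy = accuracy(187, TOTAL_PRECURSORS)  # head and tail assignment

print(f"polymer class: {class_accuracy:.1%}")         # 99.0%, reported as 99%
print(f"head/tail assignment: {head_tail_accuracy:.1%}")  # 90.8%, reported as 91%
```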

Citations: 0
VNFlow: integration of variational autoencoders and normalizing flows for novel molecular design
IF 5.7; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2025-10-24; DOI: 10.1186/s13321-025-01104-2
Jiří Hostaš, Mohammad S. Ghaemi, Hang Hu, Junan Lin, Anguang Hu, Hsu K. Ooi

Generative Artificial Intelligence is transforming molecular discovery by enabling exploration of the vast, largely unexplored chemical space. However, current methods, including normalizing flows, struggle to balance the optimization of complex objectives against sampling speed, particularly when generating specific compound classes and more intricate scaffolds, such as aromatic rings. This work developed a generative model that efficiently samples novel molecules while optimizing their drug-likeness, ease of synthesis, or chemical reactivity. To achieve this, we employed normalizing flows combined with variational autoencoders to generate samples, which were evaluated for the Quantitative Estimate of Drug-likeness, Synthetic Accessibility scores, and, in the case of organofluorine-phosphates, the electronic density on the central phosphorus atom, approximated by Hirshfeld charges calculated with density functional theory. Our framework efficiently generated a diverse range of organofluorine-phosphates, demonstrating that combining normalizing flows directly with SELFIES or group-SELFIES can address key limitations in inverse molecular design, particularly when variational autoencoders cannot be applied due to a lack of available training data. Normalizing flows capture chemical structures in a holistic way, paving the way towards targeted therapies that enable the optimization of complex molecular objectives.
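
The abstract does not describe VNFlow's architecture. For readers unfamiliar with normalizing flows, the sketch below shows only the generic change-of-variables principle behind them, using the simplest possible case: a one-dimensional affine map applied to a standard normal base distribution. The function names and parameters are illustrative and are not taken from the paper.

```python
import math


def standard_normal_logpdf(z):
    """Log density of the base distribution N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def affine_flow_logpdf(x, scale, shift):
    """Log density of x = scale * z + shift with z ~ N(0, 1), via change of variables."""
    z = (x - shift) / scale                                   # invert the flow
    return standard_normal_logpdf(z) - math.log(abs(scale))  # subtract log|dx/dz|


if __name__ == "__main__":
    print(affine_flow_logpdf(1.0, scale=2.0, shift=1.0))
```

Practical flows stack many invertible layers with learned parameters, but each layer contributes a log-determinant term in the same way.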

Citations: 0
Gromacs MetaDump: a tool for extracting GROMACS simulation metadata
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-10-23 DOI: 10.1186/s13321-025-01082-5
Adrián Rošinec, Terézia Slanináková, Tomáš Pavlík, Róbert Randiak, Tomáš Svoboda, Tomáš Raček, Gabriela Bučeková, Ondřej Schindler, Matej Antol, Aleš Křenek, Karel Berka, Radka Svobodová

The volume of molecular dynamics (MD) simulation data shared via public repositories is rapidly increasing; however, fragmentation across multiple independent repositories, each employing distinct dataset identifiers and metadata schemas, hinders the efficient exploration and reuse of these data. In this study, we present GROMACS MetaDump, a tool for automatically annotating output files from GROMACS MD simulations. It produces human- and machine-readable metadata by leveraging the native GROMACS metadata (gmx dump) output. The tool takes the run input simulation file (.tpr) as the basis for the metadata output, optionally extending it with annotations from topology and structure files (.top, .gro). The tool is available as a web application (https://gmd.ceitec.cz/), an API service, and a command-line utility. By automating the metadata extraction process, GROMACS MetaDump aims to simplify and standardise the extraction of rich, structured metadata from GROMACS MD simulations, making it easier to share, discover, and reuse simulation data within the research community.

This work introduces GROMACS MetaDump, a software tool for the automatic extraction of metadata from molecular dynamics (MD) simulations performed with GROMACS. GROMACS MetaDump captures all extractable simulation parameters, such as software version, force field, water model, box geometry, temperature, etc., and returns them in a structured JSON or YAML file. As a result, GROMACS MetaDump supports the creation of unified metadata annotations of MD simulations, making datasets indexable and findable in line with the FAIR principles.
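
As a rough illustration of turning `gmx dump`-style text into structured metadata, the sketch below parses `key = value` lines into a dictionary that can be serialised as JSON. It is not the GROMACS MetaDump implementation: the sample text, `SAMPLE_DUMP`, `coerce`, and `parse_dump` are hypothetical, and real dump output contains nested sections and other line shapes that this sketch ignores.

```python
import json
import re

# Hypothetical excerpt in the "key = value" style printed by `gmx dump -s topol.tpr`.
SAMPLE_DUMP = """\
inputrec:
   integrator                     = md
   dt                             = 0.002
   nsteps                         = 500000
   tcoupl                         = V-rescale
"""

# A parameter name, an equals sign, and the rest of the line as the value.
KEY_VALUE = re.compile(r"^\s*([\w-]+)\s*=\s*(.+?)\s*$")


def coerce(value: str):
    """Convert a raw string to int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def parse_dump(text: str) -> dict:
    """Collect every 'key = value' line into a flat metadata dictionary."""
    metadata = {}
    for line in text.splitlines():
        match = KEY_VALUE.match(line)
        if match:  # section headers such as 'inputrec:' are skipped
            metadata[match.group(1)] = coerce(match.group(2))
    return metadata
```

Calling `json.dumps(parse_dump(SAMPLE_DUMP), indent=2)` would then yield a machine-readable record of the kind the abstract describes.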

Citations: 0
Correction: Enhanced Thompson sampling by roulette wheel selection for screening ultralarge combinatorial libraries
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-10-21 DOI: 10.1186/s13321-025-01113-1
Hongtao Zhao, Eva Nittinger, Melissa A. Yu, Symon Gathiaka, W. Patrick Walters, Christian Tyrchan
Citations: 0
Prediction of UGT-mediated phase II metabolism via ligand- and structure-based predictive models
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-10-15 DOI: 10.1186/s13321-025-01097-y
Ludovica Bono, Filippo Lunghini, Emanuela Sabato, Akash Deep Biswas, Angelica Mazzolari, Alessandro Pedretti, Andrea R. Beccari, Giulio Vistoli, Serena Vittorio

The prediction of drug metabolism by in silico techniques is attracting growing interest because large datasets can be processed, allowing the stability and safety of new drug candidates to be evaluated during the early stages of the drug discovery process. To date, in silico models for metabolism prediction mainly exploit the ligand-based (LB) properties of the training molecules to predict the occurrence of a given metabolic reaction and/or the reactive site involved in the biotransformation. However, recent reports highlighted that structure-based (SB) modeling can be conveniently integrated with LB methods for drug metabolism prediction, with the advantage of predicting whether a given molecule can fit the enzyme active site and which moiety approaches the catalytic residues. Herein, we developed machine learning models for UDP-glucuronosyltransferase (UGT)-mediated metabolism by using both LB and SB methods. In particular, this study focused on the UGT2B7 and UGT2B15 isoforms, which are involved in the clearance of many drugs as well as in clinically relevant drug-drug interactions. First, molecular dynamics (MD) and docking simulations were combined to explore the binding mechanism of cofactor and substrate within the catalytic pocket of the studied UGT isoforms, exploiting their AlphaFold structures. The analysis of the MD trajectories allowed an appropriate conformation of both UGT isoforms to be identified for the development of binary classification models. For this purpose, the Random Forest algorithm and the metabolic data extracted from the MetaQSAR database were used. SB models were trained on a set of scoring functions and protein–ligand interaction fingerprints derived from docking, while the LB models were built on a set of physicochemical and constitutional descriptors. When the single models were evaluated, the LB classifiers outperformed the SB models. However, applying a consensus strategy improved prediction accuracy compared with the individual models, highlighting that LB and SB approaches convey complementary information whose aggregation yields better predictions than either model alone.
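
One simple way to picture a consensus between an LB and an SB classifier is to average their predicted probabilities. The abstract does not state which consensus rule the authors used, so the mean-probability rule, the 0.5 threshold, and the function names below are assumptions made purely for illustration.

```python
from statistics import mean


def consensus_probability(lb_prob: float, sb_prob: float) -> float:
    """Average the substrate probabilities of the two models (assumed rule)."""
    for p in (lb_prob, sb_prob):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return mean([lb_prob, sb_prob])


def consensus_label(lb_prob: float, sb_prob: float, threshold: float = 0.5) -> bool:
    """Classify as substrate when the averaged probability reaches the threshold."""
    return consensus_probability(lb_prob, sb_prob) >= threshold
```

For instance, a molecule scored 0.9 by the LB model and 0.3 by the SB model averages to 0.6 and is labelled a substrate under this assumed rule, whereas scores of 0.6 and 0.2 average to 0.4 and are not.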

Metabolism prediction by in silico methods is a useful tool for evaluating the pharmacokinetic profile of new drug candidates in the early stages of drug discovery. This study provides a novel computational strategy that integrates ligand- and structure-based methods to predict UGT2B7- and UGT2B15-mediated metabolism, exploiting the AlphaFold structures of these isoforms. Combining the two approaches yielded higher performance than the individual ligand- and structure-based predictive models, and also confirmed the reliability of AlphaFold structures for developing structure-based metabolism prediction models.
Citations: 0