
Journal of Cheminformatics: Latest Articles

CoSynLLM: predicting drug combination synergy with LLM-generated descriptions.
IF 5.7; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2026-02-05; DOI: 10.1186/s13321-026-01158-w
Suwan Mao, Wenjie Tang, Li Li, Mang Jing, Yun Liu, Junjie Wang

Drug combination therapy is a well-established strategy for treating complex diseases. However, the vast combinatorial space renders exhaustive experimental screening impractical and costly. Recent studies have shown that deep learning techniques can effectively prioritize synergistic drug combinations by leveraging their powerful nonlinear modeling and automatic feature extraction capabilities. Meanwhile, large language models (LLMs) offer great promise in drug discovery. In this paper, we propose CoSynLLM, an LLM-assisted framework for predicting drug combination synergy. We leverage the latent knowledge embedded in LLMs to generate semantic-level chemical information, complemented by drug fingerprints that incorporate explicit structural details, while cell line gene expression profiles represent the cellular context. To merge drug and cell line representations, a hierarchical feature fusion strategy progressively integrates features through multiple stages. Extensive experiments on two benchmark datasets, NCI-ALMANAC and O'Neil, demonstrate that CoSynLLM achieves competitive performance. In summary, CoSynLLM effectively identifies synergistic drug combinations, offering a robust and practical computational framework for synergy prediction.

Citations: 0
Data-efficient learning for accurate identification of MAPK1 inhibitors using an active meta-deep learning framework.
IF 5.7; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2026-02-03; DOI: 10.1186/s13321-026-01159-9
Darlene Nabila Zetta, Tarapong Srisongkram

Limited experimental data remains a key challenge in applying machine learning to drug discovery, particularly for cancer-related targets. In this study, we present a data-efficient active meta-deep learning framework to predict mitogen-activated protein kinase 1 (MAPK1) inhibitors, which are promising candidates for cancer-related therapies. Our approach integrates active learning (AL) with a meta-model that combines four deep architectures: a convolutional neural network, an attention model, a graph convolutional network, and a graph neural network-attention model, trained on molecular descriptors and graph-based representations. These models generate four probability-based features that feed into an attention-based meta-learner, improving predictive performance by 5.12% in the area under the precision-recall curve (AUPRC) and 5.48% in the Matthews correlation coefficient (MCC) using only 10% of the training data. Among the AL sampling strategies evaluated, entropy sampling showed competitive performance in selecting informative molecules for model improvement. Overall, our framework achieves an AUPRC of 0.835 ± 0.017 and an MCC of 0.817 ± 0.017, on par with a traditional training method despite using only 26.7% of the training data. Compared to a conventional random forest model trained by brute force on the full (100%) training set, our approach shows a 10.6% improvement in AUPRC and modest gains in MCC, confirming the effectiveness of the proposed framework. Under severe class imbalance, balanced accuracy steadily increased across AL iterations, exceeding 0.85 at the final iteration for all uncertainty-driven strategies. Molecular docking confirmed successful prioritization of the top four predicted compounds. Evaluation on an external MAPK1 data set demonstrated generalizability, with our approach achieving an AUPRC of 0.818 and an MCC of 0.403, comparable to the independent test set. These results highlight the potential of combining intelligent data selection with deep learning architectures through the meta-model to improve predictive performance in data-scarce drug discovery.
Scientific contribution: This study contributes a novel, data-efficient active meta-deep learning framework for predicting MAPK1 inhibitors, addressing the challenge of limited experimental data for a cancer-specific target. By integrating AL with a meta-model composed of four deep architectures, the approach significantly enhances predictive performance using only a fraction of the training data. The framework achieves superior metrics compared to traditional training methods, highlighting its potential to accelerate drug discovery in data-scarce settings.
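The entropy sampling strategy mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy probabilities, and the batch size are all assumptions made for illustration. The idea is simply to rank unlabeled molecules by the Shannon entropy of the model's predicted class distribution and to query the most uncertain ones first.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_sampling(pool_probs, batch_size):
    """Return the indices of the `batch_size` most uncertain pool items."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: predictive_entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:batch_size]


# Hypothetical (inactive, active) probabilities for five unlabeled molecules.
pool = [(0.95, 0.05), (0.50, 0.50), (0.70, 0.30), (0.10, 0.90), (0.55, 0.45)]

# The two predictions closest to 50/50 are selected for labeling.
selected = entropy_sampling(pool, batch_size=2)
print(selected)
```

In a full active learning loop, the selected molecules would be labeled, added to the training set, and the model retrained before the next round of selection.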

Citations: 0
UGTformer: a pretrained graph transformer model for predicting UDP-glucuronosyltransferase-mediated drug metabolism.
IF 5.7; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2026-02-03; DOI: 10.1186/s13321-026-01163-z
Meiling Zhan, Xiang Li, Le Xiong, Wenxiang Song, Jiaojiao Fang, Guixia Liu, Yun Tang, Weihua Li

UDP-glucuronosyltransferases (UGTs) play a critical role in drug metabolism by catalyzing the glucuronidation of structurally diverse compounds. However, accurately predicting UGT-mediated sites of metabolism (SOMs) remains a challenge due to the limited availability of annotated data. In this study, we introduce UGTformer, a unified graph transformer-based framework that simultaneously performs UGT substrate classification and SOM prediction. UGTformer employs a hierarchical architecture integrating multi-hop message propagation with hop-aware and node-level transformer encoders. The model was pretrained on large-scale molecular graphs via chemically informed self-supervised tasks, and fine-tuned on a manually curated UGT metabolism data set covering four major metabolic reaction categories. In five-fold cross-validation, UGTformer achieved an AUC of 0.833 for substrate classification and 0.884 for SOM identification, outperforming multiple GNN baselines. On an independent external validation set, it maintained robust performance, demonstrating strong generalization to previously unseen molecules. By integrating chemically meaningful structural encodings and a joint learning paradigm, UGTformer delivers interpretable and biologically consistent predictions, offering a reliable and scalable approach for UGT-related metabolism prediction. The UGTformer model is freely accessible at https://lmmd.ecust.edu.cn/UGTformer/.
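The abstract does not spell out how multi-hop message propagation feeds the hop-aware encoder, so the following is only a rough sketch under stated assumptions: for each atom, one feature vector is built per hop distance by averaging the features of atoms exactly that many bonds away. The function names, the averaging rule, and the toy ethanol graph are hypothetical and are not taken from UGTformer.

```python
from collections import deque


def hop_distances(adjacency, source, max_hops):
    """Breadth-first search: map each atom within `max_hops` bonds of `source` to its hop count."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nbr in adjacency[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def hop_tokens(adjacency, features, source, max_hops):
    """Return max_hops + 1 vectors: mean atom features at hop 0, 1, ..., max_hops."""
    dist = hop_distances(adjacency, source, max_hops)
    dim = len(features[source])
    tokens = []
    for k in range(max_hops + 1):
        members = [n for n, d in dist.items() if d == k]
        if members:
            tokens.append(
                [sum(features[n][c] for n in members) / len(members) for c in range(dim)]
            )
        else:
            tokens.append([0.0] * dim)  # no atoms at this distance
    return tokens


# Toy heavy-atom graph of ethanol (C0-C1-O2) with one-hot [is_carbon, is_oxygen] features.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}

tokens_for_c1 = hop_tokens(adjacency, features, source=1, max_hops=2)
print(tokens_for_c1)
```

A sequence of per-hop vectors like this could then be passed to a transformer encoder as tokens, which is one plausible reading of "hop-aware"; the published model may construct them differently.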

Citations: 0
MolGAN-QRL: a hybrid framework for molecule generation using quantum-enhanced reinforcement learning.
IF 5.7; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2026-02-02; DOI: 10.1186/s13321-025-01148-4
Mohamed Iheb Hergli, Emna Harigua-Souiai

Discovering novel drug candidates remains a considerable challenge in pharmaceutical research. Generative AI models such as Generative Adversarial Networks (GANs) have shown considerable promise in de novo molecular generation. They have demonstrated high potential in drug discovery applications, yet they often face challenges such as limited chemical coverage and mode collapse. In the present study, we developed MolGAN-QRL, a hybrid quantum-classical framework that introduces quantum-enhanced reinforcement learning within the MolGAN architecture to address these limitations. The proposed framework leverages a hybrid reward mechanism to further optimize the chemical validity, uniqueness, and drug-likeness of the generated molecules. Experimental results demonstrated that MolGAN-QRL consistently achieved better generative performance than classical MolGAN, with up to a 16-fold increase in the count of unique and valid generated compounds under certain conditions. These gains reflected the effectiveness of quantum-guided exploration and highlighted the known trade-off between uniqueness and validity in generative chemistry. Overall, our findings underline the value of quantum-enhanced reward modeling in mitigating mode collapse and advancing molecular generation, and support the potential of hybrid quantum-classical methods to advance generative chemistry for drug discovery applications.
Scientific contribution: MolGAN-QRL is the first variant of the MolGAN framework to be augmented with a variational quantum circuit (VQC) within the reinforcement-learning reward module, rather than in the generator, the discriminator, or the noise function. It leverages a hybrid reward mechanism that trains a quantum-classical function, leading to better mitigation of mode collapse through higher uniqueness scores and 16-fold more novel, valid, and unique molecules generated.
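The validity and uniqueness scores discussed in this abstract are commonly computed as simple ratios over a batch of generated structures. The sketch below uses one widespread convention (validity as the fraction of generated strings that pass a validity check, uniqueness as the fraction of valid strings that are distinct); the paper may define them differently. The `is_valid` predicate is supplied by the caller, and the bracket-balancing check used here is a toy stand-in, not real chemistry. In practice a cheminformatics toolkit would perform the parsing.

```python
def generation_metrics(smiles_list, is_valid):
    """Return (validity, uniqueness, count of unique valid structures)."""
    valid = [s for s in smiles_list if is_valid(s)]
    unique_valid = set(valid)
    validity = len(valid) / len(smiles_list) if smiles_list else 0.0
    uniqueness = len(unique_valid) / len(valid) if valid else 0.0
    return validity, uniqueness, len(unique_valid)


def toy_is_valid(smiles):
    """Hypothetical placeholder check: parentheses must be balanced."""
    return smiles.count("(") == smiles.count(")")


# Hypothetical generator output: one duplicate and one malformed string.
generated = ["CCO", "CCO", "c1ccccc1", "C(", "CCN"]

validity, uniqueness, n_unique_valid = generation_metrics(generated, toy_is_valid)
print(validity, uniqueness, n_unique_valid)
```

A generator suffering from mode collapse can score high on validity while scoring low on uniqueness, which is why the count of molecules that are both unique and valid is a useful summary.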

Citations: 0
SPARFlow: a KNIME workflow for integrated structure-activity or structure-property relationship analysis.
IF 5.7; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2026-02-02; DOI: 10.1186/s13321-026-01156-y
Elier E Abreu-Martínez, Karina Martinez-Mayorga, Gabriel Merino

We developed SPARFlow, an open-source KNIME workflow for structure-activity or structure-property relationship (SAR/SPR) analyses. The workflow integrates data preprocessing, chemical structure curation, similarity network construction, maximum common substructure detection, R-group decomposition, activity cliff identification, and database modelability assessment. It implements established indices, including SALI, SARI, MODI*, and RMODI, to characterize SAR landscapes and assess dataset suitability for predictive modeling. SPARFlow was validated using four datasets with distinct chemical and endpoint characteristics: cruzain inhibitors, biased μ-opioid receptor agonists, pesticides, and carbonyl compounds with hydration constants.
Scientific contribution: This work introduces SPARFlow, a KNIME-integrated workflow that combines data curation, activity-cliff detection, and modelability assessment for SAR and SPR studies. The workflow provides a unified implementation of key SAR analyses within a single KNIME pipeline. It updates implementations of established metrics, including MODI* and RMODI, together with complementary indices such as SARI and SALI. It ensures consistent data flow across all modules.
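Of the indices listed above, SALI (the structure-activity landscape index) has a compact closed form: the absolute activity difference of a compound pair divided by one minus their structural similarity. The sketch below is a standalone Python illustration of that formula, not SPARFlow's KNIME implementation. The fingerprints are hypothetical sets of on-bit indices, Tanimoto similarity is assumed as the similarity measure, and the handling of identical fingerprints is a convention chosen here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0


def sali(activity_i, activity_j, fp_i, fp_j):
    """SALI = |A_i - A_j| / (1 - sim(i, j)); large values flag activity cliffs."""
    sim = tanimoto(fp_i, fp_j)
    diff = abs(activity_i - activity_j)
    if sim == 1.0:
        # Convention assumed here: identical fingerprints give an infinite
        # index when the activities differ, and zero when they agree.
        return float("inf") if diff > 0 else 0.0
    return diff / (1.0 - sim)


# Hypothetical pair: similar structures (Tanimoto 0.6), 3 log units apart in activity.
fp_a, fp_b = {1, 2, 3, 4}, {1, 2, 3, 5}
cliff_score = sali(8.0, 5.0, fp_a, fp_b)

# Hypothetical pair: dissimilar structures with the same activity gap.
fp_c = {7, 8, 9, 1}
smooth_score = sali(8.0, 5.0, fp_a, fp_c)
print(cliff_score, smooth_score)
```

The first pair scores higher because a large activity change over a small structural change is what defines an activity cliff.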

Citations: 0
Chemical genomics language model toward reliable and explainable compound-protein interaction exploration.
IF 5.7; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2026-01-30; DOI: 10.1186/s13321-026-01155-z
Takuto Koyama, Hayato Tsumura, Ryunosuke Okita, Kimihiro Yamazaki, Aki Hasegawa, Keiko Imamura, Takashi Kato, Hiroaki Iwata, Ryosuke Kojima, Haruhisa Inoue, Shigeyuki Matsumoto, Yasushi Okuno

Accurate prediction of compound-protein interactions (CPIs) is crucial for chemical biology and drug discovery. Despite recent advancements, existing deep learning (DL)-based CPI models often struggle to simultaneously achieve high generalization performance, quantify prediction confidence, and ensure explainability. Here, we propose ChemGLaM, a chemical genomics language model designed to address these three crucial challenges, thereby enabling reliable and explainable CPI predictions. ChemGLaM integrates independently pre-trained chemical and protein language models through an interaction block with a cross-attention mechanism, achieving near state-of-the-art performance in predicting novel CPIs at a low computational cost. Incorporating uncertainty estimation and attention visualization enables ChemGLaM to enhance the success rate of virtual screening and to provide molecular insights into CPIs. To demonstrate the practical impact of ChemGLaM, we constructed a publicly available database containing large-scale CPI predictions for every possible pairing between all 20,434 human proteins and all 11,455 drugs and validated its practical applicability in a case study on amyotrophic lateral sclerosis. ChemGLaM marks an important step forward in addressing the challenges of AI-driven CPI exploration and drug discovery.
Scientific contribution: This study established a unified CPI prediction framework that simultaneously achieves high generalization performance, confidence quantification, and explainability. We leveraged this framework to create a community resource by constructing a comprehensive CPI database and demonstrated its practical utility by successfully prioritizing hit compounds and deconvoluting their targets in a phenotypic screening for amyotrophic lateral sclerosis.
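The cross-attention mechanism named in this abstract can be sketched in its generic scaled dot-product form. This is a simplified illustration and not ChemGLaM's interaction block: it assumes compound token embeddings act as queries and protein residue embeddings act as keys and values, and it omits the learned projections and multiple heads that a trained model would use. All vectors are toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention of each query over all key/value pairs."""
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


# Hypothetical embeddings: two compound tokens (queries), three protein residues (keys/values).
compound_tokens = [[4.0, 0.0], [0.0, 1.0]]
residue_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
residue_values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

attended, attn_weights = cross_attention(compound_tokens, residue_keys, residue_values)
print(attn_weights)
```

The per-query weight rows are the quantities an attention visualization would display, showing which residues each compound token attends to most strongly.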

引用次数: 0
A comprehensive evaluation of advanced methods for identifying structural alerts using extensive toxicity data.
IF 8.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2026-01-30 DOI: 10.1186/s13321-026-01157-x
Ning-Ning Wang, Yuan-Hang He, Xin-Liang Li, Shao-Hua Shi, You-Chao Deng, Shao Liu, Dong-Sheng Cao
With the disclosure of the important role of substructural alerts (SA) in drug development and toxicity evaluation, many automatic substructure extraction tools based on different theoretical knowledge have been reported in recent years. To compare the emphasis of various substructure extraction methods and the reliability of their results, we were encouraged to conduct a comprehensive analysis of seven representative tools to find the best one. In this paper, we introduced a well-designed evaluation of seven popular tools (Bioalerts, KRFP, MoSS, PySmash_circular, PySmash_group, PySmash_path, and SARpy) based on 43 toxicity datasets, consisting of four components: comparison of substructures derived by different methods, comparison of predictive models based on substructural rules, comparison of the efficiency of extracting toxic substructures, and the effect of SAs on quantitative structure-activity relationship (QSAR) predictive models. The results demonstrated that PySmash_circular performed best overall, with satisfactory results in substructure information carrying and the rule-based predictive models. PySmash_path and Bioalerts were also recommended for their similar performance to PySmash_circular, but the main problem was that they took too much time and generated too many substructures. Specifically, Bioalerts and PySmash_circular could obtain substructures carrying richer information, while SARpy had the best predictive rule-based models, but it only focuses on precision (PR) value in the evaluation of individual SA. More than that, the substructures obtained by all 7 methods can enhance the recognition ability of the QSAR models for toxic compounds and make them interpretable. 
Finally, we have also made a baseline substructure set of 43 toxicity endpoints available to the public to facilitate further development of drug research and environmental safety assessment in a rapid and accurate direction. Scientific Contribution: Based on 43 toxicity datasets, we conducted a comprehensive evaluation of 7 representative substructure extraction tools from both the perspective of individual substructure and substructure-based models. This work not only enables users to make more autonomous choices of the optimal substructure extraction tool, but also provides the public with a benchmark substructure set of 43 toxicity endpoints, promoting the further development of computational toxicology.
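The abstract notes that one tool scores individual structural alerts by precision (PR) alone. As a rough illustration of why that can be limiting, the sketch below computes precision together with the fraction of toxic compounds an alert covers. The function name, the boolean encoding, and the toy data are assumptions made for illustration; they are not taken from the paper or from any of the seven tools.

```python
def alert_precision_and_coverage(alert_hits, is_toxic):
    """Score one structural alert (hypothetical helper, not from the paper).

    alert_hits[i] is True if compound i contains the alert substructure.
    is_toxic[i] is True if compound i is labelled toxic.
    Returns (precision, coverage), where precision = TP / (TP + FP) and
    coverage = TP / (number of toxic compounds).
    """
    pairs = list(zip(alert_hits, is_toxic))
    tp = sum(1 for hit, toxic in pairs if hit and toxic)
    fp = sum(1 for hit, toxic in pairs if hit and not toxic)
    n_toxic = sum(1 for _, toxic in pairs if toxic)
    # Guard against alerts that never fire or datasets with no toxic compounds.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    coverage = tp / n_toxic if n_toxic else 0.0
    return precision, coverage


# Toy data (made up): the alert fires on 2 of the 4 toxic compounds
# and on none of the non-toxic ones.
hits = [True, True, False, False, False, False]
toxic = [True, True, True, True, False, False]
precision, coverage = alert_precision_and_coverage(hits, toxic)
print(precision, coverage)
```

In this toy case the alert has perfect precision yet flags only half of the toxic compounds, which is the kind of trade-off a precision-only ranking would not reveal.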
Citations: 0
Advancing chemical safety prediction: an integrated GNN framework with DFT-augmented cyclic compound solution.
IF 8.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2026-01-28 DOI: 10.1186/s13321-026-01151-3
Seul Lee, Jooyeon Lee, Unghwi Yoon, Jahyun Koo, Young Wook Yoon, Yoonjae Cho, Seung-Ryul Hwang, Keunhong Jeong
The rapid proliferation of chemical substances presents significant challenges in assessing their safety-critical physicochemical properties. This study presents an integrated approach using Graph Neural Networks (GNNs) to predict three crucial properties for chemical safety assessment: Heat of Combustion (HoC), Vapor Pressure (VP), and Flashpoint. Leveraging comprehensive datasets of 4,780, 3,573, and 14,696 compounds respectively, we developed a unified prediction model that outperforms existing approaches. Our model achieves mean absolute errors of 126 J/mol (R² = 0.993) for HoC, 0.617 log units (R² = 0.898) for VP, and 14.42 °C (R² = 0.839) for Flashpoint, representing notable improvements over conventional methods. Through detailed analysis, we identified and addressed a specific challenge in predicting HoC for cyclic compounds by implementing a hybrid approach combining DFT calculations and Random Forest modeling. This specialized treatment expanded our cyclic compound dataset from 12 to 55 compounds and achieved an R² of 0.918 for these traditionally challenging structures. The model was integrated into a real-time prediction system using Flask, allowing users to input chemical structures through SMILES notation or direct drawing. The system includes features for comparing predictions with experimental data and benchmarking against common industrial chemicals (acetone, n-hexane, and n-decane), enhancing its practical utility in emergency response scenarios. Our approach provides a robust, unified solution for predicting multiple safety-critical properties simultaneously, addressing a crucial need in chemical safety assessment and emergency response planning. SCIENTIFIC CONTRIBUTION: Overall, this study provides an integrated framework that deploys three GNN-based prediction models within a common architecture and a real-time prediction system. 
For cyclic compounds, which exhibit systematic prediction challenges under the GNN framework, we incorporate a targeted alternative modeling strategy to improve predictive reliability, thereby enhancing the practical applicability of machine-learning approaches to chemical safety assessment.
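The abstract reports each model by its mean absolute error (MAE) and coefficient of determination (R²). For readers less familiar with these metrics, here is a minimal sketch of how both are computed from paired experimental and predicted values. The function names and numbers are illustrative assumptions and are unrelated to the paper's datasets or code.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values standing in for, e.g., flashpoints in °C.
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
mae = mean_absolute_error(measured, predicted)
r2 = r_squared(measured, predicted)
print(mae, r2)
```

MAE is expressed in the units of the property itself, while R² is unitless, which is why the abstract quotes both for each property.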
Citations: 0
STRIKER: a spectral metadata repairing tool for expanding the comprehensiveness of spectral libraries.
IF 8.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2026-01-27 DOI: 10.1186/s13321-026-01150-4
Ahmed Karam, Asmaa Ramzy, Taghreed Khaled Abdelmoneim, Maha Mokhtar, Nada A Youssef, Aya Osama, Nabila Sabar, Sameh Magdeldin
The expansion of untargeted metabolomics has made publicly accessible spectral libraries indispensable for metabolite annotation and machine learning applications. Enhancing the quality and consistency of these libraries is crucial for improving the accuracy of metabolite identification and training machine learning models. However, public spectral libraries often suffer from variability in user submissions, unintentional errors, and a lack of standardization. Existing metadata cleaning and normalization tools typically exclude spectra with incorrect or unsupported metadata rather than attempting to correct them, resulting in the loss of valuable spectral data and associated metabolite details. This study introduces STRIKER (SpecTRal lIbrary maKER), a repair tool specifically designed to address adduct metadata deficiencies using a distance-based metric and a deep learning model. STRIKER leverages advanced similarity-based approaches to predict adducts in spectra lacking adduct metadata. It corrects adduct-related errors and standardizes adduct formatting using a deep learning model based on the multi-layer perceptron (MLP) algorithm. STRIKER achieved 95-99% correct adduct matching and 98% adduct correction accuracy. These corrections substantially reduce the number of missing or unusable spectra and metabolites, thereby enhancing the accuracy of metabolite identification and improving data quality for machine learning applications. The tool also facilitates convenient construction of the Human Metabolome Database (HMDB) spectral library by integrating data files from the HMDB website. Furthermore, it enables users to extract customized sub-libraries from larger libraries, supporting tailored analyses for specific research objectives with a precise search space. STRIKER is an open-source, user-friendly Python graphical interface designed to be accessible to researchers with minimal bioinformatics expertise. 
It is available under an MIT license at https://striker-gui.sourceforge.io. Scientific contribution: The software is designed to preserve the maximum number of valid spectra in open mass spectral libraries, thereby supporting more comprehensive metabolite annotation in untargeted metabolomics. Its graphical user interface further facilitates the engagement of researchers without programming expertise, enabling them to enhance the quality and usability of spectral libraries.
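The abstract does not specify the distance-based metric STRIKER uses, so the following is only one simple way to picture the idea of recovering a missing adduct label: compare the observed precursor m/z with the compound's neutral monoisotopic mass and pick the nearest known adduct mass shift. The adduct table (singly charged positive-mode adducts only), the function name `guess_adduct`, and the tolerance are assumptions for illustration and do not describe STRIKER's implementation.

```python
# Mass shifts (Da) for a few common singly charged positive-mode adducts.
# Illustrative subset; a real tool would cover many more adducts and charges.
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,
    "[M+NH4]+": 18.033823,
    "[M+Na]+": 22.989218,
    "[M+K]+": 38.963158,
}


def guess_adduct(precursor_mz, neutral_mass, tolerance_da=0.01):
    """Return the adduct whose mass shift is closest to the observed one.

    Hypothetical helper: returns None when even the nearest adduct is
    further away than tolerance_da, so the spectrum stays unlabelled.
    """
    observed_shift = precursor_mz - neutral_mass
    name, shift = min(
        ADDUCT_SHIFTS.items(),
        key=lambda item: abs(item[1] - observed_shift),
    )
    return name if abs(shift - observed_shift) <= tolerance_da else None


# Glucose has a monoisotopic mass of about 180.0634 Da; a precursor near
# m/z 203.0526 is consistent with a sodium adduct.
print(guess_adduct(203.0526, 180.0634))
# A precursor that matches no listed adduct is left unassigned.
print(guess_adduct(190.0, 180.0634))
```

Returning `None` instead of forcing the closest match mirrors the repair-rather-than-discard goal described in the abstract: an ambiguous spectrum is left for another method to handle instead of being mislabelled.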
Citations: 0
Paths to cheminformatics: Q&A with Rajarshi Guha
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2026-01-20 DOI: 10.1186/s13321-025-01133-x
Rajarshi Guha
Citations: 0