
Journal of Cheminformatics: Latest Articles

Combining graph neural networks and transformers for few-shot nuclear receptor binding activity prediction
IF 7.1 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-09-27 DOI: 10.1186/s13321-024-00902-4
Luis H. M. Torres, Joel P. Arrais, Bernardete Ribeiro

Nuclear receptors (NRs) play a crucial role as biological targets in drug discovery. However, determining which compounds can act as endocrine disruptors and modulate the function of NRs with a reduced number of candidate drugs is a challenging task. Moreover, the computational methods for NR-binding activity prediction mostly focus on a single receptor at a time, which may limit their effectiveness. Hence, the transfer of learned knowledge among multiple NRs can improve the performance of molecular predictors and lead to the development of more effective drugs. In this research, we integrate graph neural networks (GNNs) and Transformers to introduce a few-shot GNN-Transformer, Meta-GTNRP, to predict the binding activity of compounds using the combined information of different NRs and identify potential NR-modulators with limited data. The Meta-GTNRP model captures the local information in graph-structured data and preserves the global-semantic structure of molecular graph embeddings for NR-binding activity prediction. Furthermore, a few-shot meta-learning approach is proposed to optimize model parameters for different NR-binding tasks and leverage the complementarity among multiple NR-specific tasks to predict the binding activity of compounds for each NR with just a few labeled molecules. Experiments with a compound database containing annotations on the binding activity for 11 NRs show that Meta-GTNRP outperforms other graph-based approaches. The data and code are available at: https://github.com/ltorres97/Meta-GTNRP.

Scientific contribution

The proposed few-shot GNN-Transformer model, Meta-GTNRP, captures the local structure of molecular graphs and preserves the global-semantic information of graph embeddings to predict the NR-binding activity of compounds with limited available data. A few-shot meta-learning framework adapts model parameters across NR-specific tasks for different NRs in a joint learning procedure to predict the binding activity of compounds for each NR with just a few labeled molecules in highly imbalanced data scenarios. Meta-GTNRP is a data-efficient approach that combines the strengths of GNNs and Transformers to predict the NR-binding properties of compounds through an optimized meta-learning procedure and deliver robust results that are valuable for identifying potential NR-based drug candidates.
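
The few-shot setting described above can be illustrated with a minimal sketch of how a single task (episode) for one NR might be assembled: a small support set with k labeled molecules per class, plus a query set drawn from the remaining molecules. The function name `sample_episode`, the toy molecule identifiers, and the k-per-class scheme are assumptions made for illustration; this is not the authors' Meta-GTNRP code.

```python
import random

def sample_episode(molecules, labels, k_shot, n_query, seed=0):
    """Split one task's data into a k-shot support set and a query set.

    Hypothetical helper: labels are assumed to be binary (1 = binder, 0 = non-binder).
    """
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for mol, y in zip(molecules, labels):
        by_class[y].append(mol)

    support, rest = [], []
    for y, mols in by_class.items():
        mols = mols[:]            # copy so the caller's list is not shuffled
        rng.shuffle(mols)
        support += [(m, y) for m in mols[:k_shot]]   # k labeled examples per class
        rest += [(m, y) for m in mols[k_shot:]]

    rng.shuffle(rest)
    query = rest[:n_query]        # held-out molecules used to evaluate adaptation
    return support, query

# Toy, imbalanced task: 5 binders and 15 non-binders (placeholder identifiers).
molecules = [f"mol_{i}" for i in range(20)]
labels = [1 if i < 5 else 0 for i in range(20)]
support, query = sample_episode(molecules, labels, k_shot=2, n_query=6)
print(len(support), len(query))
```

In a meta-learning loop, one such episode would typically be sampled per NR task, with the model adapted on the support set and evaluated on the query set.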

Citations: 0
A multi-view feature representation for predicting drugs combination synergy based on ensemble and multi-task attention models
IF 7.1 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-09-27 DOI: 10.1186/s13321-024-00903-3
Samar Monem, Aboul Ella Hassanien, Alaa H. Abdel-Hamid

This paper proposes a novel multi-view ensemble predictor model that is designed to address the challenge of determining synergistic drug combinations by predicting both the synergy score values and the synergy class labels of drug combinations with cancer cell lines. The proposed methodology involves representing drug features through four distinct views: Simplified Molecular-Input Line-Entry System (SMILES) features, molecular graph features, fingerprint features, and drug-target features. Cell line features, in turn, are captured through four views: gene expression features, copy number features, mutation features, and proteomics features. To prevent overfitting of the model, two techniques are employed. First, each view feature of a drug is paired with each corresponding cell line view and input into a multi-task attention deep learning model. This multi-task model is trained to simultaneously predict both the synergy score value and the synergy class label. This process results in sixteen input view features being fed into the multi-task model, producing sixteen prediction values. Subsequently, these prediction values are used as inputs for an ensemble model, which outputs the final prediction value. The ‘MVME’ model is assessed using the O’Neil dataset, which includes 38 distinct drugs combined across 39 distinct cancer cell lines to yield 22,737 drug combination pairs. For the synergy score value, the proposed model achieves a mean square error (MSE) of 206.57, a root mean square error (RMSE) of 14.30, and a Pearson score of 0.76. For the synergy class label, the model scores 0.90 for accuracy, 0.96 for precision, 0.57 for kappa, 0.96 for the area under the ROC curve (ROC-AUC), and 0.88 for the area under the precision-recall curve (PR-AUC).

Scientific contribution

This paper proposes an enhanced synergistic drug combination model that uses four distinct drug feature views and four cancer cell line views. Each view is fed into a multi-task deep learning model to predict the synergy score and the class label simultaneously. An ensemble model is applied to handle the different views and their corresponding prediction values while avoiding overfitting.
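
The count of sixteen inputs follows from pairing every drug view with every cell line view (4 × 4). The sketch below enumerates those pairings and combines sixteen per-pair predictions. The view names are shorthand labels chosen here, and the plain average in `ensemble_mean` is only a stand-in assumption, since the abstract does not specify how the ensemble model combines the predictions.

```python
from itertools import product

# Shorthand labels for the views named in the abstract (labels are illustrative).
DRUG_VIEWS = ["smiles", "molecular_graph", "fingerprint", "drug_target"]
CELL_LINE_VIEWS = ["gene_expression", "copy_number", "mutation", "proteomics"]

def pair_views(drug_views, cell_views):
    """Return every (drug view, cell line view) combination."""
    return list(product(drug_views, cell_views))

def ensemble_mean(predictions):
    """Assumed stand-in for the ensemble step: a plain average of the inputs."""
    return sum(predictions) / len(predictions)

pairs = pair_views(DRUG_VIEWS, CELL_LINE_VIEWS)
print(len(pairs))  # 4 drug views x 4 cell line views = 16

# One placeholder prediction per view pair, combined into a final value.
placeholder_predictions = [1.0] * len(pairs)
final_prediction = ensemble_mean(placeholder_predictions)
```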
Citations: 0
Computer-aided pattern scoring (C@PS): a novel cheminformatic workflow to predict ligands with rare modes-of-action
IF 7.1 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-09-23 DOI: 10.1186/s13321-024-00901-5
Sven Marcel Stefan, Katja Stefan, Vigneshwaran Namasivayam

The identification, establishment, and exploration of potential pharmacological drug targets are major steps of the drug development pipeline. Target validation requires diverse chemical tools that come with a spectrum of functionality, e.g., inhibitors, activators, and other modulators. In particular, tools with rare modes-of-action allow for a proper kinetic and functional characterization of the targets-of-interest (e.g., channels, enzymes, receptors, or transporters). In addition, functional innovation is a prime criterion for patentability and commercial exploitation, which may lead to therapeutic benefit. Unfortunately, data on new, and thus undruggable or barely druggable, targets are scarce and mostly available for mainstream modes-of-action only (e.g., inhibition). Here we present a novel cheminformatic workflow, computer-aided pattern scoring (C@PS), which was specifically designed to project its prediction capabilities into an uncharted domain of applicability.

Scientific contribution

The presented workflow addresses, for the first time, the challenge of data scarcity, particularly for rare modes-of-action. In addition, the workflow and the associated dataset set new standards for defining and applying criteria for the rational selection of drug candidates, addressing an important gap in cheminformatics, computational chemistry, and medicinal chemistry.
Citations: 0
EC-Conf: A ultra-fast diffusion model for molecular conformation generation with equivariant consistency
IF 7.1 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-09-03 DOI: 10.1186/s13321-024-00893-2
Zhiguang Fan, Yuedong Yang, Mingyuan Xu, Hongming Chen

Despite recent advances in 3D molecular conformation generation driven by diffusion models, the high computational cost of the iterative diffusion/denoising process limits their application. Here, an equivariant consistency model (EC-Conf) is proposed as a fast diffusion method for low-energy conformation generation. In EC-Conf, a modified SE(3)-equivariant transformer model is used directly to encode the Cartesian molecular conformations, and a highly efficient consistency diffusion process is carried out to generate molecular conformations. With only one sampling step, it already achieves quality comparable to that of other diffusion-based models running with thousands of denoising steps. Its performance can be further improved with a few more sampling iterations. The performance of EC-Conf is evaluated on both the GEOM-QM9 and GEOM-Drugs sets. Our results demonstrate that the efficiency of EC-Conf in learning the distribution of low-energy molecular conformations is at least two orders of magnitude higher than that of current SOTA diffusion models, and that it could potentially become a useful tool for conformation generation and sampling.

Scientific contribution

In this work, we propose an equivariant consistency model that significantly improves the efficiency of diffusion-model-based conformation generation while maintaining high structural quality. The method can serve as a general framework and can be extended in future steps to more complex structure generation and prediction tasks, including those involving proteins.
Citations: 0
RAIChU: automating the visualisation of natural product biosynthesis
IF 7.1 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-09-03 DOI: 10.1186/s13321-024-00898-x
Barbara R. Terlouw, Friederike Biermann, Sophie P. J. M. Vromans, Elham Zamani, Eric J. N. Helfrich, Marnix H. Medema
Natural products are molecules that fulfil a range of important ecological functions. Many natural products have been exploited for pharmaceutical and agricultural applications. In contrast to many other specialised metabolites, the products of modular nonribosomal peptide synthetase (NRPS) and polyketide synthase (PKS) systems can often (partially) be predicted from the DNA sequence of the biosynthetic gene clusters. This is because the biosynthetic pathways of NRPS and PKS systems adhere to consistent rulesets. These universal biosynthetic rules can be leveraged to generate biosynthetic models of biosynthetic pathways. While these principles have been largely deciphered, software that leverages these rules to automatically generate visualisations of biosynthetic models has not yet been developed. To enable high-quality automated visualisations of natural product biosynthetic pathways, we developed RAIChU (Reaction Analysis through Illustrating Chemical Units), which produces depictions of biosynthetic transformations of PKS, NRPS, and hybrid PKS/NRPS systems from predicted or experimentally verified module architectures and domain substrate specificities. RAIChU also boasts a library of functions to perform and visualise reactions and pathways whose specifics (e.g., regioselectivity, stereoselectivity) are still difficult to predict, including terpenes, ribosomally synthesised and posttranslationally modified peptides and alkaloids. Additionally, RAIChU includes 34 prevalent tailoring reactions to enable the visualisation of biosynthetic pathways of fully maturated natural products. RAIChU can be integrated into Python pipelines, allowing users to upload and edit results from antiSMASH, a widely used BGC detection and annotation tool, or to build biosynthetic PKS/NRPS systems from scratch. RAIChU’s cluster drawing correctness (100%) and drawing readability (97.66%) were validated on 5000 randomly generated PKS/NRPS systems, and on the MIBiG database. The automated visualisation of these pathways accelerates the generation of biosynthetic models, facilitates the analysis of large (meta-) genomic datasets and reduces human error. RAIChU is available at https://github.com/BTheDragonMaster/RAIChU and https://pypi.org/project/raichu.

Scientific contribution

RAIChU is the first software package capable of automating high-quality visualisations of natural product biosynthetic pathways. By leveraging universal biosynthetic rules, RAIChU enables the depiction of complex biosynthetic transformations for PKS, NRPS, ribosomally synthesised and posttranslationally modified peptide (RiPP), terpene and alkaloid systems, enhancing predictive and analytical capabilities. This innovation not only streamlines the creation of biosynthetic models, making the analysis of large genomic datasets more efficient and accurate, but also bridges a crucial gap in predicting and visualising the complexities of natural product biosynthesis.
Citations: 0
Evaluating the generalizability of graph neural networks for predicting collision cross section
IF 7.1 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-08-29 DOI: 10.1186/s13321-024-00899-w
Chloe Engler Hart, António José Preto, Shaurya Chanana, David Healey, Tobias Kind, Daniel Domingo-Fernández

Ion Mobility coupled with Mass Spectrometry (IM-MS) is a promising analytical technique that enhances molecular characterization by measuring collision cross-section (CCS) values, which are indicative of molecular size and shape. However, the effective application of CCS values in structural analysis is still constrained by the limited availability of experimental data, necessitating the development of accurate machine learning (ML) models for in silico predictions. In this study, we evaluated state-of-the-art Graph Neural Networks (GNNs), trained to predict CCS values using the largest publicly available dataset to date. Although our results confirm the high accuracy of these models within chemical spaces similar to their training environments, their performance declines significantly when applied to structurally novel regions. This discrepancy raises concerns about the reliability of in silico CCS predictions and underscores the need for releasing further publicly available CCS datasets. To mitigate this, we introduce Mol2CCS, which demonstrates how generalization can be partially improved by extending models to account for additional features such as molecular fingerprints, descriptors, and molecule types. Lastly, we also show how confidence models can help by enhancing the reliability of the CCS estimates.

Scientific contribution

We have benchmarked state-of-the-art graph neural networks for predicting collision cross section. Our work highlights the accuracy of these models when training and prediction take place in similar chemical spaces, but also how their accuracy drops when they are evaluated in structurally novel regions. Lastly, we conclude by presenting potential approaches to mitigate this issue.
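
One common way to quantify whether a molecule lies in a "structurally novel region" is its maximum fingerprint similarity to the training set. The sketch below uses the Tanimoto coefficient on sets of fingerprint on-bits. The abstract does not state how the authors measured novelty, so the metric, the function names, and the toy fingerprints are all assumptions made for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)

def max_train_similarity(query_fp, train_fps):
    """Similarity of a query molecule to its nearest neighbour in the training set."""
    return max(tanimoto(query_fp, fp) for fp in train_fps)

# Toy fingerprints (sets of on-bit indices), purely illustrative.
train_fps = [{1, 2, 3, 4}, {2, 3, 5}]
familiar = {1, 2, 3}   # shares most bits with a training molecule
novel = {7, 8, 9}      # shares no bits with any training molecule

familiar_score = max_train_similarity(familiar, train_fps)
novel_score = max_train_similarity(novel, train_fps)
print(familiar_score, novel_score)
```

Test molecules with a low maximum similarity could then be grouped into a separate evaluation bucket, so that accuracy is reported separately for familiar and novel chemical space.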

Citations: 0
BuildAMol: a versatile Python toolkit for fragment-based molecular design
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-08-25 | DOI: 10.1186/s13321-024-00900-6
Noah Kleinschmidt, Thomas Lemmin

In recent years, computational methods for molecular modeling have become a prime focus of computational biology and cheminformatics. Many dedicated systems exist for modeling specific classes of molecules such as proteins or small drug-like ligands. These are often heavily tailored toward the automated generation of molecular structures based on some meta-input by the user and are not intended for expert-driven structure assembly. Dedicated manual or semi-automated assembly software tools exist for a variety of molecule classes but are limited in the scope of structures they can produce. In this work, we present BuildAMol, a highly flexible and extendable, general-purpose, fragment-based molecular assembly toolkit. Written in Python and featuring a well-documented, user-friendly API, BuildAMol gives researchers a framework for detailed manual or semi-automated construction of diverse molecular models. Unlike specialized software, BuildAMol caters to a broad range of applications. We demonstrate its versatility across various use cases, including the generation of metal complexes, the modeling of dendrimers, and integration into a drug discovery pipeline. By providing a robust foundation for expert-driven model building, BuildAMol holds promise as a valuable tool for the continuous integration and advancement of powerful deep learning techniques.

Scientific contribution

BuildAMol introduces a cutting-edge framework for molecular modeling that seamlessly blends versatility with user-friendly accessibility. This innovative toolkit integrates modeling, modification, optimization, and visualization functions within a unified API, and facilitates collaboration with other cheminformatics libraries. BuildAMol, with its shallow learning curve, serves as a versatile tool for various molecular applications while also laying the groundwork for the development of specialized software tools, contributing to the progress of molecular research and innovation.

Citations: 0
Deep learning of multimodal networks with topological regularization for drug repositioning
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-08-23 | DOI: 10.1186/s13321-024-00897-y
Yuto Ohnuki, Manato Akiyama, Yasubumi Sakakibara

Motivation

Computational techniques for drug-disease prediction are essential in enhancing drug discovery and repositioning. While many methods utilize multimodal networks from various biological databases, few integrate comprehensive multi-omics data, including transcriptomes, proteomes, and metabolomes. We introduce STRGNN, a novel graph deep learning approach that predicts drug-disease relationships using extensive multimodal networks comprising proteins, RNAs, metabolites, and compounds. We have constructed a detailed dataset incorporating multi-omics data and developed a learning algorithm with topological regularization. This algorithm selectively leverages informative modalities while filtering out redundancies.

Results

STRGNN demonstrates superior accuracy compared to existing methods and has identified several novel drug effects, corroborating existing literature. STRGNN emerges as a powerful tool for drug prediction and discovery. The source code for STRGNN, along with the dataset for performance evaluation, is available at https://github.com/yuto-ohnuki/STRGNN.git.

Citations: 0
Automatic molecular fragmentation by evolutionary optimisation
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-08-19 | DOI: 10.1186/s13321-024-00896-z
Fiona C. Y. Yu, Jorge L. Gálvez Vallejo, Giuseppe M. J. Barca

Molecular fragmentation is an effective suite of approaches to reduce the formal computational complexity of quantum chemistry calculations while enhancing their algorithmic parallelisability. However, the practical applicability of fragmentation techniques remains hindered by a dearth of automation and of effective metrics to assess the quality of a fragmentation scheme. In this article, we present Quick Fragmentation via Automated Genetic Search (QFRAGS), a novel automated fragmentation algorithm that uses a genetic optimisation procedure to generate molecular fragments that yield low energy errors when adopted in Many Body Expansions (MBEs). Benchmark testing of QFRAGS on protein systems with fewer than 500 atoms, using two-body (MBE2) and three-body (MBE3) MBE calculations at the HF/6-31G* level, reveals mean absolute energy errors (MAEEs) of 20.6 and 2.2 kJ mol$^{-1}$, respectively. For larger protein systems exceeding 500 atoms, MAEEs are 181.5 kJ mol$^{-1}$ for MBE2 and 24.3 kJ mol$^{-1}$ for MBE3. Furthermore, when compared to three manual fragmentation schemes on a 40-protein dataset, using both MBE and Fragment Molecular Orbital techniques, QFRAGS achieves comparable or often lower MAEEs. When applied to a 10-lipoglycan/glycolipid dataset, MAEEs of 7.9 and 0.3 kJ mol$^{-1}$ were observed at the MBE2 and MBE3 levels, respectively.

Scientific contribution

This article presents Quick Fragmentation via Automated Genetic Search (QFRAGS), an innovative molecular fragmentation algorithm that significantly improves upon existing molecular fragmentation approaches by specifically addressing their lack of automation and effective fragmentation quality metrics. With an evolutionary optimisation strategy, QFRAGS actively pursues high quality fragments, generating fragmentation schemes that exhibit minimal energy errors on systems with hundreds to thousands of atoms. The advent of QFRAGS represents a significant advancement in molecular fragmentation, greatly improving the accessibility and computational feasibility of accurate quantum chemistry calculations.
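
To make the error metric in this abstract concrete, the following is a minimal sketch of the standard two-body many-body expansion (fragment energies plus pairwise corrections) and of a mean absolute energy error over several systems. It does not reproduce QFRAGS or its genetic search. The function names `mbe2_energy` and `mean_absolute_energy_error` and all energy values are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def mbe2_energy(monomer, dimer):
    """Two-body expansion: sum of fragment energies plus pairwise corrections.

    E_MBE2 = sum_i E_i + sum_{i<j} (E_ij - E_i - E_j)
    """
    total = sum(monomer.values())
    for i, j in combinations(sorted(monomer), 2):
        total += dimer[(i, j)] - monomer[i] - monomer[j]
    return total


def mean_absolute_energy_error(predicted, reference):
    """Mean of |E_predicted - E_reference| over a set of systems."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Hypothetical toy energies in arbitrary units (not taken from the paper).
monomer = {"A": -10.0, "B": -20.0, "C": -30.0}
dimer = {("A", "B"): -30.5, ("A", "C"): -40.25, ("B", "C"): -50.75}

e_mbe2 = mbe2_energy(monomer, dimer)

# Compare against hypothetical full-system reference energies for two systems.
maee = mean_absolute_energy_error([e_mbe2, -12.0], [-61.75, -12.5])
```

A three-body expansion would add the analogous correction over fragment triples. The quality of either truncation depends on where the molecule is cut, which is the choice a fragmentation algorithm has to make.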

Citations: 0
Democratizing cheminformatics: interpretable chemical grouping using an automated KNIME workflow
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-08-16 | DOI: 10.1186/s13321-024-00894-1
José T. Moreira-Filho, Dhruv Ranganath, Mike Conway, Charles Schmitt, Nicole Kleinstreuer, Kamel Mansouri

With the increased availability of chemical data in public databases, innovative techniques and algorithms have emerged for the analysis, exploration, visualization, and extraction of information from these data. One such technique is chemical grouping, where chemicals with common characteristics are categorized into distinct groups based on physicochemical properties, use, biological activity, or a combination of these. However, existing tools for chemical grouping often require specialized programming skills or the use of commercial software packages. To address these challenges, we developed a user-friendly chemical grouping workflow implemented in KNIME, a free, open-source, low/no-code data analytics platform. The workflow incorporates a range of processes such as molecular descriptor calculation, feature selection, dimensionality reduction, hyperparameter search, and supervised and unsupervised machine learning methods, enabling effective chemical grouping and visualization of results. Furthermore, we implemented tools for interpretation, identifying key molecular descriptors for the chemical groups, and using natural language summaries to clarify the rationale behind these groupings. The workflow was designed to run seamlessly both in the KNIME local desktop version and in the KNIME Server WebPortal as a web application. It incorporates interactive interfaces and guides to assist users in a step-by-step manner. We demonstrate the utility of this workflow through a case study using an eye irritation and corrosion dataset.

Scientific contributions

This work presents a novel, comprehensive chemical grouping workflow in KNIME, enhancing accessibility by integrating a user-friendly graphical interface that eliminates the need for extensive programming skills. This workflow uniquely combines several features such as automated molecular descriptor calculation, feature selection, dimensionality reduction, and machine learning algorithms (both supervised and unsupervised), with hyperparameter optimization to refine chemical grouping accuracy. Moreover, we have introduced an innovative interpretative step and natural language summaries to elucidate the underlying reasons for chemical groupings, significantly advancing the usability of the tool and interpretability of the results.
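
The unsupervised part of the grouping idea described above can be illustrated with a small sketch: standardize each descriptor, then cluster the chemicals with plain k-means. This is not the KNIME workflow, which is low/no-code and includes steps omitted here. The chemicals, descriptor values, and function names below are hypothetical, and the starting centroids are fixed by hand so the run is deterministic.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each descriptor column to zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # constant columns keep a divisor of 1
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, init_centroids, n_iter=10):
    """Plain Lloyd's k-means starting from the given centroids."""
    centroids = [list(c) for c in init_centroids]  # copy so the input is untouched
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [mean(col) for col in zip(*members)]
    return labels


# Hypothetical descriptors per chemical: [logP, molecular weight].
names = ["chem_a", "chem_b", "chem_c", "chem_d"]
descriptors = [[1.0, 100.0], [1.2, 110.0], [4.0, 300.0], [4.2, 310.0]]

scaled = zscore_columns(descriptors)
labels = kmeans(scaled, init_centroids=[scaled[0], scaled[2]])
groups = dict(zip(names, labels))
```

Scaling comes first because descriptors such as molecular weight span a much larger numeric range than logP and would otherwise dominate the distance.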

Citations: 0