
Journal of Cheminformatics: latest articles

Large-scale annotation of biochemically relevant pockets and tunnels in cognate enzyme–ligand complexes
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-10-15 | DOI: 10.1186/s13321-024-00907-z
O. Vavra, J. Tyzack, F. Haddadi, J. Stourac, J. Damborsky, S. Mazurenko, J. M. Thornton, D. Bednar

Tunnels in enzymes with buried active sites are key structural features allowing the entry of substrates and the release of products, thus contributing to the catalytic efficiency. Targeting the bottlenecks of protein tunnels is also a powerful protein engineering strategy. However, the identification of functional tunnels in multiple protein structures is a non-trivial task that can only be addressed computationally. We present a pipeline integrating automated structural analysis with an in-house machine-learning predictor for the annotation of protein pockets, followed by the calculation of the energetics of ligand transport via biochemically relevant tunnels. A thorough validation using eight distinct molecular systems revealed that CaverDock analysis of ligand un/binding is on par with time-consuming molecular dynamics simulations, but much faster. The optimized and validated pipeline was applied to annotate more than 17,000 cognate enzyme–ligand complexes. Analysis of ligand un/binding energetics indicates that the top priority tunnel has the most favourable energies in 75% of cases. Moreover, energy profiles of cognate ligands revealed that a simple geometry analysis can correctly identify tunnel bottlenecks only in 50% of cases. Our study provides essential information for the interpretation of results from tunnel calculation and energy profiling in mechanistic enzymology and protein engineering. We formulated several simple rules allowing identification of biochemically relevant tunnels based on the binding pockets, tunnel geometry, and ligand transport energy profiles.

Scientific contributions

The pipeline introduced in this work allows for the detailed analysis of a large set of protein–ligand complexes, focusing on transport pathways. We introduce a novel predictor for determining the relevance of binding pockets for tunnel calculation. For the first time in the field, we present a high-throughput energetic analysis of ligand binding and unbinding, showing that approximate methods for these simulations can identify additional mutagenesis hotspots in enzymes compared to purely geometrical methods. The predictor is included in the supplementary material and can also be accessed at https://github.com/Faranehhad/Large-Scale-Pocket-Tunnel-Annotation.git. The tunnel data calculated in this study have been made publicly available as part of the ChannelsDB 2.0 database, accessible at https://channelsdb2.biodata.ceitec.cz/.
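
The abstract contrasts bottlenecks found by geometry alone with those indicated by ligand transport energy profiles. The following is a minimal sketch of that comparison, not the authors' pipeline: it assumes a tunnel is sampled at discrete steps, that the geometric bottleneck is the step with the smallest radius, and that the energetic bottleneck is the step with the highest energy. The function names, the tolerance parameter, and the sample values are all hypothetical.

```python
from typing import Sequence


def geometric_bottleneck(radii: Sequence[float]) -> int:
    """Index of the narrowest step along the tunnel (minimum radius)."""
    return min(range(len(radii)), key=lambda i: radii[i])


def energetic_bottleneck(energies: Sequence[float]) -> int:
    """Index of the highest point of the ligand transport energy profile."""
    return max(range(len(energies)), key=lambda i: energies[i])


def bottlenecks_agree(radii: Sequence[float],
                      energies: Sequence[float],
                      tolerance: int = 1) -> bool:
    """True if both bottlenecks lie within `tolerance` steps of each other."""
    if len(radii) != len(energies):
        raise ValueError("radii and energies must be sampled at the same steps")
    gap = abs(geometric_bottleneck(radii) - energetic_bottleneck(energies))
    return gap <= tolerance


# Hypothetical profile: radii in angstroms, energies in kcal/mol.
radii = [2.4, 1.9, 1.2, 1.6, 2.1, 2.8]
energies = [-6.0, -5.1, -4.8, -3.9, -2.7, -5.5]
agree = bottlenecks_agree(radii, energies)
```

In this made-up profile the narrowest step and the energy maximum fall at different positions, which is the kind of disagreement the abstract reports for about half of the analysed cases.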

Citations: 0
Bitter peptide prediction using graph neural networks
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-10-07 | DOI: 10.1186/s13321-024-00909-x
Prashant Srivastava, Alexandra Steuer, Francesco Ferri, Alessandro Nicoli, Kristian Schultz, Saptarshi Bej, Antonella Di Pizio, Olaf Wolkenhauer

Bitter taste is an unpleasant taste modality that affects food consumption. Bitter peptides are generated during enzymatic processes that produce functional, bioactive protein hydrolysates or during the aging process of fermented products such as cheese, soybean protein, and wine. Understanding the underlying peptide sequences responsible for bitter taste can pave the way for more efficient identification of these peptides. This paper presents BitterPep-GCN, a feature-agnostic graph convolution network for bitter peptide prediction. The graph-based model learns the embedding of amino acids in the bitter peptide sequences and uses mixed pooling for bitter classification. BitterPep-GCN was benchmarked using BTP640, a publicly available bitter peptide dataset. The latent peptide embeddings generated by the trained model were used to analyze the activity of sequence motifs responsible for the bitter taste of the peptides. In particular, we calculated the activity for individual amino acids and for dipeptide, tripeptide, and tetrapeptide sequence motifs present in the peptides. Our analyses pinpoint specific amino acids, such as F, G, P, and R, as well as sequence motifs, notably tripeptide and tetrapeptide motifs containing FF, as key bitter signatures in peptides. This work not only provides a new predictor of bitter taste for more efficient identification of bitter peptides in various food products but also offers insight into the molecular basis of bitterness.

Scientific Contribution

Our work provides the first application of Graph Neural Networks for the prediction of peptide bitter taste. The best-developed model, BitterPep-GCN, learns the embedding of amino acids in the bitter peptide sequences and uses mixed pooling for bitter classification. The embeddings were used to analyze the sequence motifs responsible for the bitter taste.
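
To make the graph and pooling steps concrete, here is a small sketch under stated assumptions. The abstract does not say how BitterPep-GCN builds its graphs or defines mixed pooling, so this example assumes residues are nodes, consecutive residues are connected by an edge, and mixed pooling concatenates the mean and the maximum of the node embeddings. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def peptide_to_graph(sequence: str) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Nodes are residues; edges link consecutive residues (an assumption)."""
    nodes = list(sequence)
    edges = [(i, i + 1) for i in range(len(nodes) - 1)]
    return nodes, edges


def mixed_pool(node_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate mean pooling and max pooling over all node embeddings."""
    if not node_embeddings:
        raise ValueError("at least one node embedding is required")
    count = len(node_embeddings)
    dim = len(node_embeddings[0])
    mean_part = [sum(v[d] for v in node_embeddings) / count for d in range(dim)]
    max_part = [max(v[d] for v in node_embeddings) for d in range(dim)]
    return mean_part + max_part


# A tripeptide containing the FF motif highlighted in the abstract.
nodes, edges = peptide_to_graph("FFG")
# Toy two-dimensional embeddings, one per node, made up for illustration.
graph_vector = mixed_pool([[1.0, 4.0], [3.0, 2.0]])
```

The pooled vector has twice the embedding dimension and would feed a classifier head in a real model; the learned graph convolution layers are omitted here.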

Citations: 0
Data mining of PubChem bioassay records reveals diverse OXPHOS inhibitory chemotypes as potential therapeutic agents against ovarian cancer
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-10-07 | DOI: 10.1186/s13321-024-00906-0
Sejal Sharma, Liping Feng, Nicha Boonpattrawong, Arvinder Kapur, Lisa Barroilhet, Manish S. Patankar, Spencer S. Ericksen

Focused screening on target-prioritized compound sets can be an efficient alternative to high throughput screening (HTS). For most biomolecular targets, compound prioritization models depend on prior screening data or a target structure. For phenotypic or multi-protein pathway targets, it may not be clear which public assay records provide relevant data. The question also arises as to whether data collected from disparate assays might be usefully consolidated. Here, we report on the development and application of a data mining pipeline to examine these issues. To illustrate, we focus on identifying inhibitors of oxidative phosphorylation, a druggable metabolic process in epithelial ovarian tumors. The pipeline compiled 8415 available OXPHOS-related bioassays in the PubChem data repository involving 312,093 unique compound records. Application of PubChem assay activity annotations, PAINS (Pan Assay Interference Compounds), and Lipinski-like bioavailability filters yields 1852 putative OXPHOS-active compounds that fall into 464 clusters. These chemotypes are diverse but have relatively high hydrophobicity and molecular weight and lower complexity and drug-likeness. They show a high abundance of bicyclic ring systems and oxygen-containing functional groups including ketones, allylic oxides (alpha/beta unsaturated carbonyls), hydroxyl groups, and ethers. In contrast, amide and primary amine functional groups have a notably lower than random prevalence. UMAP representation of the chemical space shows strong divergence in the regions occupied by OXPHOS-inactive and -active compounds. Of the six compounds selected for biological testing, four showed statistically significant inhibition of electron transport in bioenergetics assays. Two of these four compounds, lacidipine and esbiothrin, increased intracellular oxygen radicals (a major hallmark of most OXPHOS inhibitors) and decreased the viability of two ovarian cancer cell lines, ID8 and OVCAR5. Finally, data from the pipeline were used to train random forest and support vector classifiers that effectively prioritized OXPHOS inhibitory compounds within a held-out test set (ROCAUC 0.962 and 0.927, respectively) and on another set containing 44 documented OXPHOS inhibitors outside of the training set (ROCAUC 0.900 and 0.823). This prototype pipeline is extensible and could be adapted for focused screening on other phenotypic targets for which sufficient public data are available.

Scientific contribution

Here, we describe and apply an assay data mining pipeline to compile, process, filter, and mine public bioassay data. We believe the procedure may be more broadly applied to guide compound selection in early-stage hit finding on novel multi-protein mechanistic or phenotypic targets. To demonstrate the utility of our approach, we apply a data mining strategy on a large set of public assay data to find drug-like molecules that inhibit oxidative phosphorylation (OXPHOS) as candidates for ovarian cancer therapy.
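
The classifiers above are compared by ROC AUC, which can be read as the probability that a randomly chosen active compound is scored above a randomly chosen inactive one. The sketch below computes it directly from that pairwise definition. It is an illustration only: the labels and scores are invented, and the function name is hypothetical rather than taken from the authors' code.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (active, inactive) pairs ranked correctly.

    Tied scores count as half a correct ranking.
    """
    actives = [s for y, s in zip(labels, scores) if y == 1]
    inactives = [s for y, s in zip(labels, scores) if y == 0]
    if not actives or not inactives:
        raise ValueError("both classes must be present")
    correct = 0.0
    for a in actives:
        for n in inactives:
            if a > n:
                correct += 1.0
            elif a == n:
                correct += 0.5
    return correct / (len(actives) * len(inactives))


# Invented example: 1 = OXPHOS inhibitor, 0 = inactive compound.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
auc = roc_auc(labels, scores)
```

Here five of the six active/inactive pairs are ordered correctly, so the value is 5/6. The pairwise loop is quadratic in the number of compounds; library implementations use a rank-based computation that gives the same result.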
Citations: 0
Insights into predicting small molecule retention times in liquid chromatography using deep learning
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-10-07 | DOI: 10.1186/s13321-024-00905-1
Yuting Liu, Akiyasu C. Yoshizawa, Yiwei Ling, Shujiro Okuda

In untargeted metabolomics, structures of small molecules are annotated using liquid chromatography-mass spectrometry by leveraging information from the molecular retention time (RT) in the chromatogram and m/z (formerly called "mass-to-charge ratio") in the mass spectrum. However, correct identification of metabolites is challenging due to the vast array of small molecules. Various in silico tools for mass spectrometry peak alignment and compound prediction have therefore been developed, yet the list of candidate compounds remains extensive. Accurate RT prediction is important to exclude false candidates and facilitate metabolite annotation. Recent advancements in artificial intelligence (AI) have led to significant breakthroughs in the use of deep learning models in various fields. Release of a large RT dataset has mitigated the bottlenecks limiting the application of deep learning models, thereby improving their application in RT prediction tasks. This review lists the databases that can be used to expand training datasets and addresses the issue of molecular representation inconsistencies in datasets. It also discusses the application of AI technology for RT prediction, particularly in the 5 years following the release of the METLIN small molecule RT dataset. This review provides a comprehensive overview of the AI applications used for RT prediction, highlighting the progress and remaining challenges.

This article highlights the progress made over the past five years in computational metabolomics for small molecule retention time prediction, with particular emphasis on the application of AI technology in this field. It reviews publicly available small molecule retention time datasets, molecular representation methods, and the AI algorithms applied in recent studies. It also discusses how effective these models are in assisting small molecule structure annotation and the challenges that must be addressed to achieve practical application.
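
One way predicted retention times can be used to exclude false candidates is a simple tolerance window around the observed RT. The sketch below shows that idea only; the candidate names, the predicted values, the units, and the tolerance are all hypothetical, and the review does not prescribe this particular rule.

```python
from typing import List, Sequence, Tuple


def filter_by_rt(candidates: Sequence[Tuple[str, float]],
                 observed_rt: float,
                 tolerance: float) -> List[str]:
    """Keep candidates whose predicted RT is within `tolerance` of the observed RT."""
    if tolerance < 0:
        raise ValueError("tolerance must be non-negative")
    return [name for name, predicted_rt in candidates
            if abs(predicted_rt - observed_rt) <= tolerance]


# Hypothetical candidates sharing one m/z, with model-predicted RTs in seconds.
candidates = [("candidate_A", 312.0), ("candidate_B", 455.5), ("candidate_C", 330.2)]
kept = filter_by_rt(candidates, observed_rt=320.0, tolerance=15.0)
```

In practice the tolerance would be tied to the prediction error of the RT model, so that a more accurate model permits a narrower window and a shorter candidate list.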
Citations: 0
Combining graph neural networks and transformers for few-shot nuclear receptor binding activity prediction
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-09-27 | DOI: 10.1186/s13321-024-00902-4
Luis H. M. Torres, Joel P. Arrais, Bernardete Ribeiro

Nuclear receptors (NRs) play a crucial role as biological targets in drug discovery. However, determining which compounds can act as endocrine disruptors and modulate the function of NRs with a reduced number of candidate drugs is a challenging task. Moreover, the computational methods for NR-binding activity prediction mostly focus on a single receptor at a time, which may limit their effectiveness. Hence, the transfer of learned knowledge among multiple NRs can improve the performance of molecular predictors and lead to the development of more effective drugs. In this research, we integrate graph neural networks (GNNs) and Transformers to introduce a few-shot GNN-Transformer, Meta-GTNRP, to predict the binding activity of compounds using the combined information of different NRs and identify potential NR-modulators with limited data. The Meta-GTNRP model captures the local information in graph-structured data and preserves the global-semantic structure of molecular graph embeddings for NR-binding activity prediction. Furthermore, a few-shot meta-learning approach is proposed to optimize model parameters for different NR-binding tasks and leverage the complementarity among multiple NR-specific tasks to predict binding activity of compounds for each NR with just a few labeled molecules. Experiments with a compound database containing annotations on the binding activity for 11 NRs show that Meta-GTNRP outperforms other graph-based approaches. The data and code are available at: https://github.com/ltorres97/Meta-GTNRP.

Scientific contribution

The proposed few-shot GNN-Transformer model, Meta-GTNRP, captures the local structure of molecular graphs and preserves the global-semantic information of graph embeddings to predict the NR-binding activity of compounds with limited available data. A few-shot meta-learning framework adapts model parameters across NR-specific tasks for different NRs in a joint learning procedure to predict the binding activity of compounds for each NR with just a few labeled molecules in highly imbalanced data scenarios. Meta-GTNRP is a data-efficient approach that combines the strengths of GNNs and Transformers to predict the NR-binding properties of compounds through an optimized meta-learning procedure and deliver robust results that are valuable for identifying potential NR-based drug candidates.
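
Few-shot meta-learning is usually organised into episodes: for each task (here, one NR) a small labeled support set is used for adaptation and the remaining molecules form the query set for evaluation. The sketch below shows only that episode split, assuming a balanced k-shot support set per class and unique molecule identifiers. The function name and the data are hypothetical, and the actual sampling scheme used by Meta-GTNRP may differ.

```python
import random
from typing import List, Sequence, Tuple


def sample_episode(molecules: Sequence[str],
                   labels: Sequence[int],
                   k_shot: int,
                   rng: random.Random) -> Tuple[List[str], List[str]]:
    """Split one NR task into a k-shot-per-class support set and a query set."""
    binders = [m for m, y in zip(molecules, labels) if y == 1]
    non_binders = [m for m, y in zip(molecules, labels) if y == 0]
    if len(binders) < k_shot or len(non_binders) < k_shot:
        raise ValueError("not enough molecules in one class for this k_shot")
    # Draw k labeled examples from each class for the support set.
    support = rng.sample(binders, k_shot) + rng.sample(non_binders, k_shot)
    chosen = set(support)
    # Everything not used for adaptation is held out as the query set.
    query = [m for m in molecules if m not in chosen]
    return support, query


# Hypothetical imbalanced task: 3 binders and 7 non-binders for one receptor.
molecules = [f"mol_{i}" for i in range(10)]
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
support, query = sample_episode(molecules, labels, k_shot=2, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps episodes reproducible, which matters when comparing models across the same set of sampled tasks.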

Citations: 0
A multi-view feature representation for predicting drugs combination synergy based on ensemble and multi-task attention models
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-09-27 DOI: 10.1186/s13321-024-00903-3
Samar Monem, Aboul Ella Hassanien, Alaa H. Abdel-Hamid

This paper proposes a novel multi-view ensemble predictor model designed to address the challenge of determining synergistic drug combinations by predicting both the synergy score value and the synergy class label of drug combinations with cancer cell lines. The proposed methodology represents drug features through four distinct views: Simplified Molecular-Input Line-Entry System (SMILES) features, molecular graph features, fingerprint features, and drug-target features. Cell line features are likewise captured through four views: gene expression features, copy number features, mutation features, and proteomics features. To prevent overfitting of the model, two techniques are employed. First, each view feature of a drug is paired with each corresponding cell line view and input into a multi-task attention deep learning model. This multi-task model is trained to simultaneously predict both the synergy score value and the synergy class label. This process results in sixteen input view features being fed into the multi-task model, producing sixteen prediction values. Subsequently, these prediction values are used as inputs for an ensemble model, which outputs the final prediction value. The ‘MVME’ model is assessed using the O’Neil dataset, which includes 38 distinct drugs combined across 39 distinct cancer cell lines, yielding 22,737 drug combination pairs. For the synergy score value, the proposed model achieves a mean square error (MSE) of 206.57, a root mean square error (RMSE) of 14.30, and a Pearson score of 0.76. For the synergy class label, the model scores 0.90 for accuracy, 0.96 for precision, 0.57 for kappa, 0.96 for the area under the ROC curve (ROC-AUC), and 0.88 for the area under the precision-recall curve (PR-AUC).

Scientific contribution

This paper presents an enhanced synergistic drug combination model that uses four distinct drug feature views and four cancer cell line views. Each view is then fed into a multi-task deep learning model to predict the synergy score and the class label simultaneously. An ensemble model is applied to handle the challenge of managing the different views and their corresponding prediction values while avoiding overfitting.
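The view-pairing step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_predict` is a hypothetical stand-in for the multi-task attention model, and score averaging with a majority vote is assumed in place of the ensemble model, whose form the abstract does not specify.

```python
from itertools import product
from statistics import mean

DRUG_VIEWS = ["smiles", "molecular_graph", "fingerprint", "drug_target"]
CELL_VIEWS = ["gene_expression", "copy_number", "mutation", "proteomics"]


def toy_predict(drug_view: str, cell_view: str) -> tuple[float, int]:
    """Hypothetical stand-in for the multi-task model.

    Returns a (synergy score, synergy class) pair for one view pair.
    The values are arbitrary and exist only to make the sketch runnable.
    """
    score = float(len(drug_view) - len(cell_view))
    label = 1 if score > 0 else 0
    return score, label


def ensemble(predictions: list[tuple[float, int]]) -> tuple[float, int]:
    """Assumed ensemble: average the scores, majority-vote the labels."""
    scores = [score for score, _ in predictions]
    labels = [label for _, label in predictions]
    final_label = 1 if sum(labels) * 2 > len(labels) else 0
    return mean(scores), final_label


# Every drug view is paired with every cell line view: 4 x 4 = 16 inputs.
view_pairs = list(product(DRUG_VIEWS, CELL_VIEWS))
predictions = [toy_predict(d, c) for d, c in view_pairs]
final_score, final_label = ensemble(predictions)
```

Pairing all views yields sixteen predictions per drug combination and cell line, which is the count the abstract reports as input to the ensemble stage.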
Citations: 0
Computer-aided pattern scoring (C@PS): a novel cheminformatic workflow to predict ligands with rare modes-of-action
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-09-23 DOI: 10.1186/s13321-024-00901-5
Sven Marcel Stefan, Katja Stefan, Vigneshwaran Namasivayam

The identification, establishment, and exploration of potential pharmacological drug targets are major steps of the drug development pipeline. Target validation requires diverse chemical tools that come with a spectrum of functionality, e.g., inhibitors, activators, and other modulators. In particular, tools with rare modes-of-action allow for a proper kinetic and functional characterization of the targets-of-interest (e.g., channels, enzymes, receptors, or transporters). Besides, functional innovation is a prime criterion for patentability and commercial exploitation, which may lead to therapeutic benefit. Unfortunately, data on new, and thus undruggable or barely druggable, targets are scarce and mostly available for mainstream modes-of-action only (e.g., inhibition). Here we present a novel cheminformatic workflow, computer-aided pattern scoring (C@PS), which was specifically designed to project its prediction capabilities into an uncharted domain of applicability.

Scientific contribution

The presented workflow is the first to address the challenge of data scarcity, particularly for rare modes-of-action. In addition, the workflow and the associated dataset set a new standard for defining and applying criteria for the rational selection of drug candidates, addressing an important gap in cheminformatics, computational chemistry, and medicinal chemistry.
Citations: 0
EC-Conf: A ultra-fast diffusion model for molecular conformation generation with equivariant consistency
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-09-03 DOI: 10.1186/s13321-024-00893-2
Zhiguang Fan, Yuedong Yang, Mingyuan Xu, Hongming Chen

Despite recent advances in 3D molecular conformation generation driven by diffusion models, the high computational cost of the iterative diffusion/denoising process limits their application. Here, an equivariant consistency model (EC-Conf) is proposed as a fast diffusion method for low-energy conformation generation. In EC-Conf, a modified SE(3)-equivariant transformer model is used directly to encode the Cartesian molecular conformations, and a highly efficient consistency diffusion process is carried out to generate molecular conformations. With only one sampling step, the model already achieves quality comparable to that of other diffusion-based models running thousands of denoising steps. Its performance can be further improved with a few more sampling iterations. The performance of EC-Conf is evaluated on both the GEOM-QM9 and GEOM-Drugs sets. Our results demonstrate that the efficiency of EC-Conf in learning the distribution of low-energy molecular conformations is at least two orders of magnitude higher than that of current SOTA diffusion models, and that it could potentially become a useful tool for conformation generation and sampling.

Scientific contribution

In this work, we propose an equivariant consistency model that significantly improves the efficiency of diffusion-model-based conformation generation while maintaining high structural quality. The method can serve as a general framework and could be extended in future work to more complex structure generation and prediction tasks, including those involving proteins.
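The contrast between one-step sampling and "a few more sampling iterations" can be illustrated with a toy consistency sampler. This is a sketch under loud assumptions and not the EC-Conf implementation: it assumes a variance-exploding noise schedule, the data distribution is a single hypothetical point `TARGET`, and `consistency_fn` is the closed-form consistency function for that toy case rather than a trained SE(3)-equivariant network. The constants `EPS` and `T_MAX` are illustrative choices.

```python
import math
import random

EPS = 0.002    # smallest noise level, where the consistency function is the identity
T_MAX = 80.0   # largest noise level, where sampling starts
TARGET = [1.0, -2.0, 0.5]  # toy "conformation": the only data point


def consistency_fn(x, t):
    """Toy consistency function for a point-mass data distribution at TARGET.

    Maps a noisy point at noise level t to the start of its trajectory
    and satisfies the boundary condition consistency_fn(x, EPS) == x.
    """
    return [m + (EPS / t) * (xi - m) for xi, m in zip(x, TARGET)]


def sample(rng, extra_levels=()):
    """One-step sampling, optionally refined at decreasing noise levels."""
    # Start from pure noise at the largest noise level.
    x = [T_MAX * rng.gauss(0.0, 1.0) for _ in TARGET]
    # A single network evaluation already yields a sample.
    x = consistency_fn(x, T_MAX)
    # Optional refinement: re-noise to level t, then map back again.
    for t in extra_levels:
        scale = math.sqrt(t * t - EPS * EPS)
        noisy = [xi + scale * rng.gauss(0.0, 1.0) for xi in x]
        x = consistency_fn(noisy, t)
    return x


rng = random.Random(0)
one_step = sample(rng)
refined = sample(rng, extra_levels=(10.0, 1.0))
```

Each extra noise level costs one more function evaluation, so the number of evaluations is one plus the number of refinement levels, compared with the thousands of denoising steps the abstract attributes to conventional diffusion sampling.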
Citations: 0
RAIChU: automating the visualisation of natural product biosynthesis
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-09-03 DOI: 10.1186/s13321-024-00898-x
Barbara R. Terlouw, Friederike Biermann, Sophie P. J. M. Vromans, Elham Zamani, Eric J. N. Helfrich, Marnix H. Medema

Natural products are molecules that fulfil a range of important ecological functions. Many natural products have been exploited for pharmaceutical and agricultural applications. In contrast to many other specialised metabolites, the products of modular nonribosomal peptide synthetase (NRPS) and polyketide synthase (PKS) systems can often (partially) be predicted from the DNA sequence of the biosynthetic gene clusters. This is because the biosynthetic pathways of NRPS and PKS systems adhere to consistent rulesets. These universal biosynthetic rules can be leveraged to generate biosynthetic models of biosynthetic pathways. While these principles have been largely deciphered, software that leverages these rules to automatically generate visualisations of biosynthetic models has not yet been developed. To enable high-quality automated visualisations of natural product biosynthetic pathways, we developed RAIChU (Reaction Analysis through Illustrating Chemical Units), which produces depictions of biosynthetic transformations of PKS, NRPS, and hybrid PKS/NRPS systems from predicted or experimentally verified module architectures and domain substrate specificities. RAIChU also boasts a library of functions to perform and visualise reactions and pathways whose specifics (e.g., regioselectivity, stereoselectivity) are still difficult to predict, including terpenes, ribosomally synthesised and posttranslationally modified peptides and alkaloids. Additionally, RAIChU includes 34 prevalent tailoring reactions to enable the visualisation of biosynthetic pathways of fully maturated natural products. RAIChU can be integrated into Python pipelines, allowing users to upload and edit results from antiSMASH, a widely used BGC detection and annotation tool, or to build biosynthetic PKS/NRPS systems from scratch. RAIChU’s cluster drawing correctness (100%) and drawing readability (97.66%) were validated on 5000 randomly generated PKS/NRPS systems, and on the MIBiG database. The automated visualisation of these pathways accelerates the generation of biosynthetic models, facilitates the analysis of large (meta-) genomic datasets and reduces human error. RAIChU is available at https://github.com/BTheDragonMaster/RAIChU and https://pypi.org/project/raichu.

Scientific contribution

RAIChU is the first software package capable of automating high-quality visualisations of natural product biosynthetic pathways. By leveraging universal biosynthetic rules, RAIChU enables the depiction of complex biosynthetic transformations for PKS, NRPS, ribosomally synthesised and posttranslationally modified peptide (RiPP), terpene and alkaloid systems, enhancing predictive and analytical capabilities. This innovation not only streamlines the creation of biosynthetic models, making the analysis of large genomic datasets more efficient and accurate, but also bridges a crucial gap in predicting and visualising the complexities of natural product biosynthesis.
Citations: 0
Evaluating the generalizability of graph neural networks for predicting collision cross section
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-08-29 DOI: 10.1186/s13321-024-00899-w
Chloe Engler Hart, António José Preto, Shaurya Chanana, David Healey, Tobias Kind, Daniel Domingo-Fernández

Ion Mobility coupled with Mass Spectrometry (IM-MS) is a promising analytical technique that enhances molecular characterization by measuring collision cross-section (CCS) values, which are indicative of molecular size and shape. However, the effective application of CCS values in structural analysis is still constrained by the limited availability of experimental data, necessitating the development of accurate machine learning (ML) models for in silico predictions. In this study, we evaluated state-of-the-art Graph Neural Networks (GNNs), trained to predict CCS values using the largest publicly available dataset to date. Although our results confirm the high accuracy of these models within chemical spaces similar to their training environments, their performance declines significantly when applied to structurally novel regions. This discrepancy raises concerns about the reliability of in silico CCS predictions and underscores the need for releasing further publicly available CCS datasets. To mitigate this, we introduce Mol2CCS, which demonstrates how generalization can be partially improved by extending models to account for additional features such as molecular fingerprints, descriptors, and molecule types. Lastly, we also show how confidence models can help by improving the reliability of the CCS estimates.

Scientific contribution

We have benchmarked state-of-the-art graph neural networks for predicting collision cross section. Our work highlights the accuracy of these models when trained and predicted in similar chemical spaces, but also how their accuracy drops when evaluated in structurally novel regions. Lastly, we conclude by presenting potential approaches to mitigate this issue.
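One way to separate "similar chemical space" from "structurally novel regions" when evaluating a model is to bin test molecules by their nearest-neighbour similarity to the training set. The sketch below is an illustration of that idea, not the authors' protocol: fingerprints are represented as hypothetical sets of on-bit indices, Tanimoto similarity is assumed as the measure, and the `threshold` of 0.4 is an arbitrary illustrative value.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def split_by_novelty(train_fps, test_fps, threshold=0.4):
    """Split test indices by nearest-neighbour similarity to the training set.

    Returns (similar, novel): test molecules whose closest training
    molecule reaches the threshold, and those that fall below it.
    """
    similar, novel = [], []
    for idx, fp in enumerate(test_fps):
        nearest = max((tanimoto(fp, t) for t in train_fps), default=0.0)
        (similar if nearest >= threshold else novel).append(idx)
    return similar, novel


# Toy fingerprints (made-up bit indices, for illustration only).
train_fps = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5})]
test_fps = [frozenset({1, 2, 3}), frozenset({7, 8, 9}), frozenset({2, 3, 5, 6})]

similar_idx, novel_idx = split_by_novelty(train_fps, test_fps)
```

Reporting the prediction error separately for the two index lists makes a drop in accuracy on novel structures visible instead of averaging it away.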

Citations: 0