Composite machine learning strategy for natural products taxonomical classification and structural insights†

Digital Discovery | Pub Date: 2024-09-23 | DOI: 10.1039/D4DD00155A | IF 6.2 | JCR Q1 (Chemistry, Multidisciplinary)
Qisong Xu, Alan K. X. Tan, Liangfeng Guo, Yee Hwee Lim, Dillon W. P. Tay and Shi Jun Ang

Abstract

Taxonomical classification of natural products (NPs) can assist in genomic and phylogenetic analysis of source organisms and facilitate streamlining of bioprospecting efforts. Here, a composite machine learning strategy marrying graph convolutional neural networks (GCNNs) and eXtreme Gradient Boosting (XGB) is proposed and validated for taxonomical classification of NPs in five kingdoms (Animalia, Bacteria, Chromista, Fungi, and Plantae). Our composite model, trained on 133 092 NPs from the LOTUS database, achieved five-fold cross-validated classification accuracy of 97.4%. When employed to classify out-of-sample NPs from the NP Atlas database, accuracies of 82.8% for bacteria and 86.6% for fungi were obtained. Dimensionality-reduced representations of the molecular embeddings from our composite model revealed distinct clusters of NPs that suggest a basis for enhanced classification performance. The top critical substructures from the NPs of each kingdom were also identified and compared to provide insights on structure–taxonomy relationships. Overall, this study showcases the potential of composite machine learning models for robust taxonomical classification of NPs, which can streamline discovery of NPs.
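
The evaluation protocol described in the abstract (a second-stage classifier trained on molecular embeddings and scored by five-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation. The embeddings are synthetic stand-ins for GCNN outputs, the nearest-centroid classifier is a deliberately simple stand-in for XGB, and every function name below is hypothetical. Only the Python standard library is used so that the sketch stays self-contained.

```python
import random
from collections import Counter, defaultdict

KINGDOMS = ["Animalia", "Bacteria", "Chromista", "Fungi", "Plantae"]


def make_toy_embeddings(n_per_class=40, seed=0):
    """Synthetic stand-in for stage-1 (GCNN) embeddings: one noisy cluster per kingdom."""
    rng = random.Random(seed)
    embeddings, labels = [], []
    for class_id, kingdom in enumerate(KINGDOMS):
        for _ in range(n_per_class):
            # Shift the mean along one dimension per class so clusters are separable.
            vec = [rng.gauss(3.0 if d == class_id else 0.0, 1.0)
                   for d in range(len(KINGDOMS))]
            embeddings.append(vec)
            labels.append(kingdom)
    return embeddings, labels


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def fit_centroids(embeddings, labels):
    """Stage-2 stand-in: one mean vector per class (NOT gradient boosting)."""
    sums, counts = {}, Counter()
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def cross_validated_accuracy(embeddings, labels, k=5, seed=0):
    """Mean held-out accuracy over k stratified folds."""
    folds = stratified_folds(labels, k, seed)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        centroids = fit_centroids([embeddings[i] for i in train_idx],
                                  [labels[i] for i in train_idx])
        correct = sum(predict(centroids, embeddings[i]) == labels[i]
                      for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k


if __name__ == "__main__":
    X, y = make_toy_embeddings()
    print(f"5-fold CV accuracy on toy data: {cross_validated_accuracy(X, y):.3f}")
```

Stratifying the folds keeps each kingdom represented in every held-out split, which matters when class sizes differ. In a real pipeline the toy embeddings and the centroid classifier would be replaced by a trained graph network and a gradient-boosting classifier, and the numbers produced here say nothing about the accuracies reported in the paper.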
