HSQCid: A Powerful Tool for Paving the Way to High-Throughput Structural Dereplication of Natural Products Based on Fast NMR Experiments

IF 6.7 | Zone 1, Chemistry | Q1, Chemistry, Analytical | Analytical Chemistry | Pub Date: 2025-02-05 | DOI: 10.1021/acs.analchem.4c03102
Bin Yuan, Chen Zhang, Chaoqun Ji, Gefeng Liu, Xiaoyi Li, Shanshan Gong, Xiangsheng Huang, Aijin Shen, Xiaonong Li, Yanfang Liu
Citations: 0

Abstract

Structural dereplication is an essential step in the study of natural products (NPs). So many NPs have already been discovered that efficient dereplication is highly desirable. NMR spectroscopy remains the gold standard for structural identification. ¹³C NMR spectra are effective molecular fingerprints, but their acquisition is time-consuming, especially for mass-limited NPs. Several alternative methods and tools have been proposed, but for various reasons none has reached general use. Here, a new artificial intelligence tool, HSQCid, which uses contrastive learning between ¹H–¹³C HSQC spectra and structures, is proposed for effective structural identification. Two structure encoders are compared, and the graph neural network is preferred over the Transformer. In this setup, 80% and 20% of about 400,000 predicted data could be used for training and testing, respectively. In addition, with 17,971 experimental data as an external test set, the top-1 and top-10 accuracies reach 74.5% and 94.8%, respectively. Top-1 accuracy increases by at least 12% when HSQCid is combined with other easily obtainable structure features, such as the total number of hydrogens connected to carbons, obtained from ¹H NMR spectra. Further data analysis shows that filtering by structure features nearly eliminates the influence (>10%) of the difference between predicted and experimental data. Surprisingly, the number or ratio of nonprotonated carbons significantly affects identification accuracy only in specific and rare cases (2.65%). Furthermore, a benchmark method that matches ¹³C peaks is compared and found to be markedly inferior to the proposed method, with or without structural features. The HSQCid code is available online. HSQCid is expected to help pave the way for high-throughput or highly effective structural dereplication of NPs.
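The retrieval and filtering steps the abstract describes can be illustrated with a small sketch. It is not the HSQCid implementation: the abstract does not specify the similarity measure, the embedding size, or how the filter is applied, so cosine similarity, the two-dimensional toy embeddings, the exact-match hydrogen-count filter, and every function and variable name below are assumptions made for illustration. The sketch ranks candidate structures against a spectrum embedding, optionally drops candidates whose carbon-bound hydrogen count disagrees with the query, and computes top-k accuracy.

```python
import math
from typing import List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(
    query_emb: Sequence[float],
    cand_embs: Sequence[Sequence[float]],
    cand_h_counts: Optional[Sequence[int]] = None,
    query_h_count: Optional[int] = None,
) -> List[int]:
    """Return candidate indices sorted by decreasing similarity to the query.

    If both hydrogen-count arguments are given, candidates whose number of
    carbon-bound hydrogens differs from the query's are dropped first
    (a hypothetical exact-match filter).
    """
    indices = list(range(len(cand_embs)))
    if cand_h_counts is not None and query_h_count is not None:
        indices = [i for i in indices if cand_h_counts[i] == query_h_count]
    return sorted(indices, key=lambda i: cosine(query_emb, cand_embs[i]), reverse=True)


def top_k_accuracy(rankings: Sequence[Sequence[int]], true_indices: Sequence[int], k: int) -> float:
    """Fraction of queries whose true structure appears in the first k ranks."""
    hits = sum(1 for ranked, true in zip(rankings, true_indices) if true in ranked[:k])
    return hits / len(true_indices)


# Toy data (invented): three candidate structure embeddings and their
# carbon-bound hydrogen counts. The true match for the query is candidate 1.
cand_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
cand_h_counts = [10, 12, 12]
query_emb = [1.0, 0.05]
query_h_count = 12
true_index = 1

unfiltered = rank_candidates(query_emb, cand_embs)
filtered = rank_candidates(query_emb, cand_embs, cand_h_counts, query_h_count)

top1_unfiltered = top_k_accuracy([unfiltered], [true_index], k=1)
top1_filtered = top_k_accuracy([filtered], [true_index], k=1)
```

In this toy case the unfiltered ranking puts a near-identical decoy ahead of the true structure, while the hydrogen-count filter removes the decoy and the true structure moves to rank 1. This mirrors, in miniature only, the abstract's statement that combining the ranking with easily obtainable structure features raises top-1 accuracy.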

Source journal
Analytical Chemistry (Chemistry, Analytical Chemistry)
CiteScore: 12.10
Self-citation rate: 12.20%
Articles published: 1949
Review time: 1.4 months
Journal description: Analytical Chemistry, a peer-reviewed research journal, focuses on disseminating new and original knowledge across all branches of analytical chemistry. Fundamental articles may explore general principles of chemical measurement science and need not directly address existing or potential analytical methodology. They can be entirely theoretical or report experimental results. Contributions may cover various phases of analytical operations, including sampling, bioanalysis, electrochemistry, mass spectrometry, microscale and nanoscale systems, environmental analysis, separations, spectroscopy, chemical reactions and selectivity, instrumentation, imaging, surface analysis, and data processing. Papers discussing known analytical methods should present a significant, original application of the method, a notable improvement, or results on an important analyte.