Bin Yuan, Chen Zhang, Chaoqun Ji, Gefeng Liu, Xiaoyi Li, Shanshan Gong, Xiangsheng Huang, Aijin Shen, Xiaonong Li, Yanfang Liu
HSQCid: A Powerful Tool for Paving the Way to High-Throughput Structural Dereplication of Natural Products Based on Fast NMR Experiments
Analytical Chemistry (impact factor 6.7, JCR Q1, Chemistry, Analytical). Published 2025-02-05. DOI: 10.1021/acs.analchem.4c03102 (https://doi.org/10.1021/acs.analchem.4c03102)
Citation count: 0
Abstract
Structural dereplication is an essential step in the study of natural products (NPs). So many NPs have already been discovered that efficient dereplication is highly desirable. NMR spectroscopy remains the gold standard for structural identification. The ¹³C NMR spectrum is an effective molecular fingerprint, but its acquisition is time-consuming, especially for mass-limited NPs. Several alternative methods and tools have been proposed, but none has reached general use. Here, a new artificial intelligence tool, HSQCid, which uses contrastive learning between ¹H–¹³C HSQC spectra and structures, is proposed for effective structural identification. Two structure encoders are compared, and the graph neural network is preferred over the Transformer. About 400,000 predicted data points are split 80%/20% between training and testing. With 17,971 experimental data points as an external test set, the top-1 and top-10 accuracies reach 74.5% and 94.8%, respectively. Top-1 accuracy increases by at least 12% when the spectra are combined with other easily obtainable structure features, such as the total number of hydrogens connected to carbons, obtained from ¹H NMR spectra. Further data analysis shows that filtering by structure features nearly eliminates the influence (>10%) of the difference between predicted and experimental data. Surprisingly, the number or ratio of nonprotonated carbons significantly affects identification accuracy only in specific and rare cases (2.65%). Furthermore, the benchmark method, which matches ¹³C peaks, is markedly inferior to the proposed method, with or without structural features. The HSQCid code is available online. HSQCid is expected to help pave the way for high-throughput or highly effective structural dereplication of NPs.
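The retrieval and scoring idea in the abstract can be illustrated with a minimal sketch. It is not the HSQCid implementation: it assumes a spectrum encoder and a structure encoder have already mapped queries and library entries into a shared embedding space, and it shows only how candidates might be ranked by cosine similarity, optionally filtered by the count of carbon-bound hydrogens, and scored with top-k accuracy. The function names, tuple layout, and toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_candidates(query_vec, library, ch_hydrogens=None):
    """Return candidate names sorted by descending similarity to the query.

    library: list of (name, embedding, n_ch_hydrogens) tuples (hypothetical layout).
    ch_hydrogens: if given, keep only candidates with that many carbon-bound hydrogens.
    """
    pool = [c for c in library if ch_hydrogens is None or c[2] == ch_hydrogens]
    ranked = sorted(pool, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [name for name, _, _ in ranked]

def top_k_accuracy(queries, library, k, use_filter=False):
    """Fraction of queries whose true structure appears among the first k candidates.

    queries: list of (true_name, embedding, n_ch_hydrogens) tuples.
    """
    hits = 0
    for true_name, vec, n_h in queries:
        ranked = rank_candidates(vec, library, n_h if use_filter else None)
        hits += true_name in ranked[:k]
    return hits / len(queries)

# Toy embeddings, invented purely for illustration.
library = [
    ("A", [1.0, 0.0], 10),
    ("B", [0.9, 0.1], 12),
    ("C", [0.0, 1.0], 10),
]
queries = [
    ("A", [1.0, 0.05], 10),
    ("B", [1.0, 0.0], 12),  # embedding lies closer to A, but the hydrogen count is 12
]

print(top_k_accuracy(queries, library, k=1))                   # 0.5
print(top_k_accuracy(queries, library, k=1, use_filter=True))  # 1.0
print(top_k_accuracy(queries, library, k=2))                   # 1.0
```

In this toy case the second query is misranked on similarity alone, and the hydrogen-count filter removes the competing candidate. That is the kind of effect the abstract attributes to structure-feature filters, though the sizes of the real gains are those reported by the authors, not anything this sketch demonstrates.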
About the journal:
Analytical Chemistry, a peer-reviewed research journal, focuses on disseminating new and original knowledge across all branches of analytical chemistry. Fundamental articles may explore general principles of chemical measurement science and need not directly address existing or potential analytical methodology. They can be entirely theoretical or report experimental results. Contributions may cover various phases of analytical operations, including sampling, bioanalysis, electrochemistry, mass spectrometry, microscale and nanoscale systems, environmental analysis, separations, spectroscopy, chemical reactions and selectivity, instrumentation, imaging, surface analysis, and data processing. Papers discussing known analytical methods should present a significant, original application of the method, a notable improvement, or results on an important analyte.