PAH-Finder: A Pattern Recognition Workflow for Identification of PAHs and Their Derivatives
Authors: Zixuan Zhang, Xin Xu, Shipei Xing, Changzhi Shi, Zecang You, Xiaojun Deng, Ling Tan, Zhe Mo, Mingliang Fang
Journal: Analytical Chemistry (impact factor 6.7; JCR Q1, Chemistry, Analytical)
Published: 2025-01-07
DOI: 10.1021/acs.analchem.4c04249 (https://doi.org/10.1021/acs.analchem.4c04249)
Citations: 0
Abstract
Polycyclic aromatic hydrocarbons (PAHs) are pervasive environmental pollutants that pose significant health risks because of their carcinogenic, mutagenic, and teratogenic properties. Traditional methods for PAH identification rely primarily on gas chromatography–mass spectrometry (GC–MS) and combine spectral library searches with other techniques, such as mass defect analysis. However, these methods are limited by incomplete spectral libraries and a high false-positive rate. Here, we present PAH-Finder, a data-driven workflow that integrates machine learning with high-resolution mass spectrometry (HRMS). PAH-Finder introduces a novel approach to evaluating the fragment distribution of PAH backbones in MS spectra: fragment m/z values are normalized to a 0–100% range relative to the molecular ion peak. Seven machine learning features capture PAH fragmentation characteristics, and a random forest model trained on 98 PAH spectra and 1003 background spectra achieved an F1 score of ∼0.9 in 5-fold cross-validation. Additionally, PAH-Finder uses the presence of doubly charged fragments and molecular formula prediction to improve identification accuracy. In a case study, PAH-Finder identified 135 PAHs, including 7 types of previously unreported PAH formulas in particulate matter samples, demonstrating a 246% increase in annotation efficiency compared to the NIST20 library search. It also identified 32 heteroatom-doped PAHs not included in the training data set, demonstrating its ability to generalize. PAH-Finder's high accuracy in detecting a broad spectrum of PAHs facilitates efficient data processing and interpretation for nontargeted analysis, supporting both our understanding of air pollution and public health protection. PAH-Finder is freely available on GitHub (https://github.com/FangLabNTU/PAH-Finder).
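The normalization step described in the abstract (fragment m/z expressed as a percentage of the molecular ion m/z) can be illustrated with a minimal Python sketch. This is not the PAH-Finder implementation: the function names `normalize_mz` and `fragment_distribution`, the equal-width binning, and the choice to ignore peaks above the molecular ion are all assumptions made for illustration, and the binned intensity fractions are only a stand-in for the seven features the authors use, which the abstract does not specify.

```python
from typing import List, Sequence


def normalize_mz(mz_values: Sequence[float], molecular_ion_mz: float) -> List[float]:
    """Express each fragment m/z as a percentage of the molecular ion m/z."""
    if molecular_ion_mz <= 0:
        raise ValueError("molecular_ion_mz must be positive")
    return [100.0 * mz / molecular_ion_mz for mz in mz_values]


def fragment_distribution(
    mz_values: Sequence[float],
    intensities: Sequence[float],
    molecular_ion_mz: float,
    n_bins: int = 10,
) -> List[float]:
    """Fraction of total intensity falling in each equal-width bin of the
    0-100% normalized m/z range (hypothetical feature, for illustration).

    Peaks above the molecular ion (for example isotope peaks) are ignored;
    this is an assumption of the sketch, not a rule taken from the paper.
    """
    if len(mz_values) != len(intensities):
        raise ValueError("mz_values and intensities must have the same length")
    bin_width = 100.0 / n_bins
    binned = [0.0] * n_bins
    for pct, intensity in zip(normalize_mz(mz_values, molecular_ion_mz), intensities):
        if pct < 0.0 or pct > 100.0:
            continue
        # The molecular ion itself (100%) is placed in the last bin.
        index = min(int(pct // bin_width), n_bins - 1)
        binned[index] += intensity
    total = sum(binned)
    return [value / total for value in binned] if total > 0 else binned


if __name__ == "__main__":
    # Made-up peak list for a spectrum with a molecular ion at m/z 202.078.
    mz = [88.031, 100.031, 101.039, 200.062, 202.078]
    inten = [5.0, 12.0, 20.0, 25.0, 100.0]
    print(normalize_mz(mz, 202.078))
    print(fragment_distribution(mz, inten, 202.078))
```

Because the normalized scale is independent of molecular mass, spectra of PAHs of different sizes become directly comparable, which is presumably why such a representation suits a single classifier trained across many compounds.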
About the journal:
Analytical Chemistry, a peer-reviewed research journal, focuses on disseminating new and original knowledge across all branches of analytical chemistry. Fundamental articles may explore general principles of chemical measurement science and need not directly address existing or potential analytical methodology. They can be entirely theoretical or report experimental results. Contributions may cover various phases of analytical operations, including sampling, bioanalysis, electrochemistry, mass spectrometry, microscale and nanoscale systems, environmental analysis, separations, spectroscopy, chemical reactions and selectivity, instrumentation, imaging, surface analysis, and data processing. Papers discussing known analytical methods should present a significant, original application of the method, a notable improvement, or results on an important analyte.