Title: A novel ensemble machine learning exposure model system for ground-level ozone at the national scale: A case of mainland China from 2013 to 2020
Authors: Jiawei Wang
Journal: Environmental Impact Assessment Review, volume 109, Article 107630 (impact factor 9.8; JCR Q1, Environmental Studies)
Publication date: 2024-08-20
Publication type: Journal Article
DOI: 10.1016/j.eiar.2024.107630
URL: https://www.sciencedirect.com/science/article/pii/S0195925524002178
Open access: no
Citation count: 0
Abstract
In epidemiological research, accurate estimation of historical ground-level ozone (O₃) concentrations at high spatiotemporal resolution is crucial for effective exposure assessment. The current state of the art for estimating air pollutant concentrations is a two-stage ensemble method that integrates outputs from multiple machine learning algorithms. Despite its effectiveness, this approach can be refined for more precise O₃ estimation. In this study, we propose an enhanced ensemble method that incorporates four key strategies. First, we employ high-resolution spatiotemporal predictors derived from prior machine learning studies for refined secondary learning. Second, we use sophisticated algorithms as sublearners to enhance learning capabilities: categorical gradient boosting, deep neural network, random forest, stochastic variational Gaussian process, transformer, and a combination of convolutional neural network and long short-term memory neural network. Third, we split the sample set spatiotemporally and train submodels separately on each subset to eliminate unobserved spatiotemporal heterogeneity. Finally, we integrate the sublearner predictions with a complex machine learning algorithm rather than a generalized additive model, which captures intricate nonlinear relationships beyond basic spatiotemporal linear weights. To validate these improvements, we estimated daily maximum 8-h moving average O₃ concentrations ([O₃]MDA8) across mainland China from 2013 to 2020 at a 1 km spatial resolution. The proposed method achieved an out-of-station coefficient of determination (R²) of 0.943 and a root-mean-square error (RMSE) of 10.197 μg/m³. This performance marks a nearly 15% improvement over the best existing Chinese O₃ exposure model based on a single algorithm and also surpasses previous studies that used traditional ensemble methods for other air pollutants. Our enhanced ensemble approach significantly bolsters the reliability and robustness of future environmental epidemiological studies by further mitigating “misclassification” errors.
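The out-of-station R² and RMSE quoted in the abstract can be illustrated with a minimal sketch. It assumes that "out-of-station" means every monitoring station is held out as a whole, so the model is scored only at locations it never saw during training; the abstract does not spell out the fold protocol, so this is an assumption. The data are synthetic, the one-variable least-squares line is only a stand-in for the paper's ensemble, and all names (`station_folds`, `fit_line`, `out_of_station_scores`) are hypothetical.

```python
import math
import random
from statistics import mean


def rmse(y_true, y_pred):
    """Root-mean-square error, in the same units as y."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def station_folds(station_ids, k, seed=0):
    """Yield (train_idx, test_idx); each station lies entirely in one test fold."""
    stations = sorted(set(station_ids))
    random.Random(seed).shuffle(stations)
    for f in range(k):
        held_out = set(stations[f::k])
        test = [i for i, s in enumerate(station_ids) if s in held_out]
        train = [i for i, s in enumerate(station_ids) if s not in held_out]
        yield train, test


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (stand-in for a real learner)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b * x_bar, b


def out_of_station_scores(x, y, station_ids, k=5):
    """Pool predictions made only for held-out stations, then score them."""
    y_true, y_pred = [], []
    for train, test in station_folds(station_ids, k):
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        y_true += [y[i] for i in test]
        y_pred += [a + b * x[i] for i in test]
    return r2(y_true, y_pred), rmse(y_true, y_pred)


# Synthetic example: 20 stations, 30 days each. Ozone rises with temperature,
# and each station has its own offset that a held-out test cannot learn.
rng = random.Random(42)
station_ids, x, y = [], [], []
for s in range(20):
    offset = rng.gauss(0, 5)
    for _ in range(30):
        temp = rng.uniform(0, 35)
        station_ids.append(s)
        x.append(temp)
        y.append(40 + 2 * temp + offset + rng.gauss(0, 8))

score_r2, score_rmse = out_of_station_scores(x, y, station_ids)
print(f"out-of-station R2={score_r2:.3f}, RMSE={score_rmse:.3f}")
```

Grouping the split by station rather than by individual observation matters because a random split places days from the same station in both training and test sets, which tends to overstate accuracy at unmonitored locations.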
About the journal:
Environmental Impact Assessment Review is an interdisciplinary journal that serves a global audience of practitioners, policymakers, and academics involved in assessing the environmental impact of policies, projects, processes, and products. The journal focuses on innovative theory and practice in environmental impact assessment (EIA). Papers are expected to present innovative ideas and to be topical and coherent. The journal emphasizes concepts, methods, techniques, approaches, and systems related to EIA theory and practice.