IPB-MSA&SO4: a daily 0.25° resolution dataset of in situ-produced biogenic methanesulfonic acid and sulfate over the North Atlantic during 1998–2022 based on machine learning
Karam Mansour, Stefano Decesari, Darius Ceburnis, Jurgita Ovadnevaite, Lynn M. Russell, Marco Paglione, Laurent Poulain, Shan Huang, Colin O'Dowd, Matteo Rinaldi
Earth System Science Data, published 2024-06-12. DOI: https://doi.org/10.5194/essd-16-2717-2024
Abstract
Accurate, long-term records of marine-derived biogenic sulfur aerosol concentrations at high spatial and temporal resolution are critical for a wide range of studies, including climatology, trend analysis, and model evaluation. They are also needed to quantify the contribution of marine biogenic sulfur to the aerosol burden, to elucidate its radiative impacts, and to provide boundary conditions for regional models. By applying machine learning algorithms, we constructed the first publicly available daily gridded dataset of in situ-produced biogenic methanesulfonic acid (MSA) and non-sea-salt sulfate (nss-SO₄²⁻) concentrations covering the North Atlantic. The dataset has a high spatial resolution (0.25° × 0.25°) and spans 25 years (1998–2022), far exceeding the spatial and temporal coverage that observations alone could achieve. The machine learning models were built by combining in situ sulfur aerosol observations from the Mace Head Atmospheric Research Station, on the west coast of Ireland, and from the North Atlantic Aerosols and Marine Ecosystems Study (NAAMES) cruises in the northwestern Atlantic with the constructed sea-to-air dimethylsulfide flux (FDMS) and ECMWF ERA5 reanalysis datasets. To determine the optimal regression method, we tested five machine learning model types: support vector machines, decision trees, regression ensembles, Gaussian process regression, and artificial neural networks. A comparison of the mean absolute error (MAE), root-mean-square error (RMSE), and coefficient of determination (R²) revealed that Gaussian process regression (GPR) was the most effective algorithm, outperforming the other models in simulating biogenic MSA and nss-SO₄²⁻ concentrations. For daily MSA, GPR achieved the highest R² (0.86) and the lowest MAE (0.014 µg m⁻³); for daily nss-SO₄²⁻, the corresponding values were 0.72 and 0.10 µg m⁻³.
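The three scores used in the model comparison can be illustrated with a minimal sketch. It assumes the common definition R² = 1 − SS_res/SS_tot, which may differ from the definition the authors used (for example, a squared correlation coefficient). The observed and predicted values are made-up numbers for illustration only, not data from the paper, and the function names are hypothetical.

```python
import math

def mae(observed, predicted):
    """Mean absolute error, in the units of the inputs (e.g. µg m^-3)."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root-mean-square error, in the units of the inputs."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(observed, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical daily MSA concentrations (µg m^-3); NOT values from the paper.
obs = [0.02, 0.05, 0.10, 0.20]
pred = [0.03, 0.04, 0.12, 0.18]

scores = {"MAE": mae(obs, pred), "RMSE": rmse(obs, pred), "R2": r_squared(obs, pred)}
```

Ranking candidate models then amounts to computing these scores on the same held-out observations for each model and preferring low MAE and RMSE together with high R².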
GPR partial dependence analysis suggests that the relationships between the predictors and the MSA and nss-SO₄²⁻ concentrations are complex rather than linear. Using the GPR algorithm, we produced a high-resolution daily dataset of in situ-produced biogenic MSA and nss-SO₄²⁻ sea-level concentrations over the North Atlantic, which we named “In-situ Produced Biogenic Methanesulfonic Acid and Sulfate over the North Atlantic” (IPB-MSA&SO4). The IPB-MSA&SO4 data allowed us to analyze the spatiotemporal patterns of MSA and nss-SO₄²⁻ as well as their ratio (MSA : nss-SO₄²⁻). A comparison with the existing Copernicus Atmosphere Monitoring Service ECMWF Atmospheric Composition Reanalysis 4 (CAMS-EAC4) suggested that our high-resolution dataset reproduces the spatial and temporal patterns of the biogenic sulfur aerosol concentration with high accuracy and is highly consistent with independent measurements in the Atlantic Ocean. IPB-MSA&SO4 is publicly available at https://doi.org/10.17632/j8bzd5dvpx.1 (Mansour et al., 2023b).
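To give a feel for the dataset dimensions described above, the following sketch counts the daily time steps in 1998–2022, maps a coordinate to a 0.25° grid cell, and forms the MSA : nss-SO₄²⁻ ratio for one cell. The domain corner (`lat_min`, `lon_min`), the function names, and the concentration values are assumptions made for illustration; the actual domain bounds and file layout of IPB-MSA&SO4 are not given in the abstract.

```python
import math
from datetime import date

GRID_STEP_DEG = 0.25  # spatial resolution stated in the abstract

def n_daily_steps(first_year=1998, last_year=2022):
    """Number of calendar days from 1 Jan of first_year to 31 Dec of last_year."""
    return (date(last_year, 12, 31) - date(first_year, 1, 1)).days + 1

def cell_index(lat, lon, lat_min, lon_min, step=GRID_STEP_DEG):
    """Row/column of the grid cell containing (lat, lon).

    lat_min and lon_min are the south-west corner of a HYPOTHETICAL domain.
    """
    row = math.floor((lat - lat_min) / step)
    col = math.floor((lon - lon_min) / step)
    return row, col

def msa_to_sulfate_ratio(msa, nss_so4):
    """MSA : nss-sulfate ratio, or None where it is undefined."""
    if msa is None or nss_so4 is None or nss_so4 <= 0:
        return None
    return msa / nss_so4

days = n_daily_steps()

# Approximate Mace Head coordinates placed in an assumed 40N, 60W-cornered domain.
row, col = cell_index(53.33, -9.90, lat_min=40.0, lon_min=-60.0)

# Made-up concentrations in µg m^-3, for illustration only.
ratio = msa_to_sulfate_ratio(0.03, 0.30)
```

Returning `None` for a non-positive sulfate value is one possible design choice; a masked array or NaN would serve the same purpose when working with full grids.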
Earth System Science Data · GEOSCIENCES, MULTIDISCIPLINARY · METEOROLOGY & ATMOSPHERIC SCIENCES
CiteScore
18.00
Self-citation rate
5.30%
Articles published
231
Review time
35 weeks
About the journal:
Earth System Science Data (ESSD) is an international, interdisciplinary journal that publishes articles on original research data in order to promote the reuse of high-quality data in the field of Earth system sciences. The journal welcomes submissions of original data or data collections that meet the required quality standards and have the potential to contribute to the goals of the journal. It includes sections dedicated to regular-length articles, brief communications (such as updates to existing data sets), commentaries, review articles, and special issues. ESSD is abstracted and indexed in several databases, including Science Citation Index Expanded, Current Contents/PCE, Scopus, ADS, CLOCKSS, CNKI, DOAJ, EBSCO, Gale/Cengage, GoOA (CAS), and Google Scholar, among others.