Enhancing outlier detection in air quality index data using a stacked machine learning model

Engineering Reports (open access) | Impact factor 1.8, JCR Q3 (Computer Science, Interdisciplinary Applications) | Publication date: 2024-05-30 | DOI: 10.1002/eng2.12936

Abdoul Aziz Diallo, Lawrence Nderu, Bonface Miya Malenje, Gideon Mutie Kikuvi



Abstract

The air quality index (AQI) is a commonly used metric for evaluating air quality across locations and time periods. Like other environmental datasets, AQI data can contain outliers, that is, data points that diverge markedly from the norm and indicate exceptionally good or exceptionally poor air quality. Detecting them is crucial for identifying and understanding severe pollution episodes with far-reaching environmental and public health implications. This study uses daily air quality data collected in Shanghai, China, from January 1, 2014, to January 31, 2021, as the experimental dataset. The dataset includes daily AQI measurements along with the concentrations of six pollutants: particulate matter (PM2.5 and PM10), sulfur dioxide (SO2), nitrogen dioxide (NO2), ozone (O3), and carbon monoxide (CO). Each pollutant's concentration is measured in micrograms per cubic meter ($\mu$g/m$^{3}$). The dataset is cleaned and normalized, and K-means clustering is then applied to discover distinct patterns. A stacked ensemble machine learning model that incorporates K-means clustering, random forest (RF), and a gradient boosting classifier (GBC) is developed and compared with decision tree, support vector machine, K-nearest neighbor, and Naive Bayes algorithms. Performance in identifying outliers is evaluated using accuracy, precision, recall, and F1-score. The stacked model outperformed all of the comparison models, achieving accuracy, precision, recall, and F1-score of 0.99, 0.99, 0.97, and 0.99, respectively.
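The pipeline described in the abstract (normalize, cluster with K-means, then classify outliers with a stacked RF and GBC ensemble) can be illustrated with a minimal scikit-learn sketch. This is not the authors' implementation. The abstract does not state how outlier labels are derived or which meta-learner the stack uses, so the following are assumptions made purely for illustration: the data are synthetic (`make_synthetic_aqi` is a hypothetical helper, not the Shanghai dataset), a point is labeled an outlier when its distance to its K-means centroid exceeds the 95th percentile, the cluster id is appended as a feature, and logistic regression serves as the final estimator.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler


def make_synthetic_aqi(n_days=600, seed=0):
    """Hypothetical stand-in data: 7 columns (AQI plus six pollutants)."""
    rng = np.random.default_rng(seed)
    X = rng.gamma(shape=4.0, scale=10.0, size=(n_days, 7))
    # Inject a few extreme days so that there is something to detect.
    spikes = rng.choice(n_days, size=n_days // 20, replace=False)
    X[spikes] *= rng.uniform(3.0, 5.0, size=(spikes.size, 1))
    return X


def label_outliers(X_scaled, n_clusters=3, quantile=0.95, seed=0):
    """Assumed labeling rule: far from own K-means centroid => outlier."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    km.fit(X_scaled)
    dist = np.linalg.norm(X_scaled - km.cluster_centers_[km.labels_], axis=1)
    y = (dist > np.quantile(dist, quantile)).astype(int)
    return km.labels_, y


def run_sketch(seed=0):
    # Normalize to [0, 1], cluster, and use the cluster id as an extra feature.
    X = MinMaxScaler().fit_transform(make_synthetic_aqi(seed=seed))
    clusters, y = label_outliers(X, seed=seed)
    features = np.column_stack([X, clusters])
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, y, test_size=0.3, stratify=y, random_state=seed)

    # Stack RF and GBC; the logistic-regression meta-learner is an assumption.
    stack = StackingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=50, random_state=seed)),
            ("gbc", GradientBoostingClassifier(random_state=seed)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=3,
    )
    stack.fit(X_tr, y_tr)
    pred = stack.predict(X_te)
    return {
        "accuracy": accuracy_score(y_te, pred),
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred, zero_division=0),
        "f1": f1_score(y_te, pred, zero_division=0),
    }


if __name__ == "__main__":
    for name, value in run_sketch().items():
        print(f"{name}: {value:.2f}")
```

The scores this sketch prints come from synthetic data and say nothing about the figures reported in the abstract. For brevity the scaler and K-means are fitted on the full dataset before the split; a stricter evaluation would fit them on the training portion only.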
