Enhancing outlier detection in air quality index data using a stacked machine learning model

Engineering reports : open access (IF 1.8, JCR Q3, Computer Science, Interdisciplinary Applications) | Pub Date: 2024-05-30 | DOI: 10.1002/eng2.12936
Abdoul Aziz Diallo, Lawrence Nderu, Bonface Miya Malenje, Gideon Mutie Kikuvi

Abstract


The air quality index (AQI) is a widely used metric for evaluating air quality across locations and time periods. Like other environmental datasets, AQI data can contain outliers: data points that diverge markedly from the norm and indicate exceptionally good or exceptionally poor air quality. Detecting them is crucial for identifying and understanding severe pollution episodes with far-reaching environmental and public health implications. This study uses daily air quality data collected in Shanghai, China, from January 1, 2014, to January 31, 2021, as the experimental dataset. The dataset includes daily AQI measurements along with the concentrations of six pollutants: particulate matter (PM2.5 and PM10), sulfur dioxide (SO2), nitrogen dioxide (NO2), ozone (O3), and carbon monoxide (CO). Each pollutant's concentration is measured in micrograms per cubic meter (μg/m³). The dataset is first cleaned and normalized, and K-means clustering is then applied to discover distinct patterns. A stacked ensemble machine learning model that incorporates K-means clustering, random forest (RF), and a gradient boosting classifier (GBC) is developed and compared with decision tree, support vector machine, K-nearest neighbor, and Naive Bayes algorithms, using accuracy, precision, recall, and F1-score to evaluate its performance in identifying outliers. The stacked model outperformed all of the other models, with accuracy, precision, recall, and F1-score of 0.99, 0.99, 0.97, and 0.99, respectively.
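
The pipeline described in the abstract (normalize, cluster with K-means, train a stacked RF + GBC model, compare against four baselines on four metrics) could be sketched roughly as follows with scikit-learn. This is an illustrative sketch, not the authors' implementation. Several pieces are assumptions that the abstract does not specify: the data are synthetic log-normal values standing in for the Shanghai records, outlier labels come from a hypothetical rule (the 5% of points farthest from their nearest K-means centroid), the number of clusters is 3, and the meta-learner is logistic regression. The helper names `make_synthetic_data`, `label_outliers_with_kmeans`, `evaluate`, and `run_sketch` are invented for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def make_synthetic_data(n_days=1000, seed=0):
    """Synthetic stand-in for daily AQI + six pollutant columns (not real data)."""
    rng = np.random.default_rng(seed)
    # Log-normal values mimic the right-skewed shape of concentration data.
    return rng.lognormal(mean=3.5, sigma=0.5, size=(n_days, 7))


def label_outliers_with_kmeans(X_scaled, n_clusters=3, quantile=0.95, seed=0):
    """Assumed labeling rule: flag points far from their nearest centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_scaled)
    nearest_dist = km.transform(X_scaled).min(axis=1)  # distance to closest centroid
    return (nearest_dist > np.quantile(nearest_dist, quantile)).astype(int)


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit one model and report the four metrics named in the abstract."""
    pred = model.fit(X_train, y_train).predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }


def run_sketch(seed=0):
    # Normalize to [0, 1], then derive outlier labels from K-means distances.
    X = MinMaxScaler().fit_transform(make_synthetic_data(seed=seed))
    y = label_outliers_with_kmeans(X, seed=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )

    stacked = StackingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=50, random_state=seed)),
            ("gbc", GradientBoostingClassifier(n_estimators=50, random_state=seed)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),  # assumed meta-learner
        cv=5,
    )
    models = {
        "stacked": stacked,
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        "svm": SVC(),
        "knn": KNeighborsClassifier(),
        "naive_bayes": GaussianNB(),
    }
    return {
        name: evaluate(model, X_train, X_test, y_train, y_test)
        for name, model in models.items()
    }


if __name__ == "__main__":
    for name, scores in run_sketch().items():
        print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Because the data and the labeling rule here are synthetic, the scores this sketch prints say nothing about the results reported in the paper; it only shows how the stages fit together. Outliers are a small minority class, so accuracy alone can look high even for a model that rarely flags anything, which is one reason to read precision, recall, and F1-score alongside it.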
