An outlier detection framework for Air Quality Index prediction using linear and ensemble models
Authors: Pradeep Kumar Dongre, Viral Patel, Upendra Bhoi, Nilesh N. Maltare
Journal: Decision Analytics Journal, volume 14, Article 100546
Published: 2025-01-16
DOI: 10.1016/j.dajour.2025.100546
URL: https://www.sciencedirect.com/science/article/pii/S2772662225000025
Citation count: 0
Abstract
The Air Quality Index (AQI) is a key indicator for assessing air quality and its associated health impacts. Accurate AQI calculations are crucial for reliable air quality assessments, but outliers in air quality data can distort these calculations, leading to inaccurate predictions. This paper presents a comprehensive framework for air quality prediction that integrates multiple outlier detection methods with machine learning models, focusing on enhancing the accuracy and robustness of predictions. The study investigates various outlier detection techniques, including the Interquartile Range (IQR), robust Z-score, and Mahalanobis distance, and evaluates their impact when integrated into machine learning models. Unlike traditional approaches that remove outliers without considering seasonal effects, this research proposes retaining extreme data points after seasonal validation to improve model generalization and prediction accuracy for unseen data. The framework is evaluated using a dataset from Jaipur city, testing multiple machine learning models, including linear regression, ensemble methods, and K-Nearest Neighbor (KNN) regression. Results show that the integrated framework significantly improves model performance, with the Extra Trees Regressor achieving the best results (MAE = 11.9161, RMSE = 16.1660, and R² = 0.8884) after refinement, compared to baseline performance (MAE = 12.6765, RMSE = 17.8452, and R² = 0.8737). This study demonstrates the empirical effectiveness of the proposed framework and provides practical guidelines for air quality prediction in real-world applications.
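The three outlier detection techniques named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the 1.5 IQR multiplier, the 3.5 robust Z-score cutoff, the 0.6745 scaling constant, and the Mahalanobis distance threshold of 3.0 are conventional choices assumed here for illustration, and the seasonal validation step described in the paper is not reproduced.

```python
import numpy as np

def iqr_flags(x, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed default)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def robust_z_flags(x, threshold=3.5):
    """Flag values whose median/MAD-based Z-score exceeds an assumed threshold."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:
        # Degenerate case: more than half the values are identical.
        return np.zeros(x.shape, dtype=bool)
    z = 0.6745 * (x - med) / mad
    return np.abs(z) > threshold

def mahalanobis_flags(X, threshold=3.0):
    """Flag rows of X whose Mahalanobis distance from the mean exceeds an assumed threshold."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    diff = X - mu
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return np.sqrt(d2) > threshold

# Synthetic readings (not data from the paper): one extreme value at the end.
readings = np.array([10, 12, 11, 13, 12, 11, 10, 12, 300.0])
print(iqr_flags(readings))
print(robust_z_flags(readings))

# Synthetic two-feature sample with one extreme row appended.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(size=(200, 2)), [[10.0, 10.0]]])
print(np.flatnonzero(mahalanobis_flags(features)))
```

The univariate checks treat each pollutant independently, whereas the Mahalanobis distance accounts for correlation between features, which is one reason a study might compare all three. Under the paper's proposal, flagged points would then be checked against seasonal patterns before any removal rather than dropped outright.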