Application of the Lasso regularisation technique in mitigating overfitting in air quality prediction models.

IF 3.9 | Zone 2, Multidisciplinary Journals | Q1, MULTIDISCIPLINARY SCIENCES | Scientific Reports | Pub Date: 2025-01-02 | DOI: 10.1038/s41598-024-84342-y
Abbas Pak, Abdullah Kaviani Rad, Mohammad Javad Nematollahi, Mohammadreza Mahmoudi
Scientific Reports, vol. 15, no. 1, p. 547. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11696743/pdf/

Abstract

Air pollution is a significant global concern that poses serious challenges to public health and ecological sustainability. Forecasting and mitigating its impacts requires accurate algorithms, which has led to the development of many machine learning (ML)-based models for predicting air quality. Overfitting, however, is a prevalent problem in ML algorithms that reduces their efficacy and generalisability. Using an extensive dataset from 16 sensors in Tehran, Iran, covering 2013 to 2023, this study applies the Least Absolute Shrinkage and Selection Operator (Lasso) regularisation technique to improve the forecasting accuracy of models of ambient air pollutant concentrations, namely particulate matter (PM2.5 and PM10), CO, NO2, SO2, and O3, while reducing overfitting. The outputs were compared using R-squared (R²), mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE), and normalised mean square error (NMSE). Preliminary findings show that Lasso markedly improves model reliability by reducing overfitting and identifying key features; even so, the model's performance for gaseous pollutants remained unsatisfactory compared with PM (R² = 0.80 for PM2.5, 0.75 for PM10, 0.45 for CO, 0.55 for NO2, 0.65 for SO2, and 0.35 for O3). The low proportion of missing data presumably explains the strong performance of the PM models, whereas the high variability of the gases and their chemical interactions, together with the inherent characteristics of the model, were the primary factors behind the poor performance for gaseous pollutants. Nevertheless, the successful use of Lasso regularisation to mitigate overfitting and select the more important features makes it highly recommended for air quality forecasting models.
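
As an illustration of the technique the abstract names, the following is a minimal sketch of Lasso fitted by cyclic coordinate descent, followed by the five reported indices. It is not the authors' implementation, which the abstract does not describe. The data are synthetic stand-ins (six features, only two informative), the function names `fit_lasso`, `predict`, `regression_metrics`, and `make_data` are hypothetical, and NMSE is assumed here to be MSE divided by the variance of the observations, since the abstract does not state which normalisation was used.

```python
import math
import random


def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def fit_lasso(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for
    (1 / (2n)) * sum_i (y_i - b0 - x_i . w)^2 + lam * sum_j |w_j|.
    The intercept b0 is not penalised."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b0 = sum(y) / n
    resid = [yi - b0 for yi in y]  # residuals under the current (b0, w)
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero feature carries no information
            # Covariance of feature j with the residual that ignores feature j
            rho = sum(X[i][j] * (resid[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_wj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                w[j] = new_wj
        shift = sum(resid) / n  # re-centre residuals to update the intercept
        b0 += shift
        resid = [r - shift for r in resid]
    return b0, w


def predict(X, b0, w):
    """Linear prediction b0 + x . w for every row of X."""
    return [b0 + sum(xj * wj for xj, wj in zip(row, w)) for row in X]


def regression_metrics(y_true, y_pred):
    """R2, MAE, MSE, RMSE, and NMSE (assumed: MSE / variance of y_true)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    mse = sum((t - q) ** 2 for t, q in zip(y_true, y_pred)) / n
    mae = sum(abs(t - q) for t, q in zip(y_true, y_pred)) / n
    var_y = sum((t - mean_y) ** 2 for t in y_true) / n
    return {"R2": 1.0 - mse / var_y, "MAE": mae, "MSE": mse,
            "RMSE": math.sqrt(mse), "NMSE": mse / var_y}


def make_data(n, rng):
    """Synthetic stand-in data: 6 features, only the first two drive the target."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(n)]
    y = [3.0 * r[0] - 2.0 * r[1] + rng.gauss(0.0, 0.5) for r in X]
    return X, y


rng = random.Random(0)
X_train, y_train = make_data(200, rng)
X_test, y_test = make_data(100, rng)

b0, w = fit_lasso(X_train, y_train, lam=0.25)
train_metrics = regression_metrics(y_train, predict(X_train, b0, w))
test_metrics = regression_metrics(y_test, predict(X_test, b0, w))

print("coefficients:", [round(c, 3) for c in w])
print("train:", train_metrics)
print("test: ", test_metrics)
```

Comparing the train and test rows is the usual way to spot overfitting: a large gap between them suggests the model has memorised the training data. Increasing `lam` drives more coefficients to exactly zero, which is the feature-selection effect the abstract refers to. The sketch stays dependency-free for portability; in practice a library routine such as scikit-learn's `Lasso`, whose `alpha` parameter plays the role of `lam` in the same objective, would normally be used.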

Source journal: Scientific Reports (Natural Science Disciplines)
CiteScore: 7.50
Self-citation rate: 4.30%
Articles published: 19567
Review time: 3.9 months
About the journal: We publish original research from all areas of the natural sciences, psychology, medicine and engineering. You can learn more about what we publish by browsing our specific scientific subject areas below or explore Scientific Reports by browsing all articles and collections. Scientific Reports has a 2-year impact factor of 4.380 (2021) and is the 6th most-cited journal in the world, with more than 540,000 citations in 2020 (Clarivate Analytics, 2021).
• Engineering: Engineering covers all aspects of engineering, technology, and applied science. It plays a crucial role in the development of technologies to address some of the world's biggest challenges, helping to save lives and improve the way we live.
• Physical sciences: Physical sciences are those academic disciplines that aim to uncover the underlying laws of nature, often written in the language of mathematics. It is a collective term for areas of study including astronomy, chemistry, materials science and physics.
• Earth and environmental sciences: Earth and environmental sciences cover all aspects of Earth and planetary science and broadly encompass solid Earth processes, surface and atmospheric dynamics, Earth system history, climate and climate change, marine and freshwater systems, and ecology. It also considers the interactions between humans and these systems.
• Biological sciences: Biological sciences encompass all the divisions of natural sciences examining various aspects of vital processes. The concept includes anatomy, physiology, cell biology, biochemistry and biophysics, and covers all organisms, from microorganisms and animals to plants.
• Health sciences: The health sciences study health, disease and healthcare. This field of study aims to develop knowledge, interventions and technology for use in healthcare to improve the treatment of patients.