Prediction of soil arsenic concentration in European soils: A dimensionality reduction and ensemble learning approach
Authors: Mohammad Sadegh Barkhordari, Chongchong Qi
Journal: Journal of Hazardous Materials Advances, volume 17, Article 100604
Publication date: 2025-02-01 (Epub 2025-01-15)
DOI: 10.1016/j.hazadv.2025.100604
URL: https://www.sciencedirect.com/science/article/pii/S2772416625000166
Impact factor: 7.7; JCR: Q2 (Engineering, Environmental)
Citation count: 0
Abstract
Arsenic contamination in soils poses significant risks to human health and the environment, necessitating accurate prediction methods to support effective mitigation strategies. This study addresses critical gaps in previous research, including multicollinearity among predictor variables, limited consideration of anthropogenic factors, and insufficient use of dimensionality reduction techniques. Principal Component Analysis (PCA) was employed for feature extraction, and six ensemble learning models were compared to enhance prediction accuracy for arsenic concentrations in European soils. Key environmental, chemical, physical, and anthropogenic factors were incorporated. Random Forest emerged as the top-performing model, achieving a mean squared error of 0.71 and a prediction accuracy of 89% on test data. The results highlight the significant role of anthropogenic factors, particularly agricultural practices, in influencing arsenic levels, alongside chemical properties such as phosphorus concentration and soil pH. The study demonstrates that advanced feature engineering, including PCA, can address multicollinearity while improving machine learning model performance. The findings provide critical insights for environmental risk assessment and policymaking, emphasizing the need for targeted interventions in regions with high anthropogenic activity. By combining robust data preprocessing and state-of-the-art ensemble learning techniques, this research offers a scalable and effective framework for predicting soil contamination and guiding remediation efforts across diverse geographic settings.
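
To illustrate how PCA can address multicollinearity, the following is a minimal sketch on synthetic data, assuming NumPy is available. It is not the authors' pipeline: the function name `pca_fit_transform`, the generic feature names, the synthetic data, and the 95% explained-variance threshold are all assumptions made for illustration.

```python
import numpy as np


def pca_fit_transform(X, variance_to_keep=0.95):
    """Standardize X, then project it onto its leading principal components."""
    X = np.asarray(X, dtype=float)
    # Standardize each column so predictors on different scales contribute equally.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardized data: the rows of Vt are the principal axes.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    # Smallest number of components whose cumulative explained variance
    # reaches the requested threshold (capped at the number available).
    k = int(np.searchsorted(np.cumsum(explained), variance_to_keep) + 1)
    k = min(k, len(S))
    scores = Z @ Vt[:k].T
    return scores, explained[:k]


# Synthetic (hypothetical) predictors: feature_b is almost a linear copy of
# feature_a, which produces strong multicollinearity; feature_c is independent.
rng = np.random.default_rng(0)
n = 500
feature_a = rng.normal(0.0, 1.0, n)
feature_b = 3.0 * feature_a + rng.normal(0.0, 0.1, n)
feature_c = rng.normal(0.0, 1.0, n)
X = np.column_stack([feature_a, feature_b, feature_c])

scores, explained = pca_fit_transform(X)

# Correlation matrices before and after the projection.
raw_corr = np.corrcoef(X, rowvar=False)
pc_corr = np.corrcoef(scores, rowvar=False)

print("correlation of feature_a and feature_b:", raw_corr[0, 1])
print("components kept:", scores.shape[1])
print("correlation between kept components:", pc_corr[0, 1])
```

In this sketch the two nearly collinear inputs collapse onto a single component, and the retained component scores are mutually uncorrelated by construction. Scores like these could then be passed to an ensemble regressor such as Random Forest in place of the raw predictors.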