Probabilistic mapping of imbalanced data for groundwater contamination using classification algorithms: Performance and reliability
Yang Qiu, Aiguo Zhou, Hanxiang Xiong, Defang Zhang, Cheng Su, Shizheng Zhou, Lin Go, Chi Yang, Hao Cui, Wei Fan, Yao Yu, Fawang Zhang, Chuanming Ma
Groundwater for Sustainable Development, vol. 28, Article 101393. Published 2025-02-01. DOI: 10.1016/j.gsd.2024.101393
Impact factor: 4.9. JCR: Q2 (Engineering, Environmental)
https://www.sciencedirect.com/science/article/pii/S2352801X24003163
Citations: 0
Abstract
The probabilistic mapping of groundwater contamination is a crucial foundation for sustainable groundwater management. However, groundwater data are often imbalanced, which makes precise and reliable probability mapping difficult. This study focused on the Jianghan Plain and evaluated the performance and reliability of various sampling and ensemble techniques on a small, imbalanced dataset (n = 246, Class0/Class1 = 0.84/0.16). The probabilistic maps revealed significant spatial variability: high-probability areas were concentrated in the western (Yichang City), eastern (Wuhan), and northern (north bank of the Han River) regions, while low-probability areas lay in the central and southern regions. Over-sampling methods outperformed the others by maintaining class balance and enhancing the reliability of the mapping outcomes. For over-sampling methods, the high to very high probability areas covered 15.5%–18.9% of the study area and the very low to low probability areas covered 60.5%–66.3%. In contrast, under-sampling and ensemble methods produced larger high to very high probability areas (34.0%–53.1%) and smaller very low to low probability areas (21.6%–46.3%). Over-sampling methods also achieved higher F1 scores (0.27–0.33) and precision (0.375–0.43) than the other methods. SHAP analysis showed that over-sampling methods balance the dataset while preserving information integrity, which enhances the credibility of the mapping results. Conversely, ensemble methods faced challenges in statistical analysis, which hindered interpretability. We strongly recommend that probabilistic mapping of groundwater contamination adequately account for dataset imbalance and not rely solely on metrics such as AUC and OA. For small datasets like the one in this study, SMOTE and ADASYN are the recommended sampling methods: they not only yield high-precision mapping results but also ensure interpretability, thereby providing a more reliable basis for sustainable groundwater management.
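The warning against relying solely on OA can be illustrated with a small sketch. It is not the authors' code or data. It assumes that OA means overall accuracy, and it assumes class counts of 207 and 39, obtained by rounding the reported n = 246 and 0.84/0.16 split. The function names are hypothetical. The sketch scores a trivial classifier that labels every sample as Class0.

```python
# Illustrative sketch only. The class counts are an assumption derived by
# rounding the abstract's n = 246 and Class0/Class1 = 0.84/0.16 split;
# they are not taken from the paper's dataset.

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def summary_metrics(y_true, y_pred, positive=1):
    """Overall accuracy, precision, recall and F1 for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Undefined ratios (zero denominator) are reported as 0.0 by convention.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


n_total = 246
n_positive = round(n_total * 0.16)   # assumed count of contaminated samples
n_negative = n_total - n_positive    # assumed count of uncontaminated samples

y_true = [0] * n_negative + [1] * n_positive
y_majority = [0] * n_total           # trivial model: always predict Class0

baseline = summary_metrics(y_true, y_majority)
for name, value in baseline.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed counts, the trivial classifier reaches an overall accuracy of about 0.84 while its precision, recall and F1 for the contaminated class are all zero. This is why minority-class metrics such as F1 and precision are the more informative figures for a dataset with this split.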
About the journal:
Groundwater for Sustainable Development is directed to different stakeholders and professionals, including government and non-governmental organizations, international funding agencies, universities, public water institutions, public health and other public/private sector professionals, and other relevant institutions. It is aimed at professionals, academics and students in disciplines such as groundwater and its connection to surface hydrology and environment, soil sciences, engineering, ecology, microbiology, atmospheric sciences, analytical chemistry, hydro-engineering, water technology, environmental ethics, economics, public health, policy, as well as social sciences, legal disciplines, or any other area connected with water issues. The objectives of this journal are to facilitate:
• The improvement of effective and sustainable management of water resources across the globe.
• The improvement of human access to groundwater resources in adequate quantity and good quality.
• The meeting of the increasing demand for drinking and irrigation water needed for food security, contributing to socially and economically sound human development.
• The creation of a global inter- and multidisciplinary platform and forum to improve our understanding of groundwater resources and to advocate their effective and sustainable management and protection against contamination.
• Interdisciplinary information exchange and the stimulation of scientific research in groundwater-related sciences and the social and health sciences, as required to achieve the United Nations Millennium Development Goals for sustainable development.