Probabilistic mapping of imbalanced data for groundwater contamination using classification algorithms: Performance and reliability

IF 4.9 | Q2 (Engineering, Environmental) | Groundwater for Sustainable Development | Pub Date: 2025-02-01 | Epub Date: 2024-12-11 | DOI: 10.1016/j.gsd.2024.101393
Yang Qiu, Aiguo Zhou, Hanxiang Xiong, Defang Zhang, Cheng Su, Shizheng Zhou, Lin Go, Chi Yang, Hao Cui, Wei Fan, Yao Yu, Fawang Zhang, Chuanming Ma
Groundwater for Sustainable Development, volume 28, Article 101393. Journal article; not open access. https://www.sciencedirect.com/science/article/pii/S2352801X24003163
Citations: 0

Abstract

The probabilistic mapping of groundwater contamination is a crucial foundation for sustainable groundwater management. However, groundwater data are often imbalanced, which makes precise and reliable probability mapping difficult. This study focused on the Jianghan Plain and evaluated the performance and reliability of various sampling and ensemble techniques on a small, imbalanced dataset (n = 246, Class0/Class1 = 0.84/0.16). The probabilistic maps revealed significant spatial variability: high-probability areas were concentrated in the western (Yichang City), eastern (Wuhan), and northern (north bank of the Han River) regions, while low-probability areas lay in the central and southern regions. Over-sampling methods outperformed the others by maintaining class balance and enhancing the reliability of mapping outcomes. For over-sampling methods, the high to very high probability areas ranged from 15.5% to 18.9%, with larger very low to low areas (60.5%–66.3%). In contrast, under-sampling and ensemble methods produced larger high to very high probability areas (34.0%–53.1%) and smaller very low to low areas (21.6%–46.3%). Over-sampling methods also achieved higher F1 scores (0.27–0.33) and precision (0.375–0.43) than the other methods. SHAP analysis demonstrated that over-sampling methods balance datasets while preserving information integrity, which enhances the credibility of the mapping results. Conversely, ensemble methods faced challenges in statistical analysis, which hindered interpretability. We strongly recommend that probabilistic mapping of groundwater contamination adequately account for dataset imbalance and not rely solely on metrics such as the area under the ROC curve (AUC) and overall accuracy (OA). For small datasets like the one in this study, SMOTE and ADASYN are the recommended sampling methods: they not only yield high-precision mapping results but also ensure interpretability, thereby providing a more reliable basis for sustainable groundwater management.
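The warning against relying solely on OA can be illustrated with a small sketch. It is not the authors' evaluation code: the class counts are an assumption obtained by rounding the reported 0.84/0.16 split of n = 246, and the predictions of the second classifier are invented purely for illustration. A classifier that always predicts the majority class reaches a high OA while detecting no contaminated samples, so its precision, recall, and F1 are all zero.

```python
# Illustrative only: class counts are rounded from the reported proportions
# (assumption), and the "hypothetical model" predictions are made up.

def scores(y_true, y_pred):
    """Return overall accuracy (OA), precision, recall and F1 for class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    oa = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"OA": oa, "precision": precision, "recall": recall, "F1": f1}

N_TOTAL = 246
N_POS = round(N_TOTAL * 0.16)   # assumed number of Class1 samples
N_NEG = N_TOTAL - N_POS         # assumed number of Class0 samples

# Ground truth: positives first, then negatives.
y_true = [1] * N_POS + [0] * N_NEG

# Baseline: always predict the majority class (Class0).
y_majority = [0] * N_TOTAL
baseline = scores(y_true, y_majority)

# Hypothetical model: finds 12 positives at the cost of 18 false alarms.
y_model = [1] * 12 + [0] * (N_POS - 12) + [1] * 18 + [0] * (N_NEG - 18)
model = scores(y_true, y_model)

for name, result in (("majority baseline", baseline), ("hypothetical model", model)):
    print(name, {k: round(v, 3) for k, v in result.items()})
```

In this sketch the majority baseline has the higher OA but an F1 of zero, while the hypothetical model gives up a little OA in exchange for non-zero precision and F1. That is the behaviour that makes minority-class metrics more informative on imbalanced data.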

Source journal
Groundwater for Sustainable Development (Social Sciences: Geography, Planning and Development)
CiteScore: 11.50
Self-citation rate: 10.20%
Articles published: 152
Journal description: Groundwater for Sustainable Development is directed to different stakeholders and professionals, including government and non-governmental organizations, international funding agencies, universities, public water institutions, public health and other public/private sector professionals, and other relevant institutions. It is aimed at professionals, academics and students in disciplines such as groundwater and its connection to surface hydrology and environment, soil sciences, engineering, ecology, microbiology, atmospheric sciences, analytical chemistry, hydro-engineering, water technology, environmental ethics, economics, public health, policy, as well as social sciences, legal disciplines, or any other area connected with water issues. The objectives of this journal are to facilitate:
• The improvement of effective and sustainable management of water resources across the globe.
• The improvement of human access to groundwater resources in adequate quantity and good quality.
• The meeting of the increasing demand for drinking and irrigation water needed for food security, to contribute to socially and economically sound human development.
• The creation of a global inter- and multidisciplinary platform and forum to improve our understanding of groundwater resources and to advocate their effective and sustainable management and protection against contamination.
• Interdisciplinary information exchange and the stimulation of scientific research in groundwater-related sciences and the social and health sciences required to achieve the United Nations Millennium Development Goals for sustainable development.