The Utility of Machine Learning Models for Predicting Chemical Contaminants in Drinking Water: Promise, Challenges, and Opportunities.

Current Environmental Health Reports (IF 7.4; Zone 2, Medicine; JCR Q1, Public, Environmental & Occupational Health). Pub Date: 2023-03-01. Epub Date: 2022-12-17. DOI: 10.1007/s40572-022-00389-x

Xindi C Hu, Mona Dai, Jennifer M Sun, Elsie M Sunderland

Current Environmental Health Reports, volume 10, issue 1, pages 45-60. Publication type: Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9883334/pdf/

Abstract

Purpose of review: This review aims to better understand the utility of machine learning algorithms for predicting spatial patterns of contaminants in United States (U.S.) drinking water.

Recent findings: We found 27 U.S. drinking water studies from the past ten years that used machine learning algorithms to predict water quality. The largest share of studies (42%) developed random forest classification models for groundwater. Continuous models show low predictive power, suggesting that larger datasets and additional predictors are needed. Categorical/classification models for arsenic and nitrate that predict exceedances of pollution thresholds are most common in the literature because of good national-scale data coverage and the priority of these contaminants as environmental health concerns. Most groundwater data used to develop models were obtained from the United States Geological Survey (USGS) National Water Information System (NWIS). Predictors were similar across contaminants, but challenges are posed by the lack of a standard methodology for imputation and pre-processing and by differing availability of data across regions.

Summary: We reviewed 27 articles that focused on seven drinking water contaminants. Good performance metrics were reported for binary models that classified chemical concentrations above a threshold value by finding significant predictors. Classification models are especially useful for assisting in the design of sampling efforts by identifying high-risk areas. Only a few studies have developed continuous models, and obtaining good predictive performance for such models is still challenging. Improving continuous models is important for potential future use in epidemiological studies to fill data gaps in exposure assessments for drinking water contaminants. While significant progress has been made over the past decade, methodological advances are still needed for selecting appropriate model performance metrics and accounting for spatial autocorrelation in data. Finally, improved infrastructure for code and data sharing would enable more rapid advances in machine learning models for drinking water quality.
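Two of the methodological points in the abstract, the choice of performance metrics for threshold-exceedance classifiers and the handling of spatial autocorrelation, can be illustrated with a small sketch. It is not taken from the reviewed studies. The concentrations, the 10 µg/L threshold, the coordinates, and the function names `label_exceedance`, `binary_metrics`, and `spatial_block_folds` are all assumptions made for illustration. Grid-based blocking is shown as one common way to keep nearby samples in the same cross-validation fold; the review itself does not prescribe a specific method.

```python
import math

# Hypothetical threshold for this sketch (µg/L); real studies choose
# contaminant-specific regulatory or health-based values.
THRESHOLD_UG_L = 10.0


def label_exceedance(concentrations, threshold=THRESHOLD_UG_L):
    """Convert continuous concentrations to binary labels (1 = above threshold)."""
    return [1 if c > threshold else 0 for c in concentrations]


def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity, and balanced accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


def spatial_block_folds(coords, block_size_deg, n_folds):
    """Assign each (lon, lat) sample to a fold by grid block.

    Samples in the same grid cell always share a fold, so a model is never
    validated on points that sit next to its training points.
    """
    block_to_fold = {}
    folds = []
    for lon, lat in coords:
        key = (math.floor(lon / block_size_deg), math.floor(lat / block_size_deg))
        if key not in block_to_fold:
            # New blocks are assigned to folds in round-robin order.
            block_to_fold[key] = len(block_to_fold) % n_folds
        folds.append(block_to_fold[key])
    return folds


# Made-up well concentrations (µg/L): exceedances are rare (2 of 10).
concentrations = [2.0, 3.5, 1.0, 12.0, 0.5, 4.0, 25.0, 6.0, 0.8, 1.2]
y_true = label_exceedance(concentrations)

# A naive "model" that never predicts an exceedance.
y_pred = [0] * len(y_true)
metrics = binary_metrics(y_true, y_pred)
print(metrics)

# Made-up coordinates: two nearby wells in each of two distant areas.
coords = [(-71.1, 42.3), (-71.2, 42.4), (-93.5, 41.9), (-93.6, 41.8)]
folds = spatial_block_folds(coords, block_size_deg=1.0, n_folds=2)
print(folds)
```

In this toy example the naive model reaches 80% accuracy while detecting none of the exceedances, which is why sensitivity or balanced accuracy is often more informative than accuracy alone when exceedances are rare. The fold assignment places each pair of neighboring wells in the same fold.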

Source journal

Current Environmental Health Reports

CiteScore: 13.60

Self-citation rate: 1.30%

Journal description: Current Environmental Health Reports provides up-to-date expert reviews in environmental health. The goal is to evaluate and synthesize original research in all disciplines relevant to environmental health sciences, including basic research, clinical research, epidemiology, and environmental policy.