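The Utility of Machine Learning Models for Predicting Chemical Contaminants in Drinking Water: Promise, Challenges, and Opportunities.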

Current Environmental Health Reports | Pub Date: 2023-03-01 | DOI: 10.1007/s40572-022-00389-x | IF 7.4 | Zone 2 (Medicine) | JCR Q1 (Public, Environmental & Occupational Health)
Xindi C Hu, Mona Dai, Jennifer M Sun, Elsie M Sunderland
Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9883334/pdf/
Citations: 5

Abstract

Purpose of review: This review aims to better understand the utility of machine learning algorithms for predicting spatial patterns of contaminants in United States (U.S.) drinking water.

Recent findings: We found 27 U.S. drinking water studies from the past ten years that used machine learning algorithms to predict water quality. Many studies (42%) developed random forest classification models for groundwater. Continuous models show low predictive power, suggesting that larger datasets and additional predictors are needed. Categorical/classification models for arsenic and nitrate that predict exceedances of pollution thresholds are most common in the literature because of good national-scale data coverage and priority as environmental health concerns. Most groundwater data used to develop models were obtained from the United States Geological Survey (USGS) National Water Information System (NWIS). Predictors were similar across contaminants, but challenges are posed by the lack of a standard methodology for imputation and pre-processing and by differing availability of data across regions. The 27 articles we reviewed focused on seven drinking water contaminants. Good performance metrics were reported for binary models that classified chemical concentrations above a threshold value by finding significant predictors. Classification models are especially useful for assisting in the design of sampling efforts by identifying high-risk areas. Only a few studies have developed continuous models, and obtaining good predictive performance for such models is still challenging. Improving continuous models is important for potential future use in epidemiological studies to fill data gaps in exposure assessments for drinking water contaminants. While significant progress has been made over the past decade, methodological advances are still needed for selecting appropriate model performance metrics and accounting for spatial autocorrelation in data. Finally, improved infrastructure for code and data sharing would accelerate advances in machine-learning models for drinking water quality.
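The abstract's point about selecting appropriate performance metrics for threshold-exceedance classifiers can be illustrated with a small sketch. The code below is not taken from the reviewed studies: the well concentrations are invented, the helper names (`exceeds`, `binary_metrics`) are hypothetical, and the 10 ug/L arsenic threshold is assumed here as the exceedance cutoff. It shows how a model that never predicts an exceedance can still score high on plain accuracy when exceedances are rare, while sensitivity and balanced accuracy expose the problem.

```python
from dataclasses import dataclass

# Assumed exceedance cutoff for this illustration: arsenic at 10 ug/L.
ARSENIC_THRESHOLD_UG_L = 10.0


@dataclass
class BinaryMetrics:
    accuracy: float
    sensitivity: float
    specificity: float
    balanced_accuracy: float


def exceeds(concentrations, threshold):
    """Label each concentration True if it is above the threshold."""
    return [c > threshold for c in concentrations]


def binary_metrics(observed, predicted):
    """Confusion-matrix metrics for paired observed/predicted boolean labels."""
    pairs = list(zip(observed, predicted))
    tp = sum(o and p for o, p in pairs)
    tn = sum((not o) and (not p) for o, p in pairs)
    fp = sum((not o) and p for o, p in pairs)
    fn = sum(o and (not p) for o, p in pairs)
    # Guard against empty classes so the ratios are always defined.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return BinaryMetrics(
        accuracy=(tp + tn) / len(pairs),
        sensitivity=sensitivity,
        specificity=specificity,
        balanced_accuracy=(sensitivity + specificity) / 2,
    )


# Invented arsenic measurements (ug/L) for ten hypothetical wells.
measured = [1.2, 3.4, 0.8, 15.0, 2.1, 5.5, 22.3, 4.0, 0.5, 7.9]
observed = exceeds(measured, ARSENIC_THRESHOLD_UG_L)

# A trivial "model" that predicts no exceedance anywhere.
predicted = [False] * len(measured)

naive = binary_metrics(observed, predicted)
print(naive)
```

With two exceedances in ten wells, the trivial model reaches 80% accuracy while detecting none of the high-risk wells, which is why class-aware metrics are usually the more informative choice for rare exceedances.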

Source journal: Current Environmental Health Reports | CiteScore: 13.60 | Self-citation rate: 1.30% | Articles published: 47
About the journal: Current Environmental Health Reports provides up-to-date expert reviews in environmental health. The goal is to evaluate and synthesize original research in all disciplines relevant for environmental health sciences, including basic research, clinical research, epidemiology, and environmental policy.