A probabilistic approach to training machine learning models using noisy data

IF 4.8 | Region 2 (Environmental Science and Ecology) | JCR Q1 (Computer Science, Interdisciplinary Applications) | Environmental Modelling & Software | Pub Date: 2024-07-02 | DOI: 10.1016/j.envsoft.2024.106133
Ayman H. Alzraiee, Richard G. Niswonger
Citations: 0

Abstract


Machine learning (ML) models are increasingly popular in environmental and hydrologic modeling, but they typically contain uncertainties resulting from noisy data (erroneous or outlier data). This paper presents a novel probabilistic approach that combines ML and Markov Chain Monte Carlo simulation to (1) detect and underweight likely noisy data, (2) develop an approach capable of detecting noisy data during model deployment, and (3) interpret the reasons why a data point is deemed noisy to help heuristically distinguish between outliers and erroneous data. The new algorithm recognizes that there is no unique way to split the training data into noisy and clean data, and thus produces an ensemble of plausible splits. The algorithm successfully detected noisy data in synthetic benchmark problems with varying complexity and a real-world public supply water withdrawal dataset. The algorithm is generic and flexible, making it suitable for application across a broad range of hydrologic and environmental disciplines.
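
The idea of sampling an ensemble of plausible clean/noisy splits can be illustrated with a toy example. The sketch below is not the authors' algorithm, whose details are not given in this abstract; it is a minimal Metropolis sampler written under several assumptions introduced only for illustration: a straight-line model fitted by least squares stands in for the ML model, the clean-data noise level `sigma` and the prior probability `p_noisy` are treated as known, and noisy points are given a flat density over the observed range of `y`. All function and variable names are hypothetical.

```python
import numpy as np

def log_score(z, x, y, sigma=0.3, p_noisy=0.1):
    """Unnormalised log posterior of one split z (1 = noisy, 0 = clean)."""
    clean = z == 0
    if clean.sum() < 3:                      # too few clean points to fit a line
        return -np.inf
    A = np.column_stack([x[clean], np.ones(clean.sum())])
    coef, *_ = np.linalg.lstsq(A, y[clean], rcond=None)   # refit on clean points only
    resid = y[clean] - A @ coef
    # Gaussian likelihood for clean points (sigma is an assumed, known value)
    ll_clean = np.sum(-0.5 * (resid / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi)))
    # Flat density over the observed y range for points flagged as noisy (assumption)
    ll_noisy = -z.sum() * np.log(y.max() - y.min())
    log_prior = z.sum() * np.log(p_noisy) + clean.sum() * np.log(1 - p_noisy)
    return ll_clean + ll_noisy + log_prior

def sample_splits(x, y, n_steps=4000, burn_in=1000, seed=0):
    """Metropolis sampling over binary masks; returns the post-burn-in ensemble."""
    rng = np.random.default_rng(seed)
    z = np.zeros(len(x), dtype=int)          # start with every point labelled clean
    current = log_score(z, x, y)
    samples = []
    for step in range(n_steps):
        i = rng.integers(len(x))             # symmetric proposal: flip one label
        proposal = z.copy()
        proposal[i] = 1 - proposal[i]
        candidate = log_score(proposal, x, y)
        if np.log(rng.random()) < candidate - current:
            z, current = proposal, candidate
        if step >= burn_in:
            samples.append(z.copy())
    return np.array(samples)

# Synthetic data: a line with small noise, plus six deliberately corrupted points
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 60)
y = 2 * x + 1 + rng.normal(0, 0.3, 60)
corrupted = np.arange(0, 60, 10)
y[corrupted] += 8.0

splits = sample_splits(x, y)                 # ensemble of plausible splits
p_noisy_hat = splits.mean(axis=0)            # per-point probability of being noisy
weights = 1.0 - p_noisy_hat                  # one simple way to underweight noisy points
```

Each row of `splits` is one plausible partition of the training data, and averaging over rows gives a per-point probability of being noisy rather than a single hard label. Converting those probabilities to training weights via `1 - p_noisy_hat` is one simple choice among several.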

Source journal
Environmental Modelling & Software (category: Engineering and Technology, Engineering: Environmental)
CiteScore: 9.30
Self-citation rate: 8.20%
Articles published: 241
Review time: 60 days
About the journal: Environmental Modelling & Software publishes contributions, in the form of research articles, reviews and short communications, on recent advances in environmental modelling and/or software. The aim is to improve our capacity to represent, understand, predict or manage the behaviour of environmental systems at all practical scales, and to communicate those improvements to a wide scientific and professional audience.
Latest articles from this journal
A coordination attention residual U-Net model for enhanced short and mid-term sea surface temperature prediction
An R package to partition observation data used for model development and evaluation to achieve model generalizability
Dynamics of real-time forecasting failure and recovery due to data gaps: A study using EnKF-based assimilation with the Lorenz model
Identification of pedestrian submerged parts in urban flooding based on images and deep learning
A conceptual data modeling framework with four levels of abstraction for environmental information