A probabilistic approach to training machine learning models using noisy data
Authors: Ayman H. Alzraiee, Richard G. Niswonger
Journal: Environmental Modelling & Software (impact factor 4.8; JCR Q1, Computer Science, Interdisciplinary Applications)
Published: 2024-07-02
DOI: 10.1016/j.envsoft.2024.106133
URL: https://www.sciencedirect.com/science/article/pii/S1364815224001944
Citation count: 0
Abstract
Machine learning (ML) models are increasingly popular in environmental and hydrologic modeling, but they typically contain uncertainties resulting from noisy data (erroneous or outlier data). This paper presents a novel probabilistic approach that combines ML and Markov Chain Monte Carlo simulation to (1) detect and underweight likely noisy data, (2) develop an approach capable of detecting noisy data during model deployment, and (3) interpret the reasons why a data point is deemed noisy to help heuristically distinguish between outliers and erroneous data. The new algorithm recognizes that there is no unique way to split the training data into noisy and clean data, and thus produces an ensemble of plausible splits. The algorithm successfully detected noisy data in synthetic benchmark problems with varying complexity and a real-world public supply water withdrawal dataset. The algorithm is generic and flexible, making it suitable for application across a broad range of hydrologic and environmental disciplines.
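The abstract does not describe the sampler itself, so the following is only a minimal sketch of the general idea of sampling an ensemble of clean/noisy splits with Markov Chain Monte Carlo. It is not the authors' algorithm. It assumes a straight-line model, a Metropolis sampler that proposes flipping one point's label at a time, and a fixed Gaussian error model. All names and parameters (`log_score`, `sample_splits`, `sigma`, `p_prior`, `outlier_logpdf`) are hypothetical and chosen for illustration.

```python
import numpy as np

def log_score(x, y, clean, sigma=0.5, p_prior=0.1, outlier_logpdf=np.log(1 / 30)):
    """Unnormalized log posterior of one clean/noisy split (illustrative assumption)."""
    n_clean = int(clean.sum())
    n_noisy = clean.size - n_clean
    if n_clean < 3:
        return -np.inf  # too few clean points to fit a line
    # Fit the model only to the points currently labeled clean.
    slope, intercept = np.polyfit(x[clean], y[clean], 1)
    resid = y[clean] - (slope * x[clean] + intercept)
    # Gaussian log-likelihood for clean points.
    ll_clean = np.sum(-0.5 * (resid / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi)))
    # Broad, flat likelihood for noisy points (assumed value range of width 30).
    ll_noisy = n_noisy * outlier_logpdf
    # Independent prior: each point is noisy with probability p_prior.
    log_prior = n_noisy * np.log(p_prior) + n_clean * np.log(1 - p_prior)
    return ll_clean + ll_noisy + log_prior

def sample_splits(x, y, n_iter=5000, burn_in=1000, seed=0):
    """Metropolis sampling over label vectors; returns per-point noisy frequency."""
    rng = np.random.default_rng(seed)
    n = len(x)
    clean = np.ones(n, dtype=bool)  # start with every point labeled clean
    current = log_score(x, y, clean)
    noisy_counts = np.zeros(n)
    kept = 0
    for it in range(n_iter):
        proposal = clean.copy()
        i = rng.integers(n)
        proposal[i] = ~proposal[i]  # symmetric proposal: flip one label
        candidate = log_score(x, y, proposal)
        if np.log(rng.random()) < candidate - current:
            clean, current = proposal, candidate
        if it >= burn_in:
            noisy_counts += ~clean
            kept += 1
    return noisy_counts / kept

# Synthetic example: a line with small noise and three corrupted points.
data_rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
y = 2 * x + 1 + data_rng.normal(0, 0.3, size=x.size)
bad = np.array([5, 20, 33])
y[bad] += 8.0
p_noisy = sample_splits(x, y)
print(np.round(p_noisy[bad], 2))
```

Each retained state of the chain is one plausible split of the data, and `p_noisy` summarizes the ensemble as the fraction of sampled splits in which each point was labeled noisy. Such frequencies could serve as weights during training, but how the paper actually underweights noisy points is not stated in the abstract.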
About the journal:
Environmental Modelling & Software publishes contributions, in the form of research articles, reviews and short communications, on recent advances in environmental modelling and/or software. The aim is to improve our capacity to represent, understand, predict or manage the behaviour of environmental systems at all practical scales, and to communicate those improvements to a wide scientific and professional audience.