Comparing Regression Models with Count Data to Artificial Neural Network and Ensemble Models for Prediction of Generic Escherichia coli Population in Agricultural Ponds Based on Weather Station Measurements

IF 4 | Zone 4 (Environmental Science and Ecology) | JCR Q2 (ENVIRONMENTAL SCIENCES) | Microbial Risk Analysis | Pub Date: 2021-12-01 | DOI: 10.1016/j.mran.2021.100171

Gonca Buyrukoğlu, Selim Buyrukoğlu, Zeynal Topalcengiz

Microbial Risk Analysis, volume 19, Article 100171

Citations: 15

Abstract

Indicator microorganisms are monitored in agricultural waters to foster produce safety. Various prediction models are used to estimate the populations of indicator microorganisms and pathogens when no observations are available. The purpose of this study was to compare the performance of regression models with count data (zero-inflated Poisson and hurdle negative binomial) with that of artificial neural network and ensemble models (random forest and AdaBoost) for the prediction of generic Escherichia coli populations in agricultural surface waters in relation to weather station measurements. Two-part count data models were built on E. coli population count frequencies (0, [1,10), [10,100), [100,1000), [1000,10000), and ≥10000) based on the data structure. The use of artificial neural network, AdaBoost, and random forest models was determined based on the mean absolute error (MAE) values of six pre-tested models. The MAE was also used to compare the performance of the two-part count data models with that of the artificial neural network and ensemble models. Over-dispersed E. coli population count frequencies were calculated at between 2.2 and 52.2% for all ponds. Across all ponds, the match between observed and predicted zero E. coli population counts ranged from 82 to 100% for the zero-inflated Poisson model and was 100% for the hurdle negative binomial model. Overdispersion reduced the performance of the tested models. AdaBoost with twelve estimators (AdaBoost-Twelve Estimators) performed best, with the lowest MAE values for all ponds (from 0.87 to 46.60). The ensemble models used in this study performed more promisingly than the tested regression models with count data.
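As a rough illustration of the quantities named in the abstract, the following Python sketch bins counts into the stated frequency classes, computes the MAE, and computes a variance-to-mean ratio as one common overdispersion check. It is not the authors' code: the function names (`count_class`, `mean_absolute_error`, `dispersion_index`) are hypothetical, counts are assumed to be non-negative integers, and the use of the population variance and of the variance-to-mean ratio is an assumption, since the abstract does not say how overdispersion was quantified.

```python
from bisect import bisect_right
from statistics import mean, pvariance

# Class boundaries taken from the abstract: 0, [1,10), [10,100), ...
EDGES = [1, 10, 100, 1000, 10000]
LABELS = ["0", "[1,10)", "[10,100)", "[100,1000)", "[1000,10000)", ">=10000"]


def count_class(count):
    """Return the frequency-class label for a non-negative integer count."""
    if count < 0:
        raise ValueError("counts must be non-negative")
    # bisect_right gives the number of edges that are <= count,
    # which is exactly the index of the class the count falls into.
    return LABELS[bisect_right(EDGES, count)]


def mean_absolute_error(observed, predicted):
    """Mean of |observed - predicted| over paired values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return mean(abs(o - p) for o, p in zip(observed, predicted))


def dispersion_index(counts):
    """Variance-to-mean ratio (assumed check, not the paper's method).

    A Poisson variable has a ratio of 1; values well above 1 suggest
    overdispersion.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when the mean is 0")
    return pvariance(counts) / m
```

For example, `count_class(250)` falls in the `[100,1000)` class, and a lower `mean_absolute_error` for one model than for another on the same pond is the comparison criterion the abstract describes.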



Source journal

Microbial Risk Analysis, Medicine-Microbiology (medical)

CiteScore: 5.70

Self-citation rate: 7.10%

Articles published: (not given)

Review time: 52 days

Journal description: The journal Microbial Risk Analysis accepts articles dealing with the study of risk analysis applied to microbial hazards. Manuscripts should cover at least one of the components of risk assessment (risk characterization, exposure assessment, etc.), risk management, and/or risk communication in any microbiology field (clinical, environmental, food, veterinary, etc.). The journal also accepts articles dealing with predictive microbiology, quantitative microbial ecology, mathematical modeling, risk studies applied to microbial ecology, quantitative microbiology for epidemiological studies, statistical methods applied to microbiology, and laws and regulatory policies aimed at lessening the risk of microbial hazards. Work focusing on risk studies of viruses, parasites, microbial toxins, antimicrobial-resistant organisms, genetically modified organisms (GMOs), and recombinant DNA products is also acceptable.