Comparing Regression Models with Count Data to Artificial Neural Network and Ensemble Models for Prediction of Generic Escherichia coli Population in Agricultural Ponds Based on Weather Station Measurements
Gonca Buyrukoğlu, Selim Buyrukoğlu, Zeynal Topalcengiz
Microbial Risk Analysis, volume 19, Article 100171. Published 2021-12-01. DOI: 10.1016/j.mran.2021.100171
https://www.sciencedirect.com/science/article/pii/S235235222100013X
Citations: 15
Abstract
Indicator microorganisms are monitored in agricultural waters to foster produce safety. Various prediction models are used to estimate the populations of indicator microorganisms and pathogens when no observations are available. The purpose of this study was to compare the performance of regression models for count data (zero-inflated Poisson and hurdle negative binomial) with that of artificial neural network and ensemble models (random forest and AdaBoost) for predicting the generic Escherichia coli population in agricultural surface waters from weather station measurements. Two-part count data models were built on E. coli population count frequencies in six classes (0, [1, 10), [10, 100), [100, 1000), [1000, 10000), and ≥10000), based on the data structure. The choice of artificial neural network, AdaBoost, and random forest was determined by the mean absolute error (MAE) across six pre-tested models. The MAE was also used to compare the performance of the two-part count data models with that of the artificial neural network and ensemble models. Over-dispersed E. coli population count frequencies were calculated to be between 2.2 and 52.2% for all ponds. Observed and predicted zero E. coli population counts for all ponds matched at 82 to 100% for the zero-inflated Poisson model and at 100% for the hurdle negative binomial model. Overdispersion reduced the performance of the tested models. The AdaBoost model with twelve estimators performed best, with the lowest MAE values for all ponds (0.87 to 46.60). The ensemble models used in this study performed better than the tested regression models for count data.
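The two quantities the abstract relies on, the count-frequency classes and the MAE, can be illustrated with a small sketch. This is not the authors' code: the function names and the sample values are hypothetical, and the handling of values below 1 (assigned to class 0) is an assumption, since the abstract does not say how fractional counts were treated.

```python
from bisect import bisect_right
from collections import Counter

# Lower edges of the non-zero classes named in the abstract:
# [1,10), [10,100), [100,1000), [1000,10000), >=10000
EDGES = [1, 10, 100, 1000, 10000]
LABELS = ["0", "[1,10)", "[10,100)", "[100,1000)", "[1000,10000)", ">=10000"]


def count_class(count):
    """Return the frequency-class label for a non-negative count.

    Assumption: any value below 1 is placed in class "0".
    """
    if count < 0:
        raise ValueError("counts must be non-negative")
    # bisect_right gives the number of edges <= count, which is the class index
    return LABELS[bisect_right(EDGES, count)]


def class_frequencies(counts):
    """Tally how many observations fall into each class, keeping empty classes."""
    freq = Counter(count_class(c) for c in counts)
    return {label: freq.get(label, 0) for label in LABELS}


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Made-up example values, not data from the study
observed = [0, 0, 3, 12, 150, 2400, 12000]
predicted = [0, 1, 5, 10, 100, 2000, 12500]

print(class_frequencies(observed))
print(mean_absolute_error(observed, predicted))
```

A lower MAE means predictions sit closer to the observed counts on average, which is the criterion the abstract uses to rank the models.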
About the journal:
The journal Microbial Risk Analysis accepts articles dealing with the study of risk analysis applied to microbial hazards. Manuscripts should cover at least one of the components of risk assessment (risk characterization, exposure assessment, etc.), risk management, and/or risk communication in any microbiology field (clinical, environmental, food, veterinary, etc.). The journal also accepts articles dealing with predictive microbiology, quantitative microbial ecology, mathematical modeling, risk studies applied to microbial ecology, quantitative microbiology for epidemiological studies, statistical methods applied to microbiology, and laws and regulatory policies aimed at lessening the risk of microbial hazards. Work focusing on risk studies of viruses, parasites, microbial toxins, antimicrobial-resistant organisms, genetically modified organisms (GMOs), and recombinant DNA products is also acceptable.