Random Forests in Count Data Modelling: An Analysis of the Influence of Data Features and Overdispersion on Regression Performance

Journal of Probability and Statistics (IF 1.3, Q3, Statistics & Probability). Pub Date: 2022-12-01. DOI: 10.1155/2022/2833537

C. A. Mushagalusa, A. B. Fandohan, R. G. Glèlè Kakaï



Abstract

Machine learning algorithms, especially random forests (RFs), have become an integral part of modern scientific methodology and represent an efficient alternative to conventional parametric algorithms. This study aimed to assess the influence of data features and overdispersion on RF regression performance. We assessed the effect of the types of predictors (100, 75, 50, and 20% continuous, and 100% categorical), the number of predictors (p = 8, 16, and 24), and the sample size (N = 50, 250, and 1250) on RF parameter settings. We also compared RF performance to that of classical generalized linear models (Poisson, negative binomial, and zero-inflated Poisson) and the linear model applied to log-transformed data. Two real datasets were analysed to demonstrate the usefulness of RF for modelling overdispersed data. Goodness-of-fit statistics such as root mean square error (RMSE) and bias were used to determine RF accuracy and validity. Results revealed that the number of variables randomly selected for each split, the proportion of samples used to train the model, the minimal number of samples within each terminal node, and RF regression performance are not influenced by the sample size or by the number and type of predictors. However, the ratio of observations to the number of predictors affects the stability of the best RF parameters. RF performs well for all types of covariates and different levels of dispersion. The magnitude of dispersion does not significantly influence RF predictive validity. In contrast, its predictive accuracy is significantly influenced by the magnitude of dispersion in the response variable, conditional on the explanatory variables. RF performed almost as well as the models of the classical Poisson family in the presence of overdispersion. Given RF’s advantages, it is an appropriate statistical alternative for count data.
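To make the quantities named in the abstract concrete, the following is a minimal sketch in standard-library Python of RMSE, bias, and a simple dispersion check, together with a generator of overdispersed counts. It is an illustration under stated assumptions, not the authors' code: the function names, the sign convention for bias (predicted minus observed), the variance-to-mean ratio as a dispersion measure, and the gamma-Poisson mixture used to simulate counts are all choices made here, and the abstract does not specify any of them.

```python
import math
import random
from statistics import mean, variance


def rmse(observed, predicted):
    """Root mean square error between observed and predicted counts."""
    return math.sqrt(mean((o - p) ** 2 for o, p in zip(observed, predicted)))


def bias(observed, predicted):
    """Mean signed error, predicted minus observed (assumed convention)."""
    return mean(p - o for o, p in zip(observed, predicted))


def dispersion_index(counts):
    """Sample variance-to-mean ratio; values above 1 suggest overdispersion."""
    return variance(counts) / mean(counts)


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def overdispersed_counts(n, mu, shape, seed=0):
    """Simulate n counts from a gamma-Poisson mixture (negative binomial).

    Each rate is Gamma(shape, scale=mu/shape), so the counts have mean mu and
    variance mu + mu**2 / shape; a smaller shape gives more overdispersion.
    """
    rng = random.Random(seed)
    return [poisson_draw(rng, rng.gammavariate(shape, mu / shape)) for _ in range(n)]
```

For example, `dispersion_index(overdispersed_counts(2000, mu=5.0, shape=1.0))` should come out well above 1, whereas counts drawn with a very large `shape` approach the Poisson case with a ratio near 1. Predictions from any fitted model (an RF or a Poisson-family GLM) could then be passed to `rmse` and `bias` alongside the observed counts.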


Source journal: Journal of Probability and Statistics (Statistics & Probability)

Self-citation rate: 0.00%

Review time: 18 weeks