Out of (the) bag-encoding categorical predictors impacts out-of-bag samples.

PeerJ Computer Science (IF 3.5; Zone 4, Computer Science; JCR Q2, Computer Science, Artificial Intelligence). Pub Date: 2024-11-18. eCollection Date: 2024-01-01. DOI: 10.7717/peerj-cs.2445

Helen L Smith, Patrick J Biggs, Nigel P French, Adam N H Smith, Jonathan C Marshall

Journal article. PeerJ Computer Science, vol. 10, e2445. Published 2024-11-18. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11623134/pdf/


Abstract

The performance of random forest classification models is often assessed and interpreted using out-of-bag (OOB) samples. Observations that are OOB when a tree is trained can serve as a test set for that tree, and predictions for the OOB observations are used to calculate the OOB error and variable importance measures (VIM). OOB errors are popular because they are fast to compute and, for large samples, are a good estimate of the true prediction error. In this study, we investigate how target-based vs. target-agnostic encoding of categorical predictor variables for random forest can bias performance measures based on OOB samples. We show that, when categorical variables are encoded using a target-based encoding method and the encoding takes place prior to bagging, the OOB sample can underestimate the true misclassification rate and overestimate variable importance. We recommend using a separate test data set when evaluating variable importance and/or predictive performance of tree-based methods that utilise a target-based encoding method.
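The mechanism described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' code or experimental setup. It assumes a single categorical predictor that is pure noise with respect to a binary target, a simple mean-of-target encoding, and a fixed-threshold "stump" standing in for a tree. All function names (`make_noise_data`, `fit_encoding`, `oob_error`) are hypothetical. Because the predictor carries no signal, the true misclassification rate is 0.5, so any OOB error well below that reflects target information leaking through the encoding.

```python
import random
from collections import defaultdict


def make_noise_data(n, n_levels, seed=0):
    """Rows of (level, y): the categorical level is independent of the binary target."""
    rng = random.Random(seed)
    return [(rng.randrange(n_levels), rng.randrange(2)) for _ in range(n)]


def fit_encoding(rows):
    """Target encoding: mean of y per level, plus the overall mean as a fallback."""
    sums, counts = defaultdict(float), defaultdict(int)
    for level, y in rows:
        sums[level] += y
        counts[level] += 1
    prior = sum(sums.values()) / len(rows)
    return {level: sums[level] / counts[level] for level in counts}, prior


def oob_error(data, encode_before_bagging, n_trees=50, seed=0):
    """OOB misclassification rate of a threshold stump on the encoded predictor.

    Simplifying assumption: each "tree" predicts class 1 when the encoded
    value exceeds 0.5, instead of learning its own split.
    """
    rng = random.Random(seed)
    n = len(data)
    full_encoding = fit_encoding(data)  # uses every row, including future OOB rows
    errors = total = 0
    for _ in range(n_trees):
        in_bag_idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag_set = set(in_bag_idx)
        if encode_before_bagging:
            encoding, prior = full_encoding
        else:
            # Encode after bagging: only in-bag rows contribute to the encoding.
            encoding, prior = fit_encoding([data[i] for i in in_bag_idx])
        for i in range(n):
            if i in in_bag_set:
                continue  # only out-of-bag rows are scored
            level, y = data[i]
            score = encoding.get(level, prior)  # level unseen in-bag: use fallback
            pred = 1 if score > 0.5 else 0
            errors += pred != y
            total += 1
    return errors / total


if __name__ == "__main__":
    data = make_noise_data(n=2000, n_levels=500, seed=1)
    print("encode before bagging:", oob_error(data, encode_before_bagging=True))
    print("encode after bagging: ", oob_error(data, encode_before_bagging=False))
```

In this sketch, encoding before bagging lets each OOB row's own target value contribute to its encoded predictor, so the OOB error comes out optimistically low. Re-fitting the encoding on the in-bag rows only removes that leak. A separate test set, as the abstract recommends, avoids the problem regardless of where the encoding happens.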




Source journal: PeerJ Computer Science (Computer Science: General Computer Science)

CiteScore: 6.10
Self-citation rate: 5.30%
Articles published: 332
Review time: 10 weeks

About the journal: PeerJ Computer Science is the new open access journal covering all subject areas in computer science, with the backing of a prestigious advisory board and more than 300 academic editors.