Trees, forests, chickens, and eggs: when and why to prune trees in a random forest

Statistical Analysis and Data Mining: The ASA Data Science Journal. Pub Date: 2021-03-30. DOI: 10.1002/sam.11594

Siyu Zhou, L. Mentch

Citations: 10

Abstract

Trees, forests, chickens, and eggs: when and why to prune trees in a random forest

Due to their long‐standing reputation as excellent off‐the‐shelf predictors, random forests (RFs) continue to remain a go‐to model of choice for applied statisticians and data scientists. Despite their widespread use, however, until recently, little was known about their inner workings and about which aspects of the procedure were driving their success. Very recently, two competing hypotheses have emerged–one based on interpolation and the other based on regularization. This work argues in favor of the latter by utilizing the regularization framework to reexamine the decades‐old question of whether individual trees in an ensemble ought to be pruned. Despite the fact that default constructions of RFs use near full depth trees in most popular software packages, here we provide strong evidence that tree depth should be seen as a natural form of regularization across the entire procedure. In particular, our work suggests that RFs with shallow trees are advantageous when the signal‐to‐noise ratio in the data is low. In building up this argument, we also critique the newly popular notion of “double descent” in RFs by drawing parallels to U‐statistics and arguing that the noticeable jumps in random forest accuracy are the result of simple averaging rather than interpolation.
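The claim that tree depth acts as a form of regularization can be explored with a small simulation. The sketch below is not the authors' experimental setup: it is a from-scratch, standard-library stand-in for a random forest (bootstrap samples plus a random feature subset at each split), with a cap on depth playing the role of pruning. The function names (`simulate`, `build_tree`, `fit_forest`, `predict_forest`, `compare_depths`), the linear signal `x0 + x1`, and all default sizes are assumptions made for illustration only.

```python
import random


def simulate(n, p, snr, rng):
    """Draw X ~ N(0, I) and y = x0 + x1 + noise at the requested SNR (assumed model)."""
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    f = [row[0] + row[1] for row in X]   # true signal, Var(f) = 2
    sigma = (2.0 / snr) ** 0.5           # SNR = Var(f) / sigma^2
    y = [fi + rng.gauss(0, sigma) for fi in f]
    return X, y, f


def build_tree(X, y, idx, depth, max_depth, mtry, rng, min_leaf=5):
    """Grow a regression tree on rows idx; a leaf is a float, a node a tuple."""
    mean = sum(y[i] for i in idx) / len(idx)
    if depth >= max_depth or len(idx) < 2 * min_leaf:
        return mean
    n = len(idx)
    total = sum(y[i] for i in idx)
    total_sq = sum(y[i] ** 2 for i in idx)
    best = None  # (sum of squared errors, feature index, threshold)
    for j in rng.sample(range(len(X[0])), mtry):  # random feature subset
        order = sorted(idx, key=lambda i: X[i][j])
        left_sum = left_sq = 0.0
        for k in range(1, n):  # k = number of rows sent to the left child
            v = y[order[k - 1]]
            left_sum += v
            left_sq += v * v
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if k < min_leaf or n - k < min_leaf or lo == hi:
                continue
            sse = (left_sq - left_sum ** 2 / k) + (
                total_sq - left_sq - (total - left_sum) ** 2 / (n - k))
            if best is None or sse < best[0]:
                best = (sse, j, lo)
    if best is None:
        return mean
    _, j, t = best
    left = [i for i in idx if X[i][j] <= t]
    right = [i for i in idx if X[i][j] > t]
    args = (depth + 1, max_depth, mtry, rng, min_leaf)
    return (j, t, build_tree(X, y, left, *args), build_tree(X, y, right, *args))


def predict_tree(node, x):
    """Walk from the root to a leaf and return the leaf mean."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if x[j] <= t else right
    return node


def fit_forest(X, y, n_trees, max_depth, mtry, rng):
    """Fit each tree on a bootstrap sample of the rows."""
    n = len(X)
    return [build_tree(X, y, [rng.randrange(n) for _ in range(n)],
                       0, max_depth, mtry, rng) for _ in range(n_trees)]


def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return sum(predict_tree(tree, x) for tree in forest) / len(forest)


def compare_depths(snr, depths, seed=0, n=200, p=5, n_trees=30):
    """Return {max_depth: test MSE against the noise-free signal}."""
    rng = random.Random(seed)
    X, y, _ = simulate(n, p, snr, rng)
    X_test, _, f_test = simulate(n, p, snr, rng)
    out = {}
    for d in depths:
        forest = fit_forest(X, y, n_trees, d, max(1, p // 3), rng)
        preds = [predict_forest(forest, x) for x in X_test]
        out[d] = sum((a - b) ** 2 for a, b in zip(preds, f_test)) / n
    return out


if __name__ == "__main__":
    for snr in (0.1, 5.0):  # one low-SNR and one high-SNR setting
        print(snr, compare_depths(snr, depths=(2, 4, 50)))
```

Running the script prints, for each signal-to-noise ratio, the test error of forests built with shallow, moderate, and effectively unrestricted depth. A single seed on a toy model proves nothing by itself; repeating the comparison over many seeds and SNR values would be needed before reading anything into the numbers.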

Source journal

Statistical Analysis and Data Mining: The ASA Data Science Journal

Self-citation rate

0.00%

Number of articles published