Understanding overfitting in random forest for probability estimation: a visualization and simulation study.

Lasai Barreñada, Paula Dhiman, Dirk Timmerman, Anne-Laure Boulesteix, Ben Van Calster
{"title":"Understanding overfitting in random forest for probability estimation: a visualization and simulation study.","authors":"Lasai Barreñada, Paula Dhiman, Dirk Timmerman, Anne-Laure Boulesteix, Ben Van Calster","doi":"10.1186/s41512-024-00177-1","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Random forests have become popular for clinical risk prediction modeling. In a case study on predicting ovarian malignancy, we observed training AUCs close to 1. Although this suggests overfitting, performance was competitive on test data. We aimed to understand the behavior of random forests for probability estimation by (1) visualizing data space in three real-world case studies and (2) a simulation study.</p><p><strong>Methods: </strong>For the case studies, multinomial risk estimates were visualized using heatmaps in a 2-dimensional subspace. The simulation study included 48 logistic data-generating mechanisms (DGM), varying the predictor distribution, the number of predictors, the correlation between predictors, the true AUC, and the strength of true predictors. For each DGM, 1000 training datasets of size 200 or 4000 with binary outcomes were simulated, and random forest models were trained with minimum node size 2 or 20 using the ranger R package, resulting in 192 scenarios in total. Model performance was evaluated on large test datasets (N = 100,000).</p><p><strong>Results: </strong>The visualizations suggested that the model learned \"spikes of probability\" around events in the training set. A cluster of events created a bigger peak or plateau (signal), isolated events local peaks (noise). In the simulation study, median training AUCs were between 0.97 and 1 unless there were 4 binary predictors or 16 binary predictors with a minimum node size of 20. The median discrimination loss, i.e., the difference between the median test AUC and the true AUC, was 0.025 (range 0.00 to 0.13). 
Median training AUCs had Spearman correlations of around 0.70 with discrimination loss. Median test AUCs were higher with higher events per variable, higher minimum node size, and binary predictors. Median training calibration slopes were always above 1 and were not correlated with median test slopes across scenarios (Spearman correlation - 0.11). Median test slopes were higher with higher true AUC, higher minimum node size, and higher sample size.</p><p><strong>Conclusions: </strong>Random forests learn local probability peaks that often yield near perfect training AUCs without strongly affecting AUCs on test data. When the aim is probability estimation, the simulation results go against the common recommendation to use fully grown trees in random forest models.</p>","PeriodicalId":72800,"journal":{"name":"Diagnostic and prognostic research","volume":"8 1","pages":"14"},"PeriodicalIF":0.0000,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11437774/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Diagnostic and prognostic research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1186/s41512-024-00177-1","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Background: Random forests have become popular for clinical risk prediction modeling. In a case study on predicting ovarian malignancy, we observed training AUCs close to 1. Although this suggests overfitting, performance was competitive on test data. We aimed to understand the behavior of random forests for probability estimation by (1) visualizing data space in three real-world case studies and (2) a simulation study.

Methods: For the case studies, multinomial risk estimates were visualized using heatmaps in a 2-dimensional subspace. The simulation study included 48 logistic data-generating mechanisms (DGM), varying the predictor distribution, the number of predictors, the correlation between predictors, the true AUC, and the strength of true predictors. For each DGM, 1000 training datasets of size 200 or 4000 with binary outcomes were simulated, and random forest models were trained with minimum node size 2 or 20 using the ranger R package, resulting in 192 scenarios in total. Model performance was evaluated on large test datasets (N = 100,000).
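The paper's models were fitted with the ranger R package. As a hedged sketch of the same kind of experiment in Python, the snippet below simulates one illustrative logistic DGM and trains random forests with two leaf-size settings using scikit-learn (`min_samples_leaf` standing in for ranger's `min.node.size`; the DGM parameters here are placeholders, not one of the paper's 48 settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def simulate(n, n_pred=16, beta=0.3):
    """Illustrative logistic DGM: standard-normal predictors, equal coefficients."""
    X = rng.standard_normal((n, n_pred))
    logit = X @ np.full(n_pred, beta)
    y = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))
    return X, y.astype(int)

X_train, y_train = simulate(200)       # training size 200 (the paper also used 4000)
X_test, y_test = simulate(100_000)     # large test set, as in the paper

for min_leaf in (2, 20):               # rough analogue of min.node.size 2 vs 20
    rf = RandomForestClassifier(n_estimators=500, min_samples_leaf=min_leaf,
                                random_state=0).fit(X_train, y_train)
    auc_train = roc_auc_score(y_train, rf.predict_proba(X_train)[:, 1])
    auc_test = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
    print(f"min_samples_leaf={min_leaf}: train AUC={auc_train:.3f}, "
          f"test AUC={auc_test:.3f}")
```

With the small leaf size, the training AUC approaches 1 while the test AUC stays far lower, mirroring the "optimistic training AUC, competitive test AUC" pattern described in the Background.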

Results: The visualizations suggested that the model learned "spikes of probability" around events in the training set. A cluster of events created a bigger peak or plateau (signal), whereas isolated events created local peaks (noise). In the simulation study, median training AUCs were between 0.97 and 1 unless there were 4 binary predictors or 16 binary predictors with a minimum node size of 20. The median discrimination loss, i.e., the difference between the median test AUC and the true AUC, was 0.025 (range 0.00 to 0.13). Median training AUCs had Spearman correlations of around 0.70 with discrimination loss. Median test AUCs were higher with higher events per variable, higher minimum node size, and binary predictors. Median training calibration slopes were always above 1 and were not correlated with median test slopes across scenarios (Spearman correlation −0.11). Median test slopes were higher with higher true AUC, higher minimum node size, and higher sample size.

Conclusions: Random forests learn local probability peaks that often yield near-perfect training AUCs without strongly affecting AUCs on test data. When the aim is probability estimation, the simulation results argue against the common recommendation to use fully grown trees in random forest models.
