The relative data hungriness of unpenalized and penalized logistic regression and ensemble-based machine learning methods: the case of calibration.

Peter C Austin, Douglas S Lee, Bo Wang
{"title":"The relative data hungriness of unpenalized and penalized logistic regression and ensemble-based machine learning methods: the case of calibration.","authors":"Peter C Austin, Douglas S Lee, Bo Wang","doi":"10.1186/s41512-024-00179-z","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Machine learning methods are increasingly being used to predict clinical outcomes. Optimism is the difference in model performance between derivation and validation samples. The term \"data hungriness\" refers to the sample size needed for a modelling technique to generate a prediction model with minimal optimism. Our objective was to compare the relative data hungriness of different statistical and machine learning methods when assessed using calibration.</p><p><strong>Methods: </strong>We used Monte Carlo simulations to assess the effect of number of events per variable (EPV) on the optimism of six learning methods when assessing model calibration: unpenalized logistic regression, ridge regression, lasso regression, bagged classification trees, random forests, and stochastic gradient boosting machines using trees as the base learners. We performed simulations in two large cardiovascular datasets each of which comprised an independent derivation and validation sample: patients hospitalized with acute myocardial infarction and patients hospitalized with heart failure. We used six data-generating processes, each based on one of the six learning methods. We allowed the sample sizes to be such that the number of EPV ranged from 10 to 200 in increments of 10. We applied six prediction methods in each of the simulated derivation samples and evaluated calibration in the simulated validation samples using the integrated calibration index, the calibration intercept, and the calibration slope. We also examined Nagelkerke's R<sup>2</sup>, the scaled Brier score, and the c-statistic.</p><p><strong>Results: </strong>Across all 12 scenarios (2 diseases × 6 data-generating processes), penalized logistic regression displayed very low optimism even when the number of EPV was very low. Random forests and bagged trees tended to be the most data hungry and displayed the greatest optimism.</p><p><strong>Conclusions: </strong>When assessed using calibration, penalized logistic regression was substantially less data hungry than methods from the machine learning literature.</p>","PeriodicalId":72800,"journal":{"name":"Diagnostic and prognostic research","volume":"8 1","pages":"15"},"PeriodicalIF":0.0000,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11539735/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Diagnostic and prognostic research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1186/s41512-024-00179-z","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Background: Machine learning methods are increasingly being used to predict clinical outcomes. Optimism is the difference in model performance between derivation and validation samples. The term "data hungriness" refers to the sample size needed for a modelling technique to generate a prediction model with minimal optimism. Our objective was to compare the relative data hungriness of different statistical and machine learning methods when assessed using calibration.
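To make these definitions concrete, the following minimal sketch (illustrative only, not the authors' code; the function name and the choice of the c-statistic as the performance measure are assumptions) computes optimism as derivation-sample performance minus validation-sample performance:

```python
# Minimal sketch (not from the paper): optimism of a fitted model's c-statistic,
# i.e., apparent performance in the derivation sample minus performance in an
# independent validation sample. Any performance measure could be substituted.
from sklearn.metrics import roc_auc_score

def optimism(y_deriv, p_deriv, y_valid, p_valid):
    """Derivation-sample c-statistic minus validation-sample c-statistic."""
    return roc_auc_score(y_deriv, p_deriv) - roc_auc_score(y_valid, p_valid)
```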

Methods: We used Monte Carlo simulations to assess the effect of the number of events per variable (EPV) on the optimism of six learning methods when assessing model calibration: unpenalized logistic regression, ridge regression, lasso regression, bagged classification trees, random forests, and stochastic gradient boosting machines using trees as the base learners. We performed simulations in two large cardiovascular datasets, each of which comprised an independent derivation and validation sample: patients hospitalized with acute myocardial infarction and patients hospitalized with heart failure. We used six data-generating processes, each based on one of the six learning methods. We allowed the sample sizes to be such that the EPV ranged from 10 to 200 in increments of 10. We applied the six prediction methods in each of the simulated derivation samples and evaluated calibration in the simulated validation samples using the integrated calibration index (ICI), the calibration intercept, and the calibration slope. We also examined Nagelkerke's R², the scaled Brier score, and the c-statistic.
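The three calibration measures named above can be sketched briefly in Python (an illustrative sketch, not the authors' implementation; the lowess smoothing fraction, the clipping constant, and the function name are assumptions). The calibration slope is the coefficient from a logistic regression of the outcome on the logit of the predicted probability, the calibration intercept is the intercept of an intercept-only logistic regression with that logit as an offset, and the ICI is the mean absolute difference between the predicted probabilities and a smoothed estimate of the observed event rate.

```python
# Sketch of the validation-sample calibration metrics: calibration intercept,
# calibration slope, and integrated calibration index (ICI).
import numpy as np
import statsmodels.api as sm

def calibration_metrics(y, p, eps=1e-8):
    """y: binary outcomes; p: predicted event probabilities from a fitted model."""
    p = np.clip(p, eps, 1 - eps)
    lp = np.log(p / (1 - p))  # logit of the predicted probability

    # Calibration slope: coefficient from logistic regression of y on logit(p).
    slope_fit = sm.GLM(y, sm.add_constant(lp), family=sm.families.Binomial()).fit()
    slope = slope_fit.params[1]

    # Calibration intercept: intercept-only logistic regression with logit(p)
    # entered as an offset.
    int_fit = sm.GLM(y, np.ones_like(lp), family=sm.families.Binomial(),
                     offset=lp).fit()
    intercept = int_fit.params[0]

    # ICI: mean absolute difference between p and a lowess-smoothed estimate of
    # the observed event rate as a function of p (frac=0.75 is an assumption).
    smoothed = sm.nonparametric.lowess(y, p, frac=0.75, return_sorted=False)
    ici = np.mean(np.abs(p - smoothed))

    return intercept, slope, ici
```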

Results: Across all 12 scenarios (2 diseases × 6 data-generating processes), penalized logistic regression displayed very low optimism even when the number of EPV was very low. Random forests and bagged trees tended to be the most data hungry and displayed the greatest optimism.

Conclusions: When assessed using calibration, penalized logistic regression was substantially less data hungry than methods from the machine learning literature.
