Comparison of Random Forest and Stepwise Regression for Variable Selection Using Low Prevalence Predictors: A case Study in Paediatric Sepsis.

Impact factor: 1.8 | Zone 4 (Medicine) | JCR Q3, Public, Environmental & Occupational Health | Maternal and Child Health Journal | Pub Date: 2025-01-15 | DOI: 10.1007/s10995-025-04038-1
Patricia Gilholm, Paula Lister, Adam Irwin, Amanda Harley, Sainath Raman, Luregn J Schlapbach, Kristen S Gibbons
Citations: 0

Abstract

Introduction: Variable selection is a common technique to identify the most predictive variables from a pool of candidate predictors. Low prevalence predictors (LPPs) are frequently found in clinical data, yet few studies have explored their impact on model performance during variable selection. This study compared the Random Forest (RF) algorithm and stepwise regression (SWR) for variable selection using data from a paediatric sepsis screening tool, where 18 out of 32 predictors had a prevalence < 10%.
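To make the 10% threshold concrete, the following minimal sketch computes the prevalence of each binary predictor and flags those below the threshold. The predictor names and toy data are hypothetical and are not taken from the study's screening tool.

```python
def prevalence(column):
    """Proportion of records in which a binary (0/1) predictor is present."""
    return sum(column) / len(column)


def low_prevalence_predictors(data, threshold=0.10):
    """Return the names of predictors whose prevalence is strictly below threshold."""
    return [name for name, col in data.items() if prevalence(col) < threshold]


# Hypothetical toy data: 20 patients, three binary screening items.
toy = {
    "fever": [1] * 8 + [0] * 12,              # 8/20 = 40% prevalence
    "altered_behaviour": [1] * 1 + [0] * 19,  # 1/20 = 5% prevalence
    "mottled_skin": [1] * 2 + [0] * 18,       # 2/20 = 10%, not strictly below 10%
}
flagged = low_prevalence_predictors(toy)
```

Only `altered_behaviour` is flagged here, because the comparison is strict (`< 10%`), matching the wording of the abstract.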

Methods: Variable selection using RF was compared with forward and backward SWR. The models were compared on the area under the receiver operating characteristic curve (AUC) and on the variables they retained. Additionally, a simulation study assessed how increasing the prevalence of the predictors affected the variable selection results.
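As an illustration of the forward variant of stepwise regression, the sketch below adds one predictor at a time to a logistic model while the fit criterion improves. The abstract does not state the selection criterion or software used, so the choice of AIC, the plain gradient-descent fitter, and the toy data are all assumptions made for this example; a real analysis would use an established statistics package.

```python
import math


def fit_logistic(rows, y, steps=2000, lr=0.5):
    """Fit a logistic regression by gradient ascent and return its log-likelihood.

    rows: one list of feature values per patient (intercept is added here).
    """
    X = [[1.0] + list(r) for r in rows]   # prepend intercept column
    n, k = len(X), len(X[0])
    beta = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
            for j in range(k):
                grad[j] += (yi - p) * xi[j]
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    loglik = 0.0
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
        loglik += math.log(p) if yi == 1 else math.log(1.0 - p)
    return loglik


def aic(data, y, selected):
    """AIC of the logistic model using the selected predictors (assumed criterion)."""
    rows = [[data[name][i] for name in selected] for i in range(len(y))]
    return 2 * (len(selected) + 1) - 2 * fit_logistic(rows, y)


def forward_stepwise(data, y):
    """Add the predictor that lowers AIC most; stop when none lowers it."""
    selected = []
    best = aic(data, y, selected)          # intercept-only model
    while True:
        scored = [(aic(data, y, selected + [name]), name)
                  for name in data if name not in selected]
        if not scored:
            break
        cand_aic, cand = min(scored)
        if cand_aic >= best:
            break
        selected.append(cand)
        best = cand_aic
    return selected


# Hypothetical toy data: 40 patients, one informative and one uninformative predictor.
y = [1] * 16 + [0] * 4 + [1] * 4 + [0] * 16
data = {
    "signal": [1] * 20 + [0] * 20,         # outcome rate 80% vs 20%
    "noise": [i % 2 for i in range(40)],   # balanced across outcome groups
}
selected = forward_stepwise(data, y)
```

On this toy data only `signal` is retained: adding `noise` does not improve the log-likelihood, so the AIC penalty for the extra parameter rules it out. Backward selection follows the same pattern in reverse, starting from the full model and dropping predictors.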

Results: The best-fitting RF and SWR models retained 22 and 17 predictors, respectively, of which 14 and 10 had a prevalence < 10%. The RF and SWR models had similar predictive performance (RF: AUC [95% Confidence Interval] 0.79 [0.77, 0.81]; SWR: 0.80 [0.78, 0.82]). The simulation study revealed differences for both RF and SWR models in variable importance rankings and predictor selection with increasing prevalence thresholds, particularly for moderately and strongly associated predictors.
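For readers less familiar with how an AUC and its interval can be obtained, here is a minimal sketch. The AUC is computed as the probability that a randomly chosen case receives a higher score than a randomly chosen non-case. The abstract does not say how the 95% confidence intervals were derived, so the percentile bootstrap used here is an assumption, and the labels and scores are hypothetical.

```python
import random


def auc(labels, scores):
    """Probability that a random case outscores a random non-case (ties count 0.5)."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC, resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:                # both classes are needed to define the AUC
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical toy data: 5 non-cases followed by 5 cases, with model scores.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.35, 0.5, 0.7, 0.8, 0.9]
point = auc(labels, scores)                # 22 of 25 case/non-case pairs are ordered correctly
lower, upper = bootstrap_auc_ci(labels, scores)
```

With only ten records the interval is wide; the narrow intervals reported above reflect a much larger sample.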

Discussion: The RF algorithm retained more very low prevalence predictors than SWR. However, the predictive performance of the two models was comparable, demonstrating that, when applied correctly and when the number of candidate predictors is small, both methods are suitable for variable selection with low prevalence predictors.

Source journal
Maternal and Child Health Journal (Public, Environmental & Occupational Health)
CiteScore: 3.20
Self-citation rate: 4.30%
Articles published: 271
About the journal: Maternal and Child Health Journal is the first exclusive forum to advance the scientific and professional knowledge base of the maternal and child health (MCH) field. This bimonthly provides peer-reviewed papers addressing the following areas of MCH practice, policy, and research: MCH epidemiology, demography, and health status assessment; innovative MCH service initiatives; implementation of MCH programs; MCH policy analysis and advocacy; and MCH professional development. Exploring the full spectrum of the MCH field, Maternal and Child Health Journal is an important tool for practitioners as well as academics in public health, obstetrics, gynecology, prenatal medicine, pediatrics, and neonatology. Sponsors include the Association of Maternal and Child Health Programs (AMCHP), the Association of Teachers of Maternal and Child Health (ATMCH), and CityMatCH.