Comparison of Random Forest and Stepwise Regression for Variable Selection Using Low Prevalence Predictors: A Case Study in Paediatric Sepsis
Patricia Gilholm, Paula Lister, Adam Irwin, Amanda Harley, Sainath Raman, Luregn J Schlapbach, Kristen S Gibbons
Maternal and Child Health Journal. Published 2025-01-15. DOI: 10.1007/s10995-025-04038-1 (https://doi.org/10.1007/s10995-025-04038-1)
Abstract
Introduction: Variable selection is a common technique to identify the most predictive variables from a pool of candidate predictors. Low prevalence predictors (LPPs) are frequently found in clinical data, yet few studies have explored their impact on model performance during variable selection. This study compared the Random Forest (RF) algorithm and stepwise regression (SWR) for variable selection using data from a paediatric sepsis screening tool, where 18 out of 32 predictors had a prevalence < 10%.
Methods: Variable selection using RF was compared to forward and backward SWR. The resulting models were compared on the area under the receiver operating characteristic curve (AUC) and on which variables they retained. Additionally, a simulation study assessed how increasing the prevalence of the predictors affected the variable selection results.
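As a rough illustration of why prevalence matters here, the AUC of a single binary predictor can be written in closed form as (1 + TPR - FPR) / 2. The sketch below is not the authors' simulation. It assumes a simple logistic model with one binary predictor, and the function name `binary_predictor_auc`, the baseline risk of 0.1, and the odds ratio of 3 are all hypothetical choices made for the example.

```python
import math


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_predictor_auc(prevalence, odds_ratio, baseline_risk=0.1):
    """Exact AUC of one binary predictor X for a binary outcome Y.

    Assumed model (illustrative only): P(Y=1 | X=0) = baseline_risk and
    the odds of Y are multiplied by odds_ratio when X=1.
    """
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be strictly between 0 and 1")
    if odds_ratio <= 0.0:
        raise ValueError("odds_ratio must be positive")

    intercept = math.log(baseline_risk / (1.0 - baseline_risk))
    risk_unexposed = baseline_risk
    risk_exposed = expit(intercept + math.log(odds_ratio))

    # Marginal outcome probability, then Bayes' rule for P(X=1 | Y).
    p_outcome = prevalence * risk_exposed + (1.0 - prevalence) * risk_unexposed
    tpr = prevalence * risk_exposed / p_outcome                  # P(X=1 | Y=1)
    fpr = prevalence * (1.0 - risk_exposed) / (1.0 - p_outcome)  # P(X=1 | Y=0)

    # A binary score yields a two-segment ROC curve with this area.
    return 0.5 * (1.0 + tpr - fpr)


if __name__ == "__main__":
    # Same strength of association, increasing prevalence.
    for prev in (0.01, 0.05, 0.10, 0.30):
        print(f"prevalence={prev:.2f}  AUC={binary_predictor_auc(prev, 3.0):.3f}")
```

Under these assumptions, a predictor with a fixed odds ratio adds little discrimination on its own when it is rare, which is one reason selection procedures may treat the same association differently at different prevalence levels.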
Results: The best-fitting RF and SWR models retained 22 and 17 predictors, respectively, of which 14 and 10 had a prevalence < 10%. The RF and SWR models had similar predictive performance (AUC [95% Confidence Interval]: RF 0.79 [0.77, 0.81], SWR 0.80 [0.78, 0.82]). The simulation study revealed differences for both RF and SWR models in variable importance rankings and predictor selection with increasing prevalence thresholds, particularly for moderately and strongly associated predictors.
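For readers who want to reproduce this kind of summary, an AUC with a 95% interval can be computed as in the minimal sketch below. The abstract does not say how the intervals were obtained; the percentile bootstrap used here is an assumption, and the function names `auc` and `bootstrap_auc_ci` and the toy data are hypothetical.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap interval for the AUC.

    Assumption: cases are resampled with replacement; resamples that
    contain only one outcome class are discarded and redrawn.
    """
    point = auc(labels, scores)  # also validates that both classes exist
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return point, lower, upper


if __name__ == "__main__":
    # Toy data, not from the study.
    y = [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
    s = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.6, 0.3, 0.15]
    point, lower, upper = bootstrap_auc_ci(y, s)
    print(f"AUC {point:.2f} [{lower:.2f}, {upper:.2f}]")
```

The pairwise AUC is quadratic in the sample size, which is fine for a sketch but slow for large datasets; a rank-based implementation would be the usual choice there.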
Discussion: The RF algorithm retained more very low prevalence predictors than SWR. However, the predictive performance of the two models was comparable, demonstrating that, when the methods are applied correctly and the number of candidate predictors is small, both are suitable for variable selection with low prevalence predictors.
About the journal:
Maternal and Child Health Journal is the first exclusive forum to advance the scientific and professional knowledge base of the maternal and child health (MCH) field. This bimonthly journal provides peer-reviewed papers addressing the following areas of MCH practice, policy, and research:
MCH epidemiology, demography, and health status assessment
Innovative MCH service initiatives
Implementation of MCH programs
MCH policy analysis and advocacy
MCH professional development.
Exploring the full spectrum of the MCH field, Maternal and Child Health Journal is an important tool for practitioners as well as academics in public health, obstetrics, gynecology, prenatal medicine, pediatrics, and neonatology.
Sponsors include the Association of Maternal and Child Health Programs (AMCHP), the Association of Teachers of Maternal and Child Health (ATMCH), and CityMatCH.