Tackling the small imbalanced horizontal dataset regressions by Stability Selection and SMOGN: a case study of ventilation-free days prediction in the pediatric intensive care unit and the importance of PRISM
Authors: Milad Rad, Alireza Rafiei, Jocelyn Grunwell, Rishikesan Kamaleswaran
Journal: International Journal of Medical Informatics, volume 196, Article 105809 (impact factor 3.7; JCR Q2, Computer Science, Information Systems)
Publication date: 2025-01-25
DOI: 10.1016/j.ijmedinf.2025.105809
URL: https://www.sciencedirect.com/science/article/pii/S1386505625000267
Citations: 0
Abstract
Objective
Regression on small, imbalanced horizontal datasets is an important problem in bioinformatics because rare but vital data points strongly affect model performance. Most clinical studies have imbalanced distributions, which impairs the learning ability of regression and classification models. Combined with a small number of samples, this imbalance further reduces prediction performance. Improving the trainability of small imbalanced datasets would therefore substantially strengthen current prediction models that rely on a small set of valuable, expensive samples.
Materials and methods
Stability Selection was used to overcome the high-dimensionality problem, which arises when the sample size is small relative to the number of features. It was applied to improve the performance of the Synthetic Minority Over-Sampling Technique for Regression with Gaussian Noise (SMOGN), an imbalance-removal algorithm. To test the new pipeline, a small imbalanced cohort of pediatric ICU patients was used to predict the number of Ventilator-Free Days (VFD) a patient admitted for respiratory illness may experience over a 28-day period.
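To illustrate the idea behind Stability Selection, the following is a minimal sketch and not the authors' implementation. It repeatedly draws random half-samples, runs a base feature selector on each, and keeps the features that are selected in a large fraction of rounds. The base selector here (top-k features by absolute Pearson correlation with the target) is an assumption chosen to keep the example dependency-free; published variants typically use a sparse model such as the Lasso. The function names, the toy data, and all parameter values (top_k, n_rounds, threshold) are hypothetical.

```python
import random
from statistics import mean


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def stability_selection(X, y, top_k=2, n_rounds=200, threshold=0.7, seed=0):
    """Return (stable feature indices, per-feature selection frequency)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_rounds):
        # Draw a random half-sample without replacement.
        idx = rng.sample(range(n), n // 2)
        y_sub = [y[i] for i in idx]
        # Base selector (assumed): rank features by |correlation| with the target.
        scores = [abs_correlation([X[i][j] for i in idx], y_sub) for j in range(p)]
        ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
        for j in ranked[:top_k]:
            counts[j] += 1
    freqs = [c / n_rounds for c in counts]
    # Keep only features chosen in at least `threshold` of the rounds.
    selected = [j for j, f in enumerate(freqs) if f >= threshold]
    return selected, freqs


# Toy data (hypothetical): 80 samples, 8 features; only features 0 and 3 drive y.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(8)] for _ in range(80)]
y = [3.0 * row[0] - 3.0 * row[3] + data_rng.gauss(0, 0.3) for row in X]

selected, freqs = stability_selection(X, y)
print("stable features:", selected)
print("selection frequencies:", [round(f, 2) for f in freqs])
```

Because each round selects exactly top_k features, the frequencies sum to top_k, so a high threshold limits how many features can survive. Reducing the feature set this way before oversampling means that any synthetic samples are generated in a lower-dimensional space.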
Results
By applying Stability Selection before SMOGN, our model overcame label imbalance and predicted almost all non-surviving patients in the test dataset. Our study also highlighted the Pediatric Risk of Mortality (PRISM) score as a powerful VFD predictor when combined with other clinical features.
Conclusion
This paper shows how a hybrid strategy of Stability Selection, SMOGN, and regression can improve outcomes on highly imbalanced datasets and reduce the probability of costly false-negative detections in severe acute respiratory disease syndrome cases. The proposed modeling pipeline can reduce the overall VFD regression error and can also be extended to other regression targets. We also showed the importance of PRISM as a strong VFD predictor.
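The oversampling step of such a hybrid strategy can be sketched as follows. This is a simplified, SMOGN-style illustration under stated assumptions, not the reference algorithm or the authors' code: rare cases are identified by a user-supplied predicate instead of a relevance function, Gaussian noise uses a fixed scale instead of a per-feature one, and the synthetic target is interpolated with the same fraction as the features. For each rare case, a neighbour that lies within a "safe" distance (assumed here to be half the median distance to its k nearest rare neighbours) triggers interpolation; otherwise the case is perturbed with Gaussian noise. The function name oversample_rare, the toy data, and all parameter values are hypothetical.

```python
import math
import random
from statistics import median


def oversample_rare(X, y, is_rare, n_new_per_case=2, k=3, noise_scale=0.05, seed=0):
    """Return (X, y) extended with synthetic rare cases (simplified SMOGN-style)."""
    rng = random.Random(seed)
    rare = [(xi, yi) for xi, yi in zip(X, y) if is_rare(yi)]
    new_X, new_y = [], []
    for xi, yi in rare:
        # k nearest rare neighbours by Euclidean distance, excluding the case itself.
        others = [(math.dist(xi, xj), xj, yj) for xj, yj in rare if xj is not xi]
        neighbours = sorted(others, key=lambda t: t[0])[:k]
        if not neighbours:
            continue
        # Assumed "safe" radius: half the median distance to the neighbours.
        safe_dist = median(d for d, _, _ in neighbours) / 2
        for _ in range(n_new_per_case):
            d, xj, yj = rng.choice(neighbours)
            if d < safe_dist:
                # Neighbour is close: interpolate between the two cases.
                t = rng.random()
                new_X.append([a + t * (b - a) for a, b in zip(xi, xj)])
                new_y.append(yi + t * (yj - yi))
            else:
                # Neighbour is far: perturb the seed case with Gaussian noise.
                new_X.append([a + rng.gauss(0, noise_scale) for a in xi])
                new_y.append(yi + rng.gauss(0, noise_scale))
    return X + new_X, y + new_y


# Toy data (hypothetical): 25 common cases with a high target, 5 rare low ones.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(30)]
y = [25.0] * 25 + [0.0, 1.0, 0.0, 2.0, 0.0]

X_bal, y_bal = oversample_rare(X, y, is_rare=lambda v: v <= 5.0)
print("samples before:", len(X), "after:", len(X_bal))
print("rare cases after:", sum(v <= 5.0 for v in y_bal))
```

In a full pipeline, this step would run only on the training split, after feature selection, so that no synthetic samples leak into the test data. For a bounded target such as VFD, synthetic values may also need to be clipped to the valid range.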
About the journal:
International Journal of Medical Informatics provides an international medium for dissemination of original results and interpretative reviews concerning the field of medical informatics. The Journal emphasizes the evaluation of systems in healthcare settings.
The scope of the journal covers:
Information systems, including national or international registration systems, hospital information systems, departmental and/or physician's office systems, document handling systems, electronic medical record systems, standardization, systems integration etc.;
Computer-aided medical decision support systems using heuristic, algorithmic and/or statistical methods as exemplified in decision theory, protocol development, artificial intelligence, etc.;
Educational computer-based programs pertaining to medical informatics or medicine in general;
Organizational, economic, social, clinical impact, ethical and cost-benefit aspects of IT applications in health care.