Machine learning in public health informatics: Evidence that complex sampling structures may not be needed for prediction models with imbalanced outcomes
Authors: Zhengye Si, Jinpu Li, Emily Leary Ph.D.
Journal: Annals of Epidemiology, volume 102, pages 75-80 (published 2025-02-01)
DOI: 10.1016/j.annepidem.2024.12.016
URL: https://www.sciencedirect.com/science/article/pii/S1047279724002886
Citation count: 0
Abstract
Purpose
The objective of this study is to investigate the predictive ability of machine learning models for imbalanced outcomes from national survey data without the use of sampling weights.
Methods
We evaluated the predictive performance of machine learning models on imbalanced outcomes from the US National Health and Nutrition Examination Survey (USNHANES) without using sampling weights. Four machine learning models (support vector machine, random forest, least absolute shrinkage and selection operator regression, and deep neural network) were compared with a logistic model that incorporates the survey's complex sampling design. Three resampling methods (oversampling, undersampling, and combined) were used to address class imbalance during the model training process.
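The three resampling strategies named above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `resample`, the toy data, and in particular the definition of "combined" (moving both classes to the midpoint of the two class counts) are assumptions made here for illustration, since the abstract does not specify them.

```python
import random
from collections import Counter


def resample(rows, labels, method="over", seed=0):
    """Randomly rebalance a binary-labelled training set.

    method:
      "over"     duplicates minority rows up to the majority count
      "under"    drops majority rows down to the minority count
      "combined" moves both classes to the midpoint of the two counts
                 (an assumed definition; the abstract does not give one)
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    (maj, n_maj), (mino, n_min) = counts.most_common(2)
    mid = (n_maj + n_min) // 2
    targets = {
        "over": {maj: n_maj, mino: n_maj},
        "under": {maj: n_min, mino: n_min},
        "combined": {maj: mid, mino: mid},
    }[method]

    chosen = []
    for cls, target in targets.items():
        idx = [i for i, y in enumerate(labels) if y == cls]
        if target <= len(idx):
            # Shrink (or keep) the class: sample without replacement.
            chosen.extend(rng.sample(idx, target))
        else:
            # Grow the class: keep every row, then add random duplicates.
            chosen.extend(idx + rng.choices(idx, k=target - len(idx)))
    rng.shuffle(chosen)
    return [rows[i] for i in chosen], [labels[i] for i in chosen]


# Toy training set: 10 positives, 90 negatives (hypothetical data).
X = [[i] for i in range(100)]
y = [1 if i < 10 else 0 for i in range(100)]

for m in ("over", "under", "combined"):
    _, y_bal = resample(X, y, method=m)
    print(m, dict(Counter(y_bal)))
```

Consistent with the abstract's wording that resampling was used during model training, a sketch like this would be applied to the training split only, leaving the evaluation data at its original class ratio.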
Results
For all models, balanced accuracy was similar (ranging from 0.72 to 0.76), and specificity was lower than sensitivity except for random forest. The support vector machine and neural networks had the highest sensitivity (ranging from 0.79 to 0.83), while the random forest had the highest specificity (ranging from 0.86 to 0.96), with one exception. PR-AUC scores and Brier scores were low, ranging from 0.2529 to 0.3313 (lower scores are worse) and from 0.1005 to 0.3245 (lower scores are better), respectively.
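To make the reported metrics concrete, the following sketch computes sensitivity, specificity, balanced accuracy, the Brier score, and a PR-AUC estimate on a small hypothetical example. The function names, the toy labels and probabilities, and the 0.5 decision threshold are all assumptions. PR-AUC is estimated here as average precision, which is one common definition and may differ from the estimator used in the paper.

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity and balanced accuracy for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def average_precision(y_true, y_prob):
    """Average precision: mean of precision at the rank of each positive.

    Tied scores are ranked in input order, so results can depend on
    ordering when positives and negatives share a score.
    """
    order = sorted(range(len(y_true)), key=lambda i: y_prob[i], reverse=True)
    tp = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            total += tp / rank
    return total / tp


# Hypothetical imbalanced example: 2 positives among 10 cases.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05, 0.05]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 threshold

metrics = confusion_metrics(y_true, y_pred)
metrics["brier"] = brier_score(y_true, y_prob)
metrics["pr_auc"] = average_precision(y_true, y_prob)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Balanced accuracy averages sensitivity and specificity, so it is not inflated by the majority class the way plain accuracy can be. A PR-AUC value is usually read against the outcome prevalence, which is the expected score of a classifier that ranks cases at random.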
Conclusions
For predicting osteoarthritis occurrence with USNHANES data, the machine learning models had predictive capacity broadly similar to that of the recommended methods that integrate the complex sampling design.
About the journal:
The journal emphasizes the application of epidemiologic methods to issues that affect the distribution and determinants of human illness in diverse contexts. Its primary focus is on chronic and acute conditions of diverse etiologies and of major importance to clinical medicine, public health, and health care delivery.