Comparing AI/ML approaches and classical regression for predictive modeling using large population health databases: Applications to COVID-19 case prediction
Lise M. Bjerre, Cayden Peixoto, Rawan Alkurd, Robert Talarico, Rami Abielmona
Global Epidemiology, Volume 8, Article 100168. Published 2024-10-04. DOI: 10.1016/j.gloepi.2024.100168
https://www.sciencedirect.com/science/article/pii/S2590113324000348
Abstract
Background
Research comparing artificial intelligence and machine learning (AI/ML) methods with classical statistical methods applied to large population health databases is limited.
Objectives
This retrospective cohort study aimed to compare the predictive performance of AI/ML algorithms against conventional multivariate logistic regression models using linked health administrative data.
Methods
Using Ontario's population health databases, we created a cohort of residents of the city of Ottawa, Ontario, who underwent a PCR test for COVID-19 between March 10, 2020, and May 13, 2021. Using demographic, socio-economic and health data (including COVID-19 PCR test results and available symptom data), we developed predictive models for COVID-19 case identification using four approaches: classical multivariate logistic regression (LR), deep neural network (DNN), random forest (RF), and gradient boosting trees (GBT). Model performance was compared using the area under the curve (AUC) from 10-fold cross-validation, visualized as swarm plots.
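The comparison described in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it uses scikit-learn, a synthetic dataset from `make_classification` in place of the linked health data, default or arbitrary hyperparameters, a small multilayer perceptron as a stand-in for the DNN, and ROC AUC as the AUC variant (the abstract does not specify which curve was used).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cohort. Sample size, feature count and class
# balance are arbitrary assumptions, not values from the study.
X, y = make_classification(
    n_samples=2000, n_features=12, n_informative=6,
    weights=[0.9, 0.1], random_state=0,
)

# One estimator per approach named in the Methods. Hyperparameters are
# illustrative; the MLP is only a small stand-in for a deep neural network.
models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "DNN": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
    ),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "GBT": GradientBoostingClassifier(random_state=0),
}

# The same 10 stratified folds are reused for every model so that the
# per-fold AUCs are directly comparable across approaches.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

fold_aucs = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    for name, model in models.items()
}

for name, aucs in fold_aucs.items():
    print(f"{name}: AUC = {aucs.mean():.3f} ± {aucs.std(ddof=1):.3f}")
```

Each array in `fold_aucs` holds ten values, one per fold, which is the kind of data a swarm plot would display. Because residents in the cohort could contribute more than one test, a grouped split (for example scikit-learn's `GroupKFold` keyed on a resident identifier) may be worth considering; the abstract does not say how the folds were formed.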
Results
The cohort consisted of n = 351,248 Ottawa residents tested for COVID-19 during the study period. These residents underwent a total of n = 883,879 unique COVID-19 tests (2.6% positive test results). Inclusion of COVID-19 symptoms data in the analysis improved model performance and variable predictive value across all tested models (p < 0.0001), with the 10-fold cross-validation AUC increasing to near or over 0.7 in all models when symptoms data were included. In pairwise comparisons, the GBT method had the highest predictive ability (AUC = 0.796 ± 0.017), significantly outperforming multivariate logistic regression and the other AI/ML approaches.
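A summary of the form "AUC = mean ± spread" and a pairwise comparison between two models can be illustrated from per-fold AUCs. The sketch below is an assumption-laden example: the abstract does not state which statistical test was used or whether the reported spread is a standard deviation, so an exact paired sign-flip permutation test and the sample standard deviation are used here purely for illustration. The per-fold values are invented toy numbers, not study results, and the function names are hypothetical.

```python
from itertools import product
from statistics import mean, stdev


def summarize(aucs):
    """Return (mean, sample standard deviation) of per-fold AUCs."""
    return mean(aucs), stdev(aucs)


def paired_sign_flip_test(a, b):
    """Exact two-sided paired permutation test on fold-wise differences.

    Under the null hypothesis the sign of each difference is arbitrary,
    so every one of the 2**k sign patterns is enumerated (1024 for k = 10).
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        permuted = abs(mean([s * d for s, d in zip(signs, diffs)]))
        # Small tolerance guards against floating-point round-off.
        if permuted >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical per-fold AUCs for two models evaluated on the same 10 folds.
# These numbers are invented for illustration only.
gbt_aucs = [0.80, 0.78, 0.81, 0.79, 0.77, 0.82, 0.80, 0.79, 0.81, 0.78]
lr_aucs = [0.76, 0.75, 0.78, 0.75, 0.74, 0.79, 0.77, 0.75, 0.78, 0.74]

gbt_mean, gbt_sd = summarize(gbt_aucs)
lr_mean, lr_sd = summarize(lr_aucs)
p_value = paired_sign_flip_test(gbt_aucs, lr_aucs)

print(f"GBT: {gbt_mean:.3f} ± {gbt_sd:.3f}")
print(f"LR:  {lr_mean:.3f} ± {lr_sd:.3f}")
print(f"paired sign-flip p-value: {p_value:.4f}")
```

Pairing by fold matters because both models are scored on identical held-out data. Fold scores also share most of their training data, so p-values from any such test are best read as approximate.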
Conclusions
In a moderately sized dataset with a reasonable number of features, conventional multivariate regression-based models provide better predictive accuracy than some machine learning algorithms and worse than others. However, whenever possible, the AI/ML GBT approach should be considered.