Comparing AI/ML approaches and classical regression for predictive modeling using large population health databases: Applications to COVID-19 case prediction

Global Epidemiology. Pub Date: 2024-12-01. Epub Date: 2024-10-04. DOI: 10.1016/j.gloepi.2024.100168
Lise M. Bjerre, Cayden Peixoto, Rawan Alkurd, Robert Talarico, Rami Abielmona
Volume 8, Article 100168. Journal article. https://www.sciencedirect.com/science/article/pii/S2590113324000348

Abstract

Background

Research comparing artificial intelligence and machine learning (AI/ML) methods with classical statistical methods applied to large population health databases is limited.

Objectives

This retrospective cohort study aimed to compare the predictive performance of AI/ML algorithms against conventional multivariate logistic regression models using linked health administrative data.

Methods

Using Ontario's population health databases, we created a cohort of residents of the city of Ottawa, Ontario, who underwent a PCR test for COVID-19 between March 10, 2020, and May 13, 2021. Using demographic, socio-economic and health data (including COVID-19 PCR test results and available symptom data), we developed predictive models for COVID-19 case identification using the following approaches: classical multivariate logistic regression (LR); deep neural network (DNN); random forest (RF); and gradient boosting trees (GBT). Model performance was compared using swarm plots of the area under the curve (AUC) from 10-fold cross-validation.
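As a rough illustration of the evaluation described in the Methods, the sketch below computes an AUC from predicted scores and builds 10 cross-validation folds using only the Python standard library. The abstract does not state the software used or whether folds were stratified by outcome, so the stratification, the function names `auc` and `stratified_folds`, and the random seed are assumptions made for illustration, not the authors' implementation.

```python
import random
from collections import defaultdict


def auc(labels, scores):
    """AUC as the probability that a random positive case is scored
    higher than a random negative case (ties count as one half)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores and give each the average rank.
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    n_pos = sum(1 for y in labels if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined unless both classes are present")

    # Mann-Whitney U statistic for the positives, scaled to [0, 1].
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def stratified_folds(labels, k=10, seed=0):
    """Split row indices into k folds with a similar class mix in each.
    Stratification is an assumption here, not taken from the paper."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets an even share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds
```

In use, each fold would serve once as the held-out set: a model is fitted on the other nine folds, its predicted probabilities on the held-out fold are passed to `auc`, and the ten resulting values are what a swarm plot would display per model.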

Results

The cohort consisted of n = 351,248 Ottawa residents tested for COVID-19 during the study period. Among them, a total of n = 883,879 unique COVID-19 tests were performed (2.6% positive test results). Inclusion of COVID-19 symptoms data in the analysis improved model performance and variable predictive value across all tested models (p < 0.0001), with the 10-fold cross-validation AUC increasing to near or over 0.7 in all models when symptoms data were included. In pairwise comparisons, the GBT method had the highest predictive ability (AUC = 0.796 ± 0.017), significantly outperforming multivariate logistic regression and the other AI/ML approaches.
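A figure such as "AUC = 0.796 ± 0.017" is commonly a mean and standard deviation over the cross-validation folds, although the abstract does not say so explicitly. The sketch below shows that summary together with one simple way to compare two models fold by fold. The paired t statistic is an assumption chosen for illustration, since the abstract does not name the test used for the pairwise comparisons, and the fold values in the usage lines are made-up numbers, not results from the study.

```python
import math
import statistics


def summarize(fold_aucs):
    """Return (mean, sample standard deviation) of per-fold AUCs."""
    return statistics.mean(fold_aucs), statistics.stdev(fold_aucs)


def paired_t_statistic(aucs_a, aucs_b):
    """t statistic for the mean per-fold difference between two models
    evaluated on the same folds (illustrative choice of test)."""
    if len(aucs_a) != len(aucs_b):
        raise ValueError("both models must be scored on the same folds")
    diffs = [a - b for a, b in zip(aucs_a, aucs_b)]
    sd = statistics.stdev(diffs)
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # Hypothetical per-fold AUCs for two models (NOT study data).
    model_a = [0.80, 0.78, 0.81, 0.79, 0.77, 0.82, 0.80, 0.78, 0.79, 0.81]
    model_b = [0.74, 0.73, 0.76, 0.72, 0.71, 0.77, 0.75, 0.73, 0.72, 0.75]
    mean_a, sd_a = summarize(model_a)
    print(f"model A: AUC = {mean_a:.3f} +/- {sd_a:.3f}")
    print(f"paired t statistic: {paired_t_statistic(model_a, model_b):.2f}")
```

Pairing by fold matters because both models are scored on the same held-out data, so fold-to-fold difficulty cancels out of the differences.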

Conclusions

Conventional multivariate regression-based models performed better than some machine learning algorithms and worse than others, while still providing good predictive accuracy in a moderately sized dataset with a reasonable number of features. However, whenever possible, the AI/ML GBT approach should be considered.