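Comparing AI/ML approaches and classical regression for predictive modeling using large population health databases: Applications to COVID-19 case prediction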

Global Epidemiology. Pub Date: 2024-10-04. DOI: 10.1016/j.gloepi.2024.100168

Lise M. Bjerre, Cayden Peixoto, Rawan Alkurd, Robert Talarico, Rami Abielmona

Volume 8, Article 100168. Publisher page: https://www.sciencedirect.com/science/article/pii/S2590113324000348


Abstract




Background

Research comparing artificial intelligence and machine learning (AI/ML) methods with classical statistical methods applied to large population health databases is limited.

Objectives

This retrospective cohort study aimed to compare the predictive performance of AI/ML algorithms against conventional multivariate logistic regression models using linked health administrative data.

Methods

Using Ontario's population health databases, we created a cohort of residents of the city of Ottawa, Ontario, who underwent a PCR test for COVID-19 between March 10, 2020, and May 13, 2021. Using demographic, socio-economic and health data (including COVID-19 PCR test results and, where available, symptom data), we developed predictive models to identify COVID-19 cases using the following approaches: classical multivariate logistic regression (LR); deep neural network (DNN); random forest (RF); and gradient boosting trees (GBT). Model performance was compared using swarm plots of the area under the curve (AUC) from 10-fold cross-validation.
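The evaluation scheme described here, per-fold AUC from 10-fold cross-validation, can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the authors' pipeline: the data are synthetic, the two features (`symptom`, `risk`) and all function names are hypothetical, the folds are stratified by outcome (the abstract does not say whether stratification was used), and only a plain gradient-descent logistic regression is fitted. The DNN, RF and GBT models would need a machine learning library and are omitted.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) formula; tied scores share an average rank."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    n_pos = sum(labels)
    n_neg = n - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def linear_score(w, x):
    # w[-1] is the intercept; zip stops at len(x), so it is not multiplied by a feature.
    return w[-1] + sum(wj * xj for wj, xj in zip(w, x))


def fit_logistic(X, y, lr=0.5, epochs=100):
    """Batch gradient descent on the mean logistic loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(linear_score(w, xi)) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w


def stratified_folds(labels, k, seed=0):
    """Deal the shuffled indices of each class round-robin into k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def cross_validated_auc(X, y, k=10, seed=0):
    """Train on k-1 folds, score the held-out fold, return one AUC per fold."""
    aucs = []
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        w = fit_logistic([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores = [linear_score(w, X[i]) for i in test_idx]  # AUC only needs the ranking
        aucs.append(auc([y[i] for i in test_idx], scores))
    return aucs


def make_synthetic(n=500, seed=0):
    """Synthetic stand-in data; the features and coefficients are made up."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        symptom = 1.0 if rng.random() < 0.3 else 0.0
        risk = rng.gauss(0.0, 1.0)
        p = sigmoid(-1.5 + 2.0 * symptom + 0.8 * risk)
        X.append([symptom, risk])
        y.append(1 if rng.random() < p else 0)
    return X, y


X, y = make_synthetic()
fold_aucs = cross_validated_auc(X, y)
print([round(a, 3) for a in fold_aucs])
```

Each entry of `fold_aucs` corresponds to one point in a per-model swarm plot; repeating the loop with a different model on the same folds gives the paired values needed for model-to-model comparison.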

Results

The cohort consisted of n = 351,248 Ottawa residents tested for COVID-19 during the study period. Among them, a total of n = 883,879 unique COVID-19 tests were performed (2.6% positive test results). Including COVID-19 symptom data in the analysis improved model performance and variable predictive value across all tested models (p < 0.0001), with the 10-fold cross-validation AUC increasing to near or above 0.7 in all models. In pairwise comparisons, the GBT method had the highest predictive ability (AUC = 0.796 ± 0.017), significantly outperforming multivariate logistic regression and the other AI/ML approaches.
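A pairwise comparison of two models evaluated on the same cross-validation folds can be sketched as follows. The abstract does not state which statistical test was used or whether "±" denotes a standard deviation, so the paired t statistic and the mean ± SD summary below are assumptions, and the per-fold AUC values are hypothetical placeholders rather than results from the study.

```python
import math
import statistics


def summarize(fold_aucs):
    """Return (mean, sample standard deviation) of per-fold AUCs."""
    return statistics.mean(fold_aucs), statistics.stdev(fold_aucs)


def paired_t(a, b):
    """Paired t statistic for the per-fold differences a[i] - b[i] (df = len(a) - 1)."""
    diffs = [x - y for x, y in zip(a, b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


# Hypothetical per-fold AUCs for two models scored on the same 10 folds.
# Placeholder values for illustration only; they are not taken from the paper.
model_a = [0.71, 0.69, 0.72, 0.70, 0.68, 0.73, 0.71, 0.69, 0.70, 0.72]
model_b = [0.67, 0.66, 0.69, 0.67, 0.65, 0.69, 0.68, 0.66, 0.66, 0.69]

mean_a, sd_a = summarize(model_a)
mean_b, sd_b = summarize(model_b)
t_stat = paired_t(model_a, model_b)

print(f"model A: {mean_a:.3f} ± {sd_a:.3f}")
print(f"model B: {mean_b:.3f} ± {sd_b:.3f}")
print(f"paired t statistic (df = {len(model_a) - 1}): {t_stat:.2f}")
```

Pairing by fold removes fold-to-fold difficulty from the comparison. Because the training sets of different folds overlap, the per-fold differences are not fully independent, so a p-value derived from this statistic should be read as approximate.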

Conclusions

Conventional multivariate regression-based models are better than some machine learning algorithms and worse than others at providing good predictive accuracy in a moderately sized dataset with a reasonable number of features. However, whenever possible, the AI/ML GBT approach should be considered.


Source journal