Carine Savalli, Roberta Moreira Wichmann, Fabiano Barcellos Filho, Fernando Timoteo Fernandes, Alexandre Dias Porto Chiavegatto Filho
{"title":"Multicenter comparative analysis of local and aggregated data training strategies in COVID-19 outcome prediction with Machine learning.","authors":"Carine Savalli, Roberta Moreira Wichmann, Fabiano Barcellos Filho, Fernando Timoteo Fernandes, Alexandre Dias Porto Chiavegatto Filho","doi":"10.1371/journal.pdig.0000699","DOIUrl":null,"url":null,"abstract":"<p><p>Machine learning (ML) is a promising tool in assisting clinical decision-making for improving diagnosis and prognosis, especially in developing regions. It is often used with large samples, aggregating data from different regions and hospitals. However, it is unclear how this affects predictions in local centers. This study aims to compare data aggregation strategies of several hospitals in Brazil with a local training strategy in each hospital to predict two COVID-19 outcomes: Intensive Care Unit admission (ICU) and mechanical ventilation use (MV). The study included 6,046 patients from 14 hospitals, with local sample sizes ranging from 47 to 1500 patients. Machine learning models were trained using extreme gradient boosting, lightGBM, and catboost for structured data. Seven data aggregation strategies based on hospital geographic regions were compared with local training, and the best strategy was determined by analyzing the area under the ROC curve (AUROC). SHAP (Shapley Additive exPlanations) values were used to assess the contribution of variables to predictions. Additionally, a metafeatures analysis examined how hospital characteristics influence the selection of the best strategy. The study found that the local training strategy was the most effective approach, in the case of ICU outcomes, for 11 of the 14 hospitals (79%), and, in the case of MV, for 10 hospitals (71%). Metafeatures analysis suggested that hospitals with smaller sample sizes generally performed better using an aggregated data strategy compared to local training. Our study brings to light an important concern about the impact of grouping data from different hospitals in predictive machine learning models. These findings contribute to the ongoing debate about the trade-off between increasing sample size and bringing together heterogeneous scenarios.</p>","PeriodicalId":74465,"journal":{"name":"PLOS digital health","volume":"3 12","pages":"e0000699"},"PeriodicalIF":0.0000,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11670925/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"PLOS digital health","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1371/journal.pdig.0000699","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/12/1 0:00:00","PubModel":"eCollection","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Machine learning (ML) is a promising tool in assisting clinical decision-making for improving diagnosis and prognosis, especially in developing regions. It is often used with large samples, aggregating data from different regions and hospitals. However, it is unclear how this affects predictions in local centers. This study aims to compare data aggregation strategies of several hospitals in Brazil with a local training strategy in each hospital to predict two COVID-19 outcomes: Intensive Care Unit admission (ICU) and mechanical ventilation use (MV). The study included 6,046 patients from 14 hospitals, with local sample sizes ranging from 47 to 1500 patients. Machine learning models were trained using extreme gradient boosting, lightGBM, and catboost for structured data. Seven data aggregation strategies based on hospital geographic regions were compared with local training, and the best strategy was determined by analyzing the area under the ROC curve (AUROC). SHAP (Shapley Additive exPlanations) values were used to assess the contribution of variables to predictions. Additionally, a metafeatures analysis examined how hospital characteristics influence the selection of the best strategy. The study found that the local training strategy was the most effective approach, in the case of ICU outcomes, for 11 of the 14 hospitals (79%), and, in the case of MV, for 10 hospitals (71%). Metafeatures analysis suggested that hospitals with smaller sample sizes generally performed better using an aggregated data strategy compared to local training. Our study brings to light an important concern about the impact of grouping data from different hospitals in predictive machine learning models. These findings contribute to the ongoing debate about the trade-off between increasing sample size and bringing together heterogeneous scenarios.