Evaluation of Generative Artificial Intelligence Models in Predicting Pediatric Emergency Severity Index Levels

Brandon Ho, Meng Lu, Xuan Wang, Russell Butler, Joshua Park, Dennis Ren

Pediatric Emergency Care, 41(4): 251-255. Epub 2025-01-07; published 2025-04-01. DOI: 10.1097/PEC.0000000000003315
Abstract
Objective: To evaluate the accuracy and reliability of several generative artificial intelligence (AI) models (ChatGPT-3.5, ChatGPT-4.0, T5, Llama-2, Mistral-Large, and Claude-3 Opus) in predicting Emergency Severity Index (ESI) levels for pediatric emergency department patients, and to assess the impact of medically oriented fine-tuning.
Methods: Seventy pediatric clinical vignettes from the ESI Handbook, version 4, served as the gold standard. Each AI model predicted the ESI level for every vignette. Performance metrics, including sensitivity, specificity, and F1 score, were calculated. Reliability was assessed by repeating the tests and measuring interrater agreement with the Fleiss kappa. Paired t tests compared each model's performance before and after fine-tuning.
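The abstract does not include the authors' analysis code. As a rough illustration of the metrics named above (sensitivity, specificity, and F1 score across the five ESI levels, Fleiss kappa across repeated runs, and a paired t test), here is a minimal Python sketch assuming scikit-learn, statsmodels, SciPy, and NumPy. The toy gold/pred arrays, the macro-averaging choice, and the per-vignette correctness scores fed to the t test are hypothetical stand-ins, not the authors' actual pipeline.

```python
# Hypothetical sketch of the evaluation described in the Methods.
# Requires numpy, scikit-learn, statsmodels, and scipy.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, recall_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
from scipy.stats import ttest_rel

ESI_LEVELS = [1, 2, 3, 4, 5]

def macro_specificity(y_true, y_pred, labels=ESI_LEVELS):
    """Macro-averaged specificity: mean over ESI levels of TN / (TN + FP)."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    specs = []
    for i in range(len(labels)):
        tp = cm[i, i]
        fp = cm[:, i].sum() - tp
        fn = cm[i, :].sum() - tp
        tn = cm.sum() - tp - fp - fn
        specs.append(tn / (tn + fp))
    return float(np.mean(specs))

# Toy stand-ins for the 70 gold-standard ESI levels (from the ESI Handbook
# v4 vignettes in the study) and one model's predictions.
rng = np.random.default_rng(0)
gold = rng.integers(1, 6, size=70)
pred = np.where(rng.random(70) < 0.8, gold, rng.integers(1, 6, size=70))

print("sensitivity:", recall_score(gold, pred, average="macro", zero_division=0))
print("specificity:", macro_specificity(gold, pred))
print("F1 score:  ", f1_score(gold, pred, average="macro", zero_division=0))

# Reliability: repeat the test and treat each run as one "rater" per vignette.
runs = np.stack(
    [np.where(rng.random(70) < 0.9, pred, rng.integers(1, 6, size=70))
     for _ in range(3)],
    axis=1,
)  # shape: (70 vignettes, 3 runs)
counts, _ = aggregate_raters(runs)  # vignette-by-category count table
print("Fleiss kappa:", fleiss_kappa(counts))

# Paired t test on per-vignette correctness, base model vs. fine-tuned model.
pred_tuned = np.where(rng.random(70) < 0.9, gold, pred)  # hypothetical tuned model
t_stat, p_val = ttest_rel((pred == gold).astype(float),
                          (pred_tuned == gold).astype(float))
print(f"paired t test: t={t_stat:.2f}, p={p_val:.4f}")
```

Treating each repeated run as an independent "rater" is one natural way to turn the test-retest repeats the Methods describe into the subject-by-category count table that the Fleiss kappa expects.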
Results: Claude-3 Opus achieved the highest performance among the untrained models, with a sensitivity of 80.6% (95% confidence interval [CI]: 63.6-90.7), a specificity of 91.3% (95% CI: 83.8-99.0), and an F1 score of 73.9% (95% CI: 58.9-90.7). After fine-tuning, the GPT-4.0 model showed statistically significant improvement, with a sensitivity of 77.1% (95% CI: 60.1-86.5), a specificity of 92.5% (95% CI: 89.5-97.4), and an F1 score of 74.6% (95% CI: 63.9-83.8; P < 0.04). Reliability analysis revealed high agreement for Claude-3 Opus (Fleiss κ: 0.85), followed by Mistral-Large (Fleiss κ: 0.79) and the fine-tuned GPT-4.0 (Fleiss κ: 0.67). Fine-tuning improved the reliability of the GPT models (P < 0.001).
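The abstract reports 95% CIs for each metric but does not state how they were derived. One common, assumption-light option with only 70 vignettes is a nonparametric bootstrap over cases; the sketch below (reusing the hypothetical gold and pred arrays from the previous snippet) illustrates that approach and is not necessarily the authors' method.

```python
# Hypothetical percentile-bootstrap 95% CI for macro sensitivity over the
# 70 vignettes; the abstract does not say which CI method the authors used.
import numpy as np
from sklearn.metrics import recall_score

def bootstrap_ci(y_true, y_pred, n_boot=5000, alpha=0.05, seed=1):
    """Percentile bootstrap CI, resampling vignettes with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats.append(recall_score(y_true[idx], y_pred[idx],
                                  average="macro", zero_division=0))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Reusing the toy arrays from the previous sketch:
# lo, hi = bootstrap_ci(gold, pred)
# print(f"macro sensitivity 95% CI: [{lo:.3f}, {hi:.3f}]")
```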
Conclusions: Generative AI models demonstrate promising accuracy in predicting pediatric ESI levels, with fine-tuning significantly enhancing their performance and reliability. These findings suggest that AI could serve as a valuable tool in pediatric triage.
Journal Introduction:
Pediatric Emergency Care® features clinically relevant original articles with an emergency medicine perspective on the care of acutely ill or injured children and adolescents. The journal is aimed both at pediatricians who want to know more about treating (and being compensated for) minor emergency cases and at emergency physicians who must often treat children or adolescents in their practice.