Evaluation of a Large Language Model on the American Academy of Pediatrics' PREP Emergency Medicine Question Bank.

IF 1.2 | Zone 4 (Medicine) | Q3 EMERGENCY MEDICINE | Pediatric Emergency Care | Pub Date: 2024-12-01 | DOI: 10.1097/PEC.0000000000003271

Sriram Ramgopal, Selina Varma, Jillian K Gorski, Kristen M Kester, Andrew Shieh, Srinivasan Suresh

Pediatric Emergency Care, vol. 40, no. 12, pp. 871-875. Journal Article. https://doi.org/10.1097/PEC.0000000000003271

Citations: 0

Abstract

Background: Large language models (LLMs), including ChatGPT (Chat Generative Pretrained Transformer), a popular, publicly available LLM, represent an important innovation in the application of artificial intelligence. These systems respond to user input on a wide range of topics by generating content from patterns identified in large text datasets. We sought to evaluate the performance of ChatGPT on practice test questions designed to assess knowledge competency for pediatric emergency medicine (PEM).

Methods: We evaluated the performance of ChatGPT on a popular question bank used to prepare for board certification in PEM, using questions published between 2022 and 2024. Clinicians assessed the performance of ChatGPT by inputting prompts and recording the software's responses, asking each question in 3 separate iterations. We calculated correct answer percentages (a question counted as correct if it was answered correctly in at least 2 of the 3 iterations) and assessed agreement between the iterations using Fleiss' κ.
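The scoring rule and agreement statistic described in the Methods can be illustrated with a minimal sketch. It assumes that each of the 3 iterations is treated as a "rater" and that the rated category is the answer letter chosen; the abstract does not specify this, and the toy answers and answer key below are hypothetical, not study data.

```python
from collections import Counter

def majority_correct(answers, correct_answer):
    """Return True if at least 2 of the 3 iterations match the answer key."""
    return sum(a == correct_answer for a in answers) >= 2

def fleiss_kappa(ratings):
    """Fleiss' kappa for items that were each rated the same number of times.

    ratings: list of lists; each inner list holds the category chosen
    in each iteration for one question. Undefined (division by zero)
    if every rating falls into a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()
    item_agreement = []
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Fraction of rater pairs that agree on this item
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))
    p_observed = sum(item_agreement) / n_items
    # Chance agreement from the overall share of each category
    p_expected = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical data: 4 questions, 3 iterations each
toy_answers = [["A", "A", "A"], ["B", "B", "C"], ["D", "D", "D"], ["C", "A", "B"]]
toy_key = ["A", "B", "A", "C"]

n_correct = sum(majority_correct(a, k) for a, k in zip(toy_answers, toy_key))
kappa = fleiss_kappa(toy_answers)
print(f"{n_correct}/{len(toy_key)} correct by majority rule, kappa = {kappa:.2f}")
```

Under the majority rule, a question answered consistently but wrongly (the third toy item) still contributes to agreement while counting as incorrect, which is why accuracy and κ are reported separately.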

Results: We included 215 questions over the 3 study years. ChatGPT responded correctly to 161 PREP EM questions over the 3 years (74.5%; 95% confidence interval, 68.5%-80.5%), and performance was similar within each study year (75.0%, 71.8%, and 77.8% for study years 2022, 2023, and 2024, respectively). Among correct responses, most were answered correctly on all 3 iterations (137/161, 85.1%). Performance varied by topic, with the highest scores in research and medical specialties and lower scores in procedures and toxicology. Fleiss' κ across the 3 iterations was 0.71, indicating substantial agreement.
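As a rough illustration of how a proportion and its 95% confidence interval can be computed from counts like those above, the sketch below uses the normal-approximation (Wald) interval. This is an assumption: the abstract does not state which interval method or rounding was used, so the output may differ slightly from the reported 74.5% (68.5%-80.5%).

```python
import math

def wald_ci(successes, n, z=1.96):
    """Proportion with a normal-approximation (Wald) confidence interval."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    # Clamp to [0, 1] since the approximation can overshoot near the bounds
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Counts taken from the Results: 161 correct of 215 questions
p, low, high = wald_ci(161, 215)
print(f"Correct: {p:.1%} (95% CI {low:.1%}-{high:.1%})")

# Share of correct responses that were correct on all 3 iterations
share_all_three = 137 / 161
print(f"Correct on all 3 iterations: {share_all_three:.1%}")
```

For proportions close to 0 or 1, or for small samples, a Wilson or exact interval is generally preferred over the Wald approximation.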

Conclusion: ChatGPT answered PEM questions correctly in approximately three-quarters of cases, exceeding the minimum of 65% recommended for passing by the question bank's publisher. Responses by ChatGPT included detailed explanations, suggesting potential use for medical education. We identified limitations in specific topics and in image interpretation. These results demonstrate opportunities for LLMs to enhance both the education and clinical practice of PEM.




Source journal

Pediatric Emergency Care (Medicine: Emergency Medicine)

CiteScore: 2.40

Self-citation rate: 14.30%

Articles published: 577

Review time: 3-6 weeks

Journal description: Pediatric Emergency Care® features clinically relevant original articles with an EM perspective on the care of acutely ill or injured children and adolescents. The journal is aimed at both the pediatrician who wants to know more about treating and being compensated for minor emergency cases and the emergency physician who must treat children or adolescents in more than one case in ten.