Evaluation of a Large Language Model on the American Academy of Pediatrics' PREP Emergency Medicine Question Bank
Sriram Ramgopal, Selina Varma, Jillian K Gorski, Kristen M Kester, Andrew Shieh, Srinivasan Suresh
Pediatric Emergency Care, vol. 40, no. 12, pp. 871-875. Published 2024-12-01. DOI: 10.1097/PEC.0000000000003271 (https://doi.org/10.1097/PEC.0000000000003271)
Abstract
Background: Large language models (LLMs), including ChatGPT (Chat Generative Pretrained Transformer), a popular, publicly available LLM, represent an important innovation in the application of artificial intelligence. In response to user input, these systems generate relevant content across a wide range of topics by drawing on patterns identified in large text datasets. We sought to evaluate the performance of ChatGPT on practice test questions designed to assess knowledge competency in pediatric emergency medicine (PEM).
Methods: We evaluated the performance of ChatGPT on a popular question bank used to prepare for PEM board certification, using questions published between 2022 and 2024. Clinicians assessed ChatGPT by entering each question as a prompt and recording the software's response, asking each question in 3 separate iterations. We calculated the percentage of correct answers (a question counted as correct if it was answered correctly in at least 2 of 3 iterations) and assessed agreement between the iterations using Fleiss' κ.
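The scoring rule and agreement statistic described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names and the toy data are hypothetical, and it assumes that Fleiss' κ is computed over the answer choice given in each of the 3 iterations, which the abstract does not specify.

```python
from collections import Counter


def majority_correct(responses, answer_key):
    """True if at least 2 of the 3 recorded responses match the answer key."""
    return sum(resp == answer_key for resp in responses) >= 2


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items.

    Each item is a list of category labels, one per iteration (rater).
    Every item must have the same number of labels.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Share of all assignments that fall in each category.
    p_j = [
        sum(c[cat] for c in counts) / (n_items * n_raters)
        for cat in categories
    ]
    # Observed agreement for each item.
    p_i = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_i) / n_items      # mean observed agreement
    p_e = sum(p * p for p in p_j)   # agreement expected by chance

    if p_e == 1.0:
        # Only one category was ever used; kappa is undefined.
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical toy data: 4 questions, 3 iterations each.
toy_ratings = [
    ["A", "A", "A"],
    ["B", "B", "B"],
    ["A", "A", "B"],
    ["C", "C", "C"],
]
toy_key = ["A", "B", "A", "D"]

n_correct = sum(majority_correct(r, k) for r, k in zip(toy_ratings, toy_key))
print(n_correct, "of", len(toy_key), "questions scored correct")
print("Fleiss' kappa:", round(fleiss_kappa(toy_ratings), 3))
```

Under the commonly used Landis and Koch benchmarks, κ values between 0.61 and 0.80 are described as substantial agreement.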
Results: We included 215 questions over the 3 study years. ChatGPT responded correctly to 161 of the PREP EM questions over the 3 years (74.5%; 95% confidence interval, 68.5%-80.5%), and performance was similar within each study year (75.0%, 71.8%, and 77.8% for study years 2022, 2023, and 2024, respectively). Among correct responses, most were answered correctly on all 3 iterations (137/161, 85.1%). Performance varied by topic, with the highest scores in research and medical specialties and lower scores in procedures and toxicology. Fleiss' κ across the 3 iterations was 0.71, indicating substantial agreement.
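As a rough illustration of how a proportion and its 95% confidence interval can be computed, the sketch below uses the normal-approximation (Wald) interval. The abstract does not state which interval method the authors used, so the choice of method and the function name `wald_interval` are assumptions; the example input is the 137/161 proportion reported above.

```python
import math


def wald_interval(successes, total, z=1.96):
    """Proportion with a normal-approximation (Wald) confidence interval.

    Returns (proportion, lower bound, upper bound), clipped to [0, 1].
    """
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Share of correct responses that were correct on all 3 iterations.
p, lower, upper = wald_interval(137, 161)
print(f"{p:.1%} (95% CI, {lower:.1%}-{upper:.1%})")
```

Other interval methods, such as the Wilson score interval, give slightly different bounds, particularly for proportions near 0 or 1.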
Conclusion: ChatGPT answered approximately three-quarters of the PEM questions correctly, exceeding the 65% minimum that the question publisher recommends for passing. Responses by ChatGPT included detailed explanations, suggesting potential use in medical education. We identified limitations in specific topics and in image interpretation. These results demonstrate opportunities for LLMs to enhance both the education and clinical practice of PEM.
About the journal:
Pediatric Emergency Care® features clinically relevant original articles with an EM perspective on the care of acutely ill or injured children and adolescents. The journal is aimed at both the pediatrician who wants to know more about treating, and being compensated for, minor emergency cases and the emergency physician who must treat children or adolescents in more than the occasional case.