Artificial Intelligence in Orthopaedics: Performance of ChatGPT on Text and Image Questions on a Complete AAOS Orthopaedic In-Training Examination (OITE)

IF 2.6 | CAS Region 3 (Medicine) | Q1 EDUCATION, SCIENTIFIC DISCIPLINES | Journal of Surgical Education | Pub Date: 2024-09-14 | DOI: 10.1016/j.jsurg.2024.08.002
{"title":"Artificial Intelligence in Orthopaedics: Performance of ChatGPT on Text and Image Questions on a Complete AAOS Orthopaedic In-Training Examination (OITE)","authors":"","doi":"10.1016/j.jsurg.2024.08.002","DOIUrl":null,"url":null,"abstract":"<div><h3>OBJECTIVE</h3><p>Artificial intelligence (AI) is capable of answering complex medical examination questions, offering the potential to revolutionize medical education and healthcare delivery. In this study we aimed to assess ChatGPT, a model that has demonstrated exceptional performance on standardized exams. Specifically, our focus was on evaluating ChatGPT's performance on the complete 2019 Orthopaedic In-Training Examination (OITE), including questions with an image component. Furthermore, we explored difference in performance when questions varied by text only or text with an associated image, including whether the image was described using AI or a trained orthopaedist.</p></div><div><h3>DESIGN And SETTING</h3><p>Questions from the 2019 OITE were input into ChatGPT version 4.0 (GPT-4) using 3 response variants. As the capacity to input or interpret images is not publicly available in ChatGPT at the time of this study, questions with an image component were described and added to the OITE question using descriptions generated by Microsoft Azure AI Vision Studio or authors of the study.</p></div><div><h3>RESULTS</h3><p>ChatGPT performed equally on OITE questions with or without imaging components, with an average correct answer choice of 49% and 48% across all 3 input methods. Performance dropped by 6% when using image descriptions generated by AI. When using single answer multiple-choice input methods, ChatGPT performed nearly double the rate of random guessing, answering 49% of questions correctly. The performance of ChatGPT was worse than all resident classes on the 2019 exam, scoring 4% lower than PGY-1 residents.</p></div><div><h3>DISCUSSION</h3><p>ChatGT performed below all resident classes on the 2019 OITE. Performance on text only questions and questions with images was nearly equal if the image was described by a trained orthopaedic specialist but decreased when using an AI generated description. Recognizing the performance abilities of AI software may provide insight into the current and future applications of this technology into medical education.</p></div>","PeriodicalId":50033,"journal":{"name":"Journal of Surgical Education","volume":null,"pages":null},"PeriodicalIF":2.6000,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Surgical Education","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1931720424003799","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
Citations: 0

Abstract

OBJECTIVE

Artificial intelligence (AI) is capable of answering complex medical examination questions, offering the potential to revolutionize medical education and healthcare delivery. In this study we aimed to assess ChatGPT, a model that has demonstrated exceptional performance on standardized exams. Specifically, we evaluated ChatGPT's performance on the complete 2019 Orthopaedic In-Training Examination (OITE), including questions with an image component. Furthermore, we explored differences in performance between text-only questions and questions with an associated image, including whether the image was described by AI or by a trained orthopaedist.

DESIGN AND SETTING

Questions from the 2019 OITE were input into ChatGPT version 4.0 (GPT-4) using 3 response variants. Because the capacity to input or interpret images was not publicly available in ChatGPT at the time of this study, questions with an image component were supplemented with image descriptions generated either by Microsoft Azure AI Vision Studio or by the authors of the study.
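As a concrete illustration of this kind of pipeline, the sketch below shows how a textual image description could be prepended to an OITE-style question before querying GPT-4 through the OpenAI chat API. The question content, answer options, and function names are hypothetical placeholders, not the authors' actual materials or tooling; only the general pattern (describe the image, concatenate it with the question, query the model) follows the study design.

```python
# Minimal sketch of the text-plus-image-description input pattern described
# above. The question and helper names are illustrative placeholders, not the
# study's actual materials or code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_with_description(stem: str, options: list[str], image_description: str | None) -> str:
    """Send one multiple-choice question to GPT-4, optionally prefixed with a textual image description."""
    parts = []
    if image_description:
        parts.append(f"Image description: {image_description}")
    parts.append(stem)
    parts.append("\n".join(f"{letter}. {text}" for letter, text in zip("ABCDE", options)))
    parts.append("Answer with the single best option letter.")

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "\n\n".join(parts)}],
    )
    return response.choices[0].message.content


# Hypothetical example question (not taken from the actual OITE):
print(ask_with_description(
    stem="A 25-year-old presents after an inversion ankle injury. What is the most likely injured structure?",
    options=["Deltoid ligament", "Anterior talofibular ligament", "Achilles tendon", "Spring ligament"],
    image_description="Lateral ankle radiograph showing soft-tissue swelling and no fracture.",
))
```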

RESULTS

ChatGPT performed comparably on OITE questions with and without imaging components, averaging 49% and 48% correct answers, respectively, across all 3 input methods. Performance dropped by 6% when AI-generated image descriptions were used. When using the single-answer multiple-choice input method, ChatGPT answered 49% of questions correctly, nearly double the rate expected from random guessing. ChatGPT performed worse than all resident classes on the 2019 exam, scoring 4% lower than PGY-1 residents.
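For context, the "nearly double random guessing" claim follows from a one-line calculation. Assuming each question offers k equally likely answer options (k = 4 here is an assumption for illustration; the abstract does not state the question format), the expected random score and the reported ratio are:

```latex
% Expected score from uniform random guessing over k answer options.
% Taking k = 4 is an illustrative assumption; the abstract does not
% state how many options each OITE question offers.
\mathbb{E}[\text{score}] = \frac{1}{k}
  \;\xrightarrow{\,k=4\,}\; 25\%,
\qquad
\frac{49\%}{25\%} \approx 1.96 \approx 2.
```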

DISCUSSION

ChatGPT performed below all resident classes on the 2019 OITE. Performance on text-only questions and questions with images was nearly equal when the image was described by a trained orthopaedic specialist, but decreased when an AI-generated description was used. Recognizing the performance capabilities of AI software may provide insight into the current and future applications of this technology in medical education.

Source journal: Journal of Surgical Education (EDUCATION, SCIENTIFIC DISCIPLINES; SURGERY)
CiteScore: 5.60
Self-citation rate: 10.30%
Articles per year: 261
Review turnaround: 48 days
About the journal: The Journal of Surgical Education (JSE) is dedicated to advancing the field of surgical education through original research. The journal publishes research articles in all surgical disciplines on topics relevant to the education of surgical students, residents, and fellows, as well as practicing surgeons. Our readers look to JSE for timely, innovative research findings from the international surgical education community. As the official journal of the Association of Program Directors in Surgery (APDS), JSE publishes the proceedings of the annual APDS meeting held during Surgery Education Week.