Evaluation of Chat Generative Pre-trained Transformer and Microsoft Copilot Performance on the American Society of Surgery of the Hand Self-Assessment Examinations

Journal of Hand Surgery Global Online (JCR Q3, Medicine) | Pub Date: 2025-01-01 | DOI: 10.1016/j.jhsg.2024.10.001

Taylor R. Rakauskas BS, Antonio Da Costa BS, Camberly Moriconi BS, Gurnoor Gill BA, Jeffrey W. Kwong MD MS, Nicolas Lee MD

Volume 7, Issue 1, Pages 23-28 | Journal Article | Full text: https://www.sciencedirect.com/science/article/pii/S2589514124001907

Citations: 0

Abstract

Purpose

Artificial intelligence advancements have the potential to transform medical education and patient care. The increasing popularity of large language models has raised important questions regarding their accuracy and agreement with human users. The purpose of this study was to evaluate the performance of Chat Generative Pre-Trained Transformer (ChatGPT), versions 3.5 and 4, as well as Microsoft Copilot, which is powered by ChatGPT-4, on self-assessment examination questions for hand surgery and compare results between versions.

Methods

Input included 1,000 questions across 5 years (2015–2019) of self-assessment examinations provided by the American Society for Surgery of the Hand. The primary outcomes included correctness, the percentage concordance with other users, and whether an additional prompt was required. Secondary outcomes included accuracy according to question type and difficulty.

Results

All question formats, including image-based questions, were used for the analysis. ChatGPT-3.5 correctly answered 51.6% of questions and ChatGPT-4 correctly answered 63.4%, a statistically significant difference. Microsoft Copilot correctly answered 59.9%, outperforming ChatGPT-3.5 but scoring significantly lower than ChatGPT-4. ChatGPT-3.5 sided with an average of 72.2% of users when correct and 62.1% when incorrect, compared with an average of 67.0% and 53.2% of users, respectively, for ChatGPT-4. Microsoft Copilot sided with an average of 79.7% of users when correct and 52.1% when incorrect. The highest scoring subject was Miscellaneous, and the lowest scoring subject was Neuromuscular in all versions.
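The abstract does not say which statistical test was used. As a rough illustration only, the sketch below checks the ChatGPT-3.5 versus ChatGPT-4 comparison with a two-proportion z-test. It assumes that each model answered all 1,000 questions (so 51.6% and 63.4% correspond to 516 and 634 correct answers) and treats the two samples as independent. The function name and variable names are hypothetical. Because every model saw the same questions, a paired test such as McNemar's would likely be more appropriate, but it needs per-question results that the abstract does not provide, so this unpaired approximation should not be expected to reproduce the authors' analysis.

```python
import math


def two_proportion_z_test(correct_a, correct_b, n_a, n_b):
    """Two-sided z-test for the difference between two independent proportions.

    Returns (z, p_value). This is an unpaired approximation, not the
    study's own method, which the abstract does not describe.
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    # Pooled proportion under the null hypothesis of equal accuracy.
    pooled = (correct_a + correct_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Assumption: all 1,000 questions were scored for each model.
N_QUESTIONS = 1000
CORRECT_GPT35 = 516  # 51.6% of 1,000 (reported accuracy)
CORRECT_GPT4 = 634   # 63.4% of 1,000 (reported accuracy)

z, p_value = two_proportion_z_test(CORRECT_GPT35, CORRECT_GPT4, N_QUESTIONS, N_QUESTIONS)
print(f"z = {z:.2f}, two-sided p = {p_value:.2e}")
```

Under these assumptions the gap between the two ChatGPT versions is far larger than sampling noise alone would explain, which is consistent with the significant difference reported.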

Conclusions

In this study, ChatGPT-4 and Microsoft Copilot performed better on the hand surgery subspecialty examinations than did ChatGPT-3.5. Microsoft Copilot was more accurate than ChatGPT-3.5 but less accurate than ChatGPT-4. ChatGPT-4 and Microsoft Copilot were able to “pass” the 2015–2019 American Society for Surgery of the Hand self-assessment examinations.

Clinical Relevance

Although large language models hold promise within medical education, they should be used with caution, as more detailed evaluation of their consistency is needed. Future studies should explore how these models perform across multiple trials and contexts to assess their reliability.
