Performance of Plug-In Augmented ChatGPT and Its Ability to Quantify Uncertainty: Simulation Study on the German Medical Board Examination.

JMIR Medical Education (IF 3.2; JCR Q1, Education, Scientific Disciplines) | Pub Date: 2025-03-21 | DOI: 10.2196/58375

Julian Madrid, Philipp Diehl, Mischa Selig, Bernd Rolauffs, Felix Patricius Hans, Hans-Jörg Busch, Tobias Scheef, Leo Benning

JMIR Medical Education, volume 11, e58375. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11951815/pdf/


Abstract

Background: GPT-4 is a large language model (LLM) trained and fine-tuned on an extensive dataset. Since the public release of its predecessor in November 2022, interest in LLMs has risen sharply, and a multitude of potential use cases have been proposed. In parallel, however, important limitations have been outlined. In particular, current LLMs are limited in symbolic representation and in accessing contemporary data. The recent version of GPT-4, alongside newly released plugin features, has been introduced to mitigate some of these limitations.

Objective: Against this background, this work aims to investigate the performance of GPT-3.5, GPT-4, GPT-4 with plugins, and GPT-4 with plugins using pretranslated English text on the German medical board examination. Recognizing the critical importance of quantifying uncertainty for LLM applications in medicine, we also assess this ability and develop a new metric termed "confidence accuracy" to evaluate it.

Methods: We used GPT-3.5, GPT-4, GPT-4 with plugins, and GPT-4 with plugins and translation to answer questions from the German medical board examination. Additionally, we conducted an analysis to assess how the models justify their answers, the accuracy of their responses, and the error structure of their answers. Bootstrapping and CIs were used to evaluate the statistical significance of our findings.
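The abstract does not spell out the bootstrapping procedure, so the following is only a minimal sketch of one common variant, a percentile bootstrap CI for accuracy on per-question right/wrong outcomes. The function name `bootstrap_accuracy_ci`, the resample count, and the example data are all assumptions made for illustration; they are not taken from the study.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy over a list of 0/1 outcomes.

    Hypothetical helper: a generic percentile bootstrap, not necessarily
    the procedure the authors used.
    """
    rng = random.Random(seed)
    n = len(correct)
    resampled_means = []
    for _ in range(n_resamples):
        # Resample n questions with replacement and recompute accuracy.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles as CI bounds.
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper


# Made-up outcomes: 80 correct and 20 incorrect answers (not study data).
outcomes = [1] * 80 + [0] * 20
accuracy, ci_lower, ci_upper = bootstrap_accuracy_ci(outcomes)
print(f"accuracy={accuracy:.2f}, 95% CI=({ci_lower:.2f}, {ci_upper:.2f})")
```

Under this sketch, two model configurations could be compared by checking whether their intervals overlap, or by bootstrapping the difference in accuracy directly; which approach the authors took is not stated in the abstract.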

Results: This study demonstrated that available GPT models, as LLM examples, exceeded the minimum competency threshold established by the German medical board for medical students to obtain board certification to practice medicine. Moreover, the models could assess the uncertainty in their responses, albeit exhibiting overconfidence. Additionally, this work revealed certain justification and reasoning structures that emerge when GPT generates answers.

Conclusions: The high performance of GPTs in answering medical questions positions them well for applications in academia and, potentially, clinical practice. Their capability to quantify uncertainty in answers suggests they could be valuable artificial intelligence agents within the clinical decision-making loop. Nevertheless, significant challenges must be addressed before artificial intelligence agents can be robustly and safely implemented in the medical domain.


