Large Language Model GPT-4 Compared to Endocrinologist Responses on Initial Choice of Glucose-Lowering Medication Under Conditions of Clinical Uncertainty.

Diabetes Care. Pub Date: 2025-02-01. DOI: 10.2337/dc24-1067

James H Flory, Jessica S Ancker, Scott Y H Kim, Gilad Kuperman, Aleksandr Petrov, Andrew Vickers

Diabetes Care, pp. 185-192 (Journal Article). Full text via PMC: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11770168/pdf/

Abstract

Objective: To explore how the commercially available large language model (LLM) GPT-4 compares to endocrinologists when addressing medical questions when there is uncertainty regarding the best answer.

Research design and methods: This study compared responses from GPT-4 to responses from 31 endocrinologists using hypothetical clinical vignettes focused on diabetes, specifically examining the prescription of metformin versus alternative treatments. The primary outcome was the choice between metformin and other treatments.

Results: With a simple prompt, GPT-4 chose metformin in 12% (95% CI 7.9-17%) of responses, compared with 31% (95% CI 23-39%) of endocrinologist responses. After modifying the prompt to encourage metformin use, the selection of metformin by GPT-4 increased to 25% (95% CI 22-28%). GPT-4 rarely selected metformin in patients with impaired kidney function, or a history of gastrointestinal distress (2.9% of responses, 95% CI 1.4-5.5%). In contrast, endocrinologists often prescribed metformin even in patients with a history of gastrointestinal distress (21% of responses, 95% CI 12-36%). GPT-4 responses showed low variability on repeated runs except at intermediate levels of kidney function.
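
The percentages above are reported with 95% confidence intervals. As a rough illustration of how such an interval can be computed for a single proportion, the sketch below uses the Wilson score method. This is an assumption for illustration only: the abstract does not state which interval method the authors used, and the counts in the usage lines (24 of 200) are hypothetical, not the study's data. The sketch also ignores any clustering of responses within the same endocrinologist or vignette, which the study's analysis may have accounted for.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    Illustrative helper only; not the method documented by the study.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")

    p_hat = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p_hat + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom

    # Clamp to [0, 1] to guard against tiny floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical counts: 24 "metformin" choices out of 200 responses (12%).
low, high = wilson_interval(24, 200)
print(f"12% observed, 95% CI {low:.1%} to {high:.1%}")
```

With a fixed observed proportion, the interval narrows as the number of responses grows, which is one reason intervals for a small group of clinicians tend to be wider than those for many repeated model runs.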

Conclusions: In clinical scenarios with no single right answer, GPT-4's responses were reasonable, but differed from endocrinologists' responses in clinically important ways. Value judgments are needed to determine when these differences should be addressed by adjusting the model. We recommend against reliance on LLM output until it is shown to align not just with clinical guidelines but also with patient and clinician preferences, or it demonstrates improvement in clinical outcomes over standard of care.

Source journal

Diabetes care

CiteScore

29.50

Self-citation rate

0.00%

Number of published articles