Large Language Model GPT-4 Compared to Endocrinologist Responses on Initial Choice of Glucose-Lowering Medication Under Conditions of Clinical Uncertainty.

Diabetes Care. Pub Date: 2025-02-01. DOI: 10.2337/dc24-1067
James H Flory, Jessica S Ancker, Scott Y H Kim, Gilad Kuperman, Aleksandr Petrov, Andrew Vickers
Pages: 185-192. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11770168/pdf/

Abstract

Objective: To explore how the commercially available large language model (LLM) GPT-4 compares to endocrinologists in addressing medical questions for which the best answer is uncertain.

Research design and methods: This study compared responses from GPT-4 to responses from 31 endocrinologists using hypothetical clinical vignettes focused on diabetes, specifically examining the prescription of metformin versus alternative treatments. The primary outcome was the choice between metformin and other treatments.

Results: With a simple prompt, GPT-4 chose metformin in 12% (95% CI 7.9-17%) of responses, compared with 31% (95% CI 23-39%) of endocrinologist responses. After the prompt was modified to encourage metformin use, GPT-4's selection of metformin increased to 25% (95% CI 22-28%). GPT-4 rarely selected metformin in patients with impaired kidney function or a history of gastrointestinal distress (2.9% of responses, 95% CI 1.4-5.5%). In contrast, endocrinologists often prescribed metformin even in patients with a history of gastrointestinal distress (21% of responses, 95% CI 12-36%). GPT-4 responses showed low variability on repeated runs except at intermediate levels of kidney function.

Conclusions: In clinical scenarios with no single right answer, GPT-4's responses were reasonable, but differed from endocrinologists' responses in clinically important ways. Value judgments are needed to determine when these differences should be addressed by adjusting the model. We recommend against reliance on LLM output until it is shown to align not just with clinical guidelines but also with patient and clinician preferences, or it demonstrates improvement in clinical outcomes over standard of care.
