The performance of large language models on fictional consult queries indicates favorable potential for AI-assisted vascular surgery consult handling

Quang Le BS, Kedar S. Lavingia MD, Michael Amendola MD, MEHP
{"title":"The performance of large language models on fictional consult queries indicates favorable potential for AI-assisted vascular surgery consult handling","authors":"Quang Le BS ,&nbsp;Kedar S. Lavingia MD ,&nbsp;Michael Amendola MD, MEHP","doi":"10.1016/j.jvsvi.2023.100052","DOIUrl":null,"url":null,"abstract":"<div><h3>Objective</h3><p>Recently, the use of large language models (LLMs) in medicine has become a prominent topic of discussion due to the rapid improvement of these tools in understanding and responding to natural language. Several models are widely available to the public, both proprietary and open-sourced. We aim to evaluate the possible use of such LLMs in vascular surgery by understanding their abilities to process common consult requests.</p></div><div><h3>Methods</h3><p>The senior author created 25 fictional vascular surgery consultation queries based on common consultation requests. Five attending surgeons and four LLMs (GPT 3.5, GPT 4, Bard, and Falcon 40B) were asked to answer whether each consult was an emergency that needed immediate attention within an hour. Responders were also asked whether the next best step was an examination, additional imaging, or an urgent operation. GPT 3.5 and 4 also provided free-response answers on the next best step, graded by attending surgeons based on scientific accuracy, possible harm, and content completeness.</p></div><div><h3>Results</h3><p>The rates of accurate emergency identification were 88%, 100%, 76%, and 88% for GPT 3.5, GPT 4, Falcon 40B, and Bard, respectively. Although they have similar overall accuracy, GPT 3.5 has a high sensitivity at 100%, whereas Bard has a high specificity at 90%. GPT 4.0 had 100% sensitivity and specificity. LLMs agreed with the majority surgeon opinion on the next best step in 64% (GPT 3.5), 32% (GPT 4), 68% (Falcon 40B), and 36% (Bard) of cases. GPT 3.5 and 4 had a collective ratio of 89.5% of answers adhering to the scientific consensus. Only 5% of responses were highly likely to cause clinically significant harm. Although only 4% included incorrect content, 17.5% of answers missed important content. There was no significant difference between GPT 3.5 and 4 regarding the free-response grade.</p></div><div><h3>Conclusions</h3><p>Existing, widely available LLMs exhibited a solid ability to identify vascular emergencies, with GPT 4.0 agreeing with surgeon attendings in 100% of cases. However, these models continue to have identifiable deficiencies in treatment recommendations, a higher-level task. Future models might help triage incoming consults and provide preliminary management suggestions. The utility of such tools in clinical practice remains to be explored.</p></div>","PeriodicalId":74034,"journal":{"name":"JVS-vascular insights","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2949912723000491/pdfft?md5=d438b8e4aee6234325d2f144047a04fb&pid=1-s2.0-S2949912723000491-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JVS-vascular insights","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2949912723000491","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Objective

Recently, the use of large language models (LLMs) in medicine has become a prominent topic of discussion due to the rapid improvement of these tools in understanding and responding to natural language. Several models, both proprietary and open source, are widely available to the public. We aimed to evaluate the potential use of such LLMs in vascular surgery by assessing their ability to process common consult requests.

Methods

The senior author created 25 fictional vascular surgery consultation queries based on common consultation requests. Five attending surgeons and four LLMs (GPT 3.5, GPT 4, Bard, and Falcon 40B) were asked whether each consult was an emergency requiring attention within one hour. Respondents were also asked whether the next best step was an examination, additional imaging, or an urgent operation. GPT 3.5 and GPT 4 also provided free-response answers on the next best step, which were graded by attending surgeons for scientific accuracy, potential for harm, and content completeness.
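For readers who want to run a similar evaluation programmatically, the sketch below shows one way a fictional consult query could be posed to an LLM through an API. The study does not describe how the models were queried, so the use of the OpenAI Python SDK, the model identifier, the prompt wording, and the example consult are all assumptions for illustration only.

```python
# Hypothetical sketch: posing a fictional consult query to an LLM and asking
# for the two structured judgments used in the study (emergency yes/no, next
# best step). Prompt wording, model name, and the consult text are assumptions,
# not the study's materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

consult = (
    "A 68-year-old man presents with a cold, pulseless right foot "
    "and sudden-onset pain for the past two hours."
)  # illustrative fictional consult, not one of the study's 25 queries

prompt = (
    "You are answering a vascular surgery consult.\n"
    f"Consult: {consult}\n"
    "1) Is this an emergency requiring attention within one hour? Answer yes or no.\n"
    "2) Is the next best step an examination, additional imaging, or an urgent operation?"
)

response = client.chat.completions.create(
    model="gpt-4",   # assumed model identifier
    temperature=0,   # deterministic output for repeatable grading
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```

The structured two-part prompt simply mirrors the binary emergency question and the three-option next-step question described in the methods; graded free-response answers would require a separate, open-ended prompt.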

Results

The rates of accurate emergency identification were 88%, 100%, 76%, and 88% for GPT 3.5, GPT 4, Falcon 40B, and Bard, respectively. Although overall accuracy was similar across models, GPT 3.5 had a high sensitivity of 100%, whereas Bard had a high specificity of 90%. GPT 4 had 100% sensitivity and specificity. LLMs agreed with the majority surgeon opinion on the next best step in 64% (GPT 3.5), 32% (GPT 4), 68% (Falcon 40B), and 36% (Bard) of cases. Collectively, 89.5% of GPT 3.5 and GPT 4 free-response answers adhered to the scientific consensus. Only 5% of responses were judged highly likely to cause clinically significant harm. Although only 4% included incorrect content, 17.5% of answers omitted important content. There was no significant difference between GPT 3.5 and GPT 4 in free-response grades.
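To make the reported metrics concrete, the following minimal sketch shows how accuracy, sensitivity, and specificity are derived from emergency/non-emergency judgments compared against surgeon-consensus labels. The data in the example are fabricated for illustration and are not the study's results.

```python
# Minimal sketch of how accuracy, sensitivity, and specificity could be
# computed from model answers against surgeon-consensus labels.
# The labels and predictions below are made up for illustration only.
def triage_metrics(truth, preds):
    """truth/preds: lists of booleans, True = emergency."""
    tp = sum(t and p for t, p in zip(truth, preds))
    tn = sum((not t) and (not p) for t, p in zip(truth, preds))
    fp = sum((not t) and p for t, p in zip(truth, preds))
    fn = sum(t and (not p) for t, p in zip(truth, preds))
    return {
        "accuracy": (tp + tn) / len(truth),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }

# Fabricated example: 10 consults, 5 true emergencies.
truth = [True] * 5 + [False] * 5
preds = [True, True, True, True, True, True, False, False, False, False]
print(triage_metrics(truth, preds))
# -> accuracy 0.9, sensitivity 1.0 (all emergencies caught),
#    specificity 0.8 (one non-emergency flagged as an emergency)
```

In this framing, sensitivity reflects how often a true emergency is flagged within the one-hour window, and specificity reflects how often a non-emergent consult is correctly deprioritized.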

Conclusions

Existing, widely available LLMs exhibited a solid ability to identify vascular emergencies, with GPT 4 agreeing with attending surgeons in 100% of cases. However, these models continue to have identifiable deficiencies in treatment recommendations, a higher-level task. Future models might help triage incoming consults and provide preliminary management suggestions. The utility of such tools in clinical practice remains to be explored.
