The performance of large language models on fictional consult queries indicates favorable potential for AI-assisted vascular surgery consult handling

Quang Le BS, Kedar S. Lavingia MD, Michael Amendola MD, MEHP
{"title":"The performance of large language models on fictional consult queries indicates favorable potential for AI-assisted vascular surgery consult handling","authors":"Quang Le BS ,&nbsp;Kedar S. Lavingia MD ,&nbsp;Michael Amendola MD, MEHP","doi":"10.1016/j.jvsvi.2023.100052","DOIUrl":null,"url":null,"abstract":"<div><h3>Objective</h3><p>Recently, the use of large language models (LLMs) in medicine has become a prominent topic of discussion due to the rapid improvement of these tools in understanding and responding to natural language. Several models are widely available to the public, both proprietary and open-sourced. We aim to evaluate the possible use of such LLMs in vascular surgery by understanding their abilities to process common consult requests.</p></div><div><h3>Methods</h3><p>The senior author created 25 fictional vascular surgery consultation queries based on common consultation requests. Five attending surgeons and four LLMs (GPT 3.5, GPT 4, Bard, and Falcon 40B) were asked to answer whether each consult was an emergency that needed immediate attention within an hour. Responders were also asked whether the next best step was an examination, additional imaging, or an urgent operation. GPT 3.5 and 4 also provided free-response answers on the next best step, graded by attending surgeons based on scientific accuracy, possible harm, and content completeness.</p></div><div><h3>Results</h3><p>The rates of accurate emergency identification were 88%, 100%, 76%, and 88% for GPT 3.5, GPT 4, Falcon 40B, and Bard, respectively. Although they have similar overall accuracy, GPT 3.5 has a high sensitivity at 100%, whereas Bard has a high specificity at 90%. GPT 4.0 had 100% sensitivity and specificity. LLMs agreed with the majority surgeon opinion on the next best step in 64% (GPT 3.5), 32% (GPT 4), 68% (Falcon 40B), and 36% (Bard) of cases. GPT 3.5 and 4 had a collective ratio of 89.5% of answers adhering to the scientific consensus. Only 5% of responses were highly likely to cause clinically significant harm. Although only 4% included incorrect content, 17.5% of answers missed important content. There was no significant difference between GPT 3.5 and 4 regarding the free-response grade.</p></div><div><h3>Conclusions</h3><p>Existing, widely available LLMs exhibited a solid ability to identify vascular emergencies, with GPT 4.0 agreeing with surgeon attendings in 100% of cases. However, these models continue to have identifiable deficiencies in treatment recommendations, a higher-level task. Future models might help triage incoming consults and provide preliminary management suggestions. The utility of such tools in clinical practice remains to be explored.</p></div>","PeriodicalId":74034,"journal":{"name":"JVS-vascular insights","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2949912723000491/pdfft?md5=d438b8e4aee6234325d2f144047a04fb&pid=1-s2.0-S2949912723000491-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JVS-vascular insights","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2949912723000491","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Objective

Recently, the use of large language models (LLMs) in medicine has become a prominent topic of discussion due to the rapid improvement of these tools in understanding and responding to natural language. Several models, both proprietary and open source, are widely available to the public. We aimed to evaluate the potential use of such LLMs in vascular surgery by assessing their ability to process common consult requests.

Methods

The senior author created 25 fictional vascular surgery consultation queries based on common consultation requests. Five attending surgeons and four LLMs (GPT 3.5, GPT 4, Bard, and Falcon 40B) were asked whether each consult was an emergency requiring attention within one hour. Respondents were also asked whether the next best step was an examination, additional imaging, or an urgent operation. GPT 3.5 and GPT 4 also provided free-response answers on the next best step, which were graded by attending surgeons for scientific accuracy, potential for harm, and content completeness.
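For readers who want to run a similar evaluation programmatically, the sketch below shows one way a fictional consult query could be posed to an LLM through an API. The study does not describe how the models were queried, so the use of the OpenAI Python SDK, the model identifier, the prompt wording, and the example consult are all assumptions for illustration only.

```python
# Hypothetical sketch: posing a fictional consult query to an LLM and asking
# for the two structured judgments used in the study (emergency yes/no, next
# best step). Prompt wording, model name, and the consult text are assumptions,
# not the study's materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

consult = (
    "A 68-year-old man presents with a cold, pulseless right foot "
    "and sudden-onset pain for the past two hours."
)  # illustrative fictional consult, not one of the study's 25 queries

prompt = (
    "You are answering a vascular surgery consult.\n"
    f"Consult: {consult}\n"
    "1) Is this an emergency requiring attention within one hour? Answer yes or no.\n"
    "2) Is the next best step an examination, additional imaging, or an urgent operation?"
)

response = client.chat.completions.create(
    model="gpt-4",   # assumed model identifier
    temperature=0,   # deterministic output for repeatable grading
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```

The structured two-part prompt simply mirrors the binary emergency question and the three-option next-step question described in the methods; graded free-response answers would require a separate, open-ended prompt.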

Results

The rates of accurate emergency identification were 88%, 100%, 76%, and 88% for GPT 3.5, GPT 4, Falcon 40B, and Bard, respectively. Although overall accuracy was similar across models, GPT 3.5 had a high sensitivity of 100%, whereas Bard had a high specificity of 90%. GPT 4 had 100% sensitivity and specificity. LLMs agreed with the majority surgeon opinion on the next best step in 64% (GPT 3.5), 32% (GPT 4), 68% (Falcon 40B), and 36% (Bard) of cases. Collectively, 89.5% of GPT 3.5 and GPT 4 free-response answers adhered to the scientific consensus. Only 5% of responses were judged highly likely to cause clinically significant harm. Although only 4% included incorrect content, 17.5% of answers omitted important content. There was no significant difference between GPT 3.5 and GPT 4 in free-response grades.
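To make the reported metrics concrete, the following minimal sketch shows how accuracy, sensitivity, and specificity are derived from emergency/non-emergency judgments compared against surgeon-consensus labels. The data in the example are fabricated for illustration and are not the study's results.

```python
# Minimal sketch of how accuracy, sensitivity, and specificity could be
# computed from model answers against surgeon-consensus labels.
# The labels and predictions below are made up for illustration only.
def triage_metrics(truth, preds):
    """truth/preds: lists of booleans, True = emergency."""
    tp = sum(t and p for t, p in zip(truth, preds))
    tn = sum((not t) and (not p) for t, p in zip(truth, preds))
    fp = sum((not t) and p for t, p in zip(truth, preds))
    fn = sum(t and (not p) for t, p in zip(truth, preds))
    return {
        "accuracy": (tp + tn) / len(truth),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }

# Fabricated example: 10 consults, 5 true emergencies.
truth = [True] * 5 + [False] * 5
preds = [True, True, True, True, True, True, False, False, False, False]
print(triage_metrics(truth, preds))
# -> accuracy 0.9, sensitivity 1.0 (all emergencies caught),
#    specificity 0.8 (one non-emergency flagged as an emergency)
```

In this framing, sensitivity reflects how often a true emergency is flagged within the one-hour window, and specificity reflects how often a non-emergent consult is correctly deprioritized.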

Conclusions

Existing, widely available LLMs exhibited a solid ability to identify vascular emergencies, with GPT 4 agreeing with attending surgeons in 100% of cases. However, these models continue to have identifiable deficiencies in treatment recommendations, a higher-level task. Future models might help triage incoming consults and provide preliminary management suggestions. The utility of such tools in clinical practice remains to be explored.
