Evaluating the progression of artificial intelligence and large language models in medicine through comparative analysis of ChatGPT-3.5 and ChatGPT-4 in generating vascular surgery recommendations

Arshia P. Javidan MD, MSc , Tiam Feridooni MD, PhD , Lauren Gordon MD, PhD , Sean A. Crawford MD, PhD
JVS-Vascular Insights, 2024. DOI: 10.1016/j.jvsvi.2023.100049. Open-access PDF: https://www.sciencedirect.com/science/article/pii/S2949912723000466

Abstract

Objective

Artificial intelligence (AI) continues to become increasingly integrated with clinical medicine. Generative AI, and particularly large language models (LLMs) such as ChatGPT-3.5 and ChatGPT-4, have shown promise in generating human-like text, offering a potential tool for augmenting clinical care. These online AI chatbots have already demonstrated remarkable clinical potential, having passed the US Medical Licensing Examination, for example. Evaluation of these LLMs in the surgical literature, especially as it applies to judgment and decision-making, remains sparse. This study aimed to (1) evaluate the efficacy of ChatGPT-4 in providing clinician-level vascular surgery recommendations and (2) compare its performance with that of its predecessor, ChatGPT-3.5, to gauge the progression of the clinical competencies of LLMs.

Methods

A set of 40 clinician-level questions spanning four domains of vascular surgery (carotid artery disease, visceral artery aneurysms, abdominal aortic aneurysms, and chronic limb-threatening ischemia) was generated by clinical experts. These domains were chosen based on the availability of updated guidelines published before September 2021, the cutoff date for the LLMs' training data. The questions, provided without additional context or prompts, were input into ChatGPT-3.5 and ChatGPT-4 between March 20 and March 25, 2023. Responses were evaluated independently by two blinded reviewers using a 5-point Likert scale assessing comprehensiveness, accuracy, and consistency with guidelines. The Flesch-Kincaid grade level of each response was also determined. Independent-samples t tests and Fisher's exact test were used for comparative analysis.
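The Flesch-Kincaid grade level used here is a fixed formula over word, sentence, and syllable counts. A minimal sketch (the counts are assumed to be supplied; syllable counting from raw text is a separate heuristic step not shown):

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from raw text counts.

    Higher values correspond to more years of education needed to
    read the text; roughly 13-16 corresponds to college-level material.
    """
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Hypothetical counts for illustration only (not from the study):
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(round(grade, 2))  # 9.91
```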

Results

ChatGPT-4 significantly outperformed ChatGPT-3.5, providing appropriate recommendations for 38 of 40 questions (95%) compared with 13 of 40 (32.5%) for ChatGPT-3.5 (Fisher's exact test, P < .001). Despite longer responses (ChatGPT-4 mean, 317 ± 58 words vs ChatGPT-3.5 mean, 265 ± 74 words; P < .001), the reading ease of the two models' output remained similar, corresponding to college-graduate-level text.
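The headline comparison (38/40 vs 13/40 appropriate recommendations) can be reproduced with a two-sided Fisher's exact test on the 2x2 contingency table. A stdlib-only sketch (a typical analysis would instead call a statistics package such as scipy.stats.fisher_exact; the cell labels below are inferred from the abstract):

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def p_table(x: int) -> float:
        # Probability of a table whose top-left cell is x, margins fixed.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = p_table(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # The (1 + 1e-9) factor guards against floating-point ties.
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))

# Appropriate vs inappropriate recommendations:
# ChatGPT-4 38/40, ChatGPT-3.5 13/40.
p = fisher_exact_two_sided(38, 2, 13, 27)
print(f"P = {p:.2e}")  # well below the .001 threshold reported above
```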

Conclusions

ChatGPT-4 can consistently respond accurately to complex clinician-level vascular surgery questions. This also represents a substantial advancement over its predecessor, released only a few months prior, highlighting the rapid progress of LLM performance in clinical medicine. Several limitations persist with the use of LLMs, including hallucinations, data privacy issues, and the black box problem. However, these findings suggest that, with further refinement, LLMs like ChatGPT-4 have the potential to become indispensable tools in clinical decision-making, marking an exciting frontier in the fusion of AI with clinical medicine and vascular surgery.
