A Comparative Evaluation of Large Language Model Utility in Neuroimaging Clinical Decision Support.

Luke Miller, Peter Kamel, Jigar Patel, Jay Agrawal, Min Zhan, Nathan Bumbarger, Kenneth Wang
Journal of Imaging Informatics in Medicine (Journal Article). Published 2024-11-07. DOI: 10.1007/s10278-024-01161-3 (https://doi.org/10.1007/s10278-024-01161-3)

Abstract

Imaging utilization has increased dramatically in recent years, and at least some of these studies are not appropriate for the clinical scenario. Large language models (LLMs) may address this issue by giving ordering providers a more accessible reference resource, but their relative performance is understudied. This study evaluates and compares the appropriateness and usefulness of imaging recommendations generated by eight publicly available models in response to neuroradiology clinical scenarios. Twenty-four common neuroradiology clinical scenarios that often yield suboptimal imaging utilization were selected, and questions were crafted to assess the ability of LLMs to provide accurate and actionable advice. The LLMs were assessed in August 2023 using natural-language queries of one to two sentences requesting advice about optimal image ordering given certain clinical parameters. Eight of the most well-known LLMs were chosen for evaluation: ChatGPT, GPT-4, Bard (Versions 1 and 2), Bing Chat, Llama 2, Perplexity, and Claude. Three fellowship-trained neuroradiologists graded each model's advice as "optimal" or "not optimal" according to the ACR Appropriateness Criteria or the New Orleans Head CT Criteria. The raters also ranked the models on the appropriateness, helpfulness, concision, and source citations of their responses. The number of scenarios in which each model delivered an "optimal" recommendation was as follows: ChatGPT (20/24), GPT-4 (23/24), Bard 1 (13/24), Bard 2 (14/24), Bing Chat (14/24), Llama 2 (5/24), Perplexity (19/24), and Claude (19/24). The median ranks of the LLMs were as follows: ChatGPT (3), GPT-4 (1.5), Bard 1 (4.5), Bard 2 (5), Bing Chat (6), Llama 2 (7.5), Perplexity (4), and Claude (3). Characteristic errors are described and discussed. GPT-4, ChatGPT, and Claude generally outperformed Bard, Bing Chat, and Llama 2. This study evaluates a greater variety of publicly available LLMs in settings that more closely mimic real-world use cases and discusses the practical challenges of doing so. It is the first study to evaluate and compare a wide range of publicly available LLMs to determine the appropriateness of their neuroradiology imaging recommendations.
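
The reported counts can be easier to compare as percentages. The following is a minimal sketch, not part of the study, that converts the "optimal" counts given in the abstract into rates out of 24 scenarios. The names `optimal_counts` and `optimal_rates` are illustrative assumptions, and the model labeled "Llama 2" here corresponds to the "Llama" entry in the reported results.

```python
# Counts of "optimal" recommendations out of 24 scenarios,
# copied from the abstract. Variable names are hypothetical.
TOTAL_SCENARIOS = 24

optimal_counts = {
    "ChatGPT": 20,
    "GPT-4": 23,
    "Bard 1": 13,
    "Bard 2": 14,
    "Bing Chat": 14,
    "Llama 2": 5,
    "Perplexity": 19,
    "Claude": 19,
}


def optimal_rates(counts, total):
    """Return {model: percent of scenarios rated optimal}, to one decimal."""
    return {model: round(100 * n / total, 1) for model, n in counts.items()}


rates = optimal_rates(optimal_counts, TOTAL_SCENARIOS)

# Print models from highest to lowest rate of optimal recommendations.
for model, pct in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<10} {optimal_counts[model]:>2}/{TOTAL_SCENARIOS}  {pct:5.1f}%")
```

These rates only restate the binary "optimal" grading. They do not reflect the separate median ranks, which also weigh helpfulness, concision, and source citations.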
