Radiologic Decision-Making for Imaging in Pulmonary Embolism: Accuracy and Reliability of Large Language Models—Bing, Claude, ChatGPT, and Perplexity

Pradosh Kumar Sarangi, Suvrankar Datta, M. S. Swarup, Swaha Panda, Debasish Swapnesh Kumar Nayak, Archana Malik, Ananda Datta, Himel Mondal
{"title":"Radiologic Decision-Making for Imaging in Pulmonary Embolism: Accuracy and Reliability of Large Language Models—Bing, Claude, ChatGPT, and Perplexity","authors":"Pradosh Kumar Sarangi, Suvrankar Datta, M. S. Swarup, Swaha Panda, Debasish Swapnesh Kumar Nayak, Archana Malik, Ananda Datta, Himel Mondal","doi":"10.1055/s-0044-1787974","DOIUrl":null,"url":null,"abstract":"Abstract Background  Artificial intelligence chatbots have demonstrated potential to enhance clinical decision-making and streamline health care workflows, potentially alleviating administrative burdens. However, the contribution of AI chatbots to radiologic decision-making for clinical scenarios remains insufficiently explored. This study evaluates the accuracy and reliability of four prominent Large Language Models (LLMs)—Microsoft Bing, Claude, ChatGPT 3.5, and Perplexity—in offering clinical decision support for initial imaging for suspected pulmonary embolism (PE). Methods  Open-ended (OE) and select-all-that-apply (SATA) questions were crafted, covering four variants of case scenarios of PE in-line with the American College of Radiology Appropriateness Criteria®. These questions were presented to the LLMs by three radiologists from diverse geographical regions and setups. The responses were evaluated based on established scoring criteria, with a maximum achievable score of 2 points for OE responses and 1 point for each correct answer in SATA questions. To enable comparative analysis, scores were normalized (score divided by the maximum achievable score). Result  In OE questions, Perplexity achieved the highest accuracy (0.83), while Claude had the lowest (0.58), with Bing and ChatGPT each scoring 0.75. For SATA questions, Bing led with an accuracy of 0.96, Perplexity was the lowest at 0.56, and both Claude and ChatGPT scored 0.6. Overall, OE questions saw higher scores (0.73) compared to SATA (0.68). There is poor agreement among radiologists' scores for OE (Intraclass Correlation Coefficient [ICC] = −0.067, p  = 0.54), while there is strong agreement for SATA (ICC = 0.875, p  < 0.001). Conclusion  The study revealed variations in accuracy across LLMs for both OE and SATA questions. Perplexity showed superior performance in OE questions, while Bing excelled in SATA questions. OE queries yielded better overall results. The current inconsistencies in LLM accuracy highlight the importance of further refinement before these tools can be reliably integrated into clinical practice, with a need for additional LLM fine-tuning and judicious selection by radiologists to achieve consistent and reliable support for decision-making.","PeriodicalId":506648,"journal":{"name":"Indian Journal of Radiology and Imaging","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Indian Journal of Radiology and Imaging","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1055/s-0044-1787974","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Background: Artificial intelligence chatbots have demonstrated potential to enhance clinical decision-making and streamline health care workflows, potentially alleviating administrative burdens. However, the contribution of AI chatbots to radiologic decision-making in clinical scenarios remains insufficiently explored. This study evaluates the accuracy and reliability of four prominent large language models (LLMs), namely Microsoft Bing, Claude, ChatGPT 3.5, and Perplexity, in offering clinical decision support for initial imaging in suspected pulmonary embolism (PE).

Methods: Open-ended (OE) and select-all-that-apply (SATA) questions were crafted, covering four variants of PE case scenarios in line with the American College of Radiology Appropriateness Criteria®. These questions were presented to the LLMs by three radiologists from diverse geographical regions and practice settings. Responses were evaluated against established scoring criteria, with a maximum achievable score of 2 points for OE responses and 1 point for each correct answer in SATA questions. To enable comparative analysis, scores were normalized (score divided by the maximum achievable score).

Results: In OE questions, Perplexity achieved the highest accuracy (0.83) and Claude the lowest (0.58), with Bing and ChatGPT each scoring 0.75. For SATA questions, Bing led with an accuracy of 0.96, Perplexity was lowest at 0.56, and Claude and ChatGPT each scored 0.60. Overall, OE questions yielded higher scores (0.73) than SATA questions (0.68). Agreement among radiologists' scores was poor for OE questions (intraclass correlation coefficient [ICC] = −0.067, p = 0.54) but strong for SATA questions (ICC = 0.875, p < 0.001).

Conclusion: The study revealed variations in accuracy across LLMs for both OE and SATA questions. Perplexity performed best on OE questions, while Bing excelled on SATA questions, and OE queries yielded better overall results. The current inconsistencies in LLM accuracy highlight the need for further refinement, including additional fine-tuning and judicious model selection by radiologists, before these tools can provide consistent and reliable support for clinical decision-making.
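The score normalization and inter-rater agreement analysis described above (dividing each score by its maximum achievable value, then computing the ICC across the three radiologists' scores) can be sketched with standard statistical tooling. The snippet below is a minimal illustration, not the authors' code; the column names, example values, and the use of the `pingouin` package are assumptions for demonstration only.

```python
# Minimal sketch of the scoring normalization and ICC agreement analysis
# outlined in the abstract. Data values and column names are illustrative
# placeholders, not the study's actual ratings.
import pandas as pd
import pingouin as pg  # provides intraclass_corr()

# Hypothetical long-format ratings: one row per (question, rater) pair.
# "raw_score" is the points awarded; "max_score" is 2 for OE items and
# the number of correct options for SATA items.
ratings = pd.DataFrame({
    "question":  ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2", "Q3", "Q3", "Q3"],
    "rater":     ["R1", "R2", "R3", "R1", "R2", "R3", "R1", "R2", "R3"],
    "raw_score": [2, 1, 2, 3, 3, 2, 1, 1, 2],
    "max_score": [2, 2, 2, 4, 4, 4, 2, 2, 2],
})

# Normalize each score by its maximum achievable score, as in the study.
ratings["norm_score"] = ratings["raw_score"] / ratings["max_score"]

# Intraclass correlation across raters, with questions as the targets.
icc = pg.intraclass_corr(
    data=ratings, targets="question", raters="rater", ratings="norm_score"
)
print(icc[["Type", "ICC", "pval", "CI95%"]])
```

The abstract does not state which ICC form (e.g., two-way random effects, absolute agreement) was used, so the output above, which reports all standard ICC variants, should be read only as an illustration of the general approach.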