Radiologic Decision-Making for Imaging in Pulmonary Embolism: Accuracy and Reliability of Large Language Models—Bing, Claude, ChatGPT, and Perplexity

Pradosh Kumar Sarangi, Suvrankar Datta, M. S. Swarup, Swaha Panda, Debasish Swapnesh Kumar Nayak, Archana Malik, Ananda Datta, Himel Mondal
{"title":"Radiologic Decision-Making for Imaging in Pulmonary Embolism: Accuracy and Reliability of Large Language Models—Bing, Claude, ChatGPT, and Perplexity","authors":"Pradosh Kumar Sarangi, Suvrankar Datta, M. S. Swarup, Swaha Panda, Debasish Swapnesh Kumar Nayak, Archana Malik, Ananda Datta, Himel Mondal","doi":"10.1055/s-0044-1787974","DOIUrl":null,"url":null,"abstract":"Abstract Background  Artificial intelligence chatbots have demonstrated potential to enhance clinical decision-making and streamline health care workflows, potentially alleviating administrative burdens. However, the contribution of AI chatbots to radiologic decision-making for clinical scenarios remains insufficiently explored. This study evaluates the accuracy and reliability of four prominent Large Language Models (LLMs)—Microsoft Bing, Claude, ChatGPT 3.5, and Perplexity—in offering clinical decision support for initial imaging for suspected pulmonary embolism (PE). Methods  Open-ended (OE) and select-all-that-apply (SATA) questions were crafted, covering four variants of case scenarios of PE in-line with the American College of Radiology Appropriateness Criteria®. These questions were presented to the LLMs by three radiologists from diverse geographical regions and setups. The responses were evaluated based on established scoring criteria, with a maximum achievable score of 2 points for OE responses and 1 point for each correct answer in SATA questions. To enable comparative analysis, scores were normalized (score divided by the maximum achievable score). Result  In OE questions, Perplexity achieved the highest accuracy (0.83), while Claude had the lowest (0.58), with Bing and ChatGPT each scoring 0.75. For SATA questions, Bing led with an accuracy of 0.96, Perplexity was the lowest at 0.56, and both Claude and ChatGPT scored 0.6. Overall, OE questions saw higher scores (0.73) compared to SATA (0.68). There is poor agreement among radiologists' scores for OE (Intraclass Correlation Coefficient [ICC] = −0.067, p  = 0.54), while there is strong agreement for SATA (ICC = 0.875, p  < 0.001). Conclusion  The study revealed variations in accuracy across LLMs for both OE and SATA questions. Perplexity showed superior performance in OE questions, while Bing excelled in SATA questions. OE queries yielded better overall results. The current inconsistencies in LLM accuracy highlight the importance of further refinement before these tools can be reliably integrated into clinical practice, with a need for additional LLM fine-tuning and judicious selection by radiologists to achieve consistent and reliable support for decision-making.","PeriodicalId":506648,"journal":{"name":"Indian Journal of Radiology and Imaging","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Indian Journal of Radiology and Imaging","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1055/s-0044-1787974","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Background: Artificial intelligence chatbots have demonstrated potential to enhance clinical decision-making and streamline health care workflows, potentially alleviating administrative burdens. However, the contribution of AI chatbots to radiologic decision-making in clinical scenarios remains insufficiently explored. This study evaluates the accuracy and reliability of four prominent large language models (LLMs), namely Microsoft Bing, Claude, ChatGPT 3.5, and Perplexity, in offering clinical decision support for initial imaging in suspected pulmonary embolism (PE).

Methods: Open-ended (OE) and select-all-that-apply (SATA) questions were crafted, covering four variants of PE case scenarios in line with the American College of Radiology Appropriateness Criteria®. These questions were presented to the LLMs by three radiologists from diverse geographical regions and practice settings. Responses were evaluated against established scoring criteria, with a maximum achievable score of 2 points for OE responses and 1 point for each correct answer in SATA questions. To enable comparative analysis, scores were normalized (score divided by the maximum achievable score).

Results: In OE questions, Perplexity achieved the highest accuracy (0.83) and Claude the lowest (0.58), with Bing and ChatGPT each scoring 0.75. For SATA questions, Bing led with an accuracy of 0.96, Perplexity was lowest at 0.56, and Claude and ChatGPT each scored 0.60. Overall, OE questions yielded higher scores (0.73) than SATA questions (0.68). Agreement among radiologists' scores was poor for OE questions (intraclass correlation coefficient [ICC] = −0.067, p = 0.54) but strong for SATA questions (ICC = 0.875, p < 0.001).

Conclusion: The study revealed variations in accuracy across LLMs for both OE and SATA questions. Perplexity performed best on OE questions, while Bing excelled on SATA questions, and OE queries yielded better overall results. The current inconsistencies in LLM accuracy highlight the need for further refinement, including additional fine-tuning and judicious model selection by radiologists, before these tools can provide consistent and reliable support for clinical decision-making.
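The score normalization and inter-rater agreement analysis described above (dividing each score by its maximum achievable value, then computing the ICC across the three radiologists' scores) can be sketched with standard statistical tooling. The snippet below is a minimal illustration, not the authors' code; the column names, example values, and the use of the `pingouin` package are assumptions for demonstration only.

```python
# Minimal sketch of the scoring normalization and ICC agreement analysis
# outlined in the abstract. Data values and column names are illustrative
# placeholders, not the study's actual ratings.
import pandas as pd
import pingouin as pg  # provides intraclass_corr()

# Hypothetical long-format ratings: one row per (question, rater) pair.
# "raw_score" is the points awarded; "max_score" is 2 for OE items and
# the number of correct options for SATA items.
ratings = pd.DataFrame({
    "question":  ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2", "Q3", "Q3", "Q3"],
    "rater":     ["R1", "R2", "R3", "R1", "R2", "R3", "R1", "R2", "R3"],
    "raw_score": [2, 1, 2, 3, 3, 2, 1, 1, 2],
    "max_score": [2, 2, 2, 4, 4, 4, 2, 2, 2],
})

# Normalize each score by its maximum achievable score, as in the study.
ratings["norm_score"] = ratings["raw_score"] / ratings["max_score"]

# Intraclass correlation across raters, with questions as the targets.
icc = pg.intraclass_corr(
    data=ratings, targets="question", raters="rater", ratings="norm_score"
)
print(icc[["Type", "ICC", "pval", "CI95%"]])
```

The abstract does not state which ICC form (e.g., two-way random effects, absolute agreement) was used, so the output above, which reports all standard ICC variants, should be read only as an illustration of the general approach.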