ChatGPT vs Gemini: Comparative Accuracy and Efficiency in CAD-RADS Score Assignment from Radiology Reports.

Matthew Silbergleit, Adrienn Tóth, Jordan H Chamberlin, Mohamed Hamouda, Dhiraj Baruah, Sydney Derrick, U Joseph Schoepf, Jeremy R Burt, Ismail M Kabakus
Journal: Journal of Imaging Informatics in Medicine, pp. 2303-2311
Publication type: Journal Article
Publication date: 2025-08-01 (Epub 2024-11-11)
DOI: https://doi.org/10.1007/s10278-024-01328-y
PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12343400/pdf/

Abstract

This study aimed to evaluate the accuracy and efficiency of ChatGPT-3.5, ChatGPT-4o, Google Gemini, and Google Gemini Advanced in generating CAD-RADS scores from radiology reports. This retrospective study analyzed 100 consecutive coronary computed tomography angiography reports from examinations performed between March 15, 2024, and April 1, 2024, at a single tertiary center. Each report containing a radiologist-assigned CAD-RADS score was processed using four large language models (LLMs) without fine-tuning. The findings section of each report was input into the LLMs, and the models were tasked with generating CAD-RADS scores. The LLM-generated scores were compared with the radiologist's score to determine accuracy. Additionally, the time taken by each model to complete the task was recorded. Statistical analyses included the Mann-Whitney U test and interobserver agreement measured with unweighted Cohen's Kappa and Krippendorff's Alpha. ChatGPT-4o demonstrated the highest accuracy, correctly assigning CAD-RADS scores in 87% of cases (κ = 0.838, α = 0.886), followed by Gemini Advanced with 82.6% accuracy (κ = 0.784, α = 0.897). ChatGPT-3.5, although the fastest (median time = 5 s), was the least accurate (50.5% accuracy, κ = 0.401, α = 0.787). Gemini exhibited a higher failure rate (12%) than the other models, with Gemini Advanced slightly improving upon its predecessor. ChatGPT-4o outperformed the other LLMs in both accuracy and agreement with radiologist-assigned CAD-RADS scores, though ChatGPT-3.5 was significantly faster. Despite their potential, current publicly available LLMs require further refinement before being deployed for clinical decision-making in CAD-RADS scoring.
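
The two headline metrics in the abstract, accuracy and unweighted Cohen's Kappa, can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the function names and the toy score lists below are hypothetical, chosen only to show how agreement between a radiologist's CAD-RADS scores and a model's scores might be computed.

```python
from collections import Counter


def accuracy(reference, predicted):
    """Fraction of reports where the model's score equals the reference score."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)


def cohen_kappa(reference, predicted):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(reference)
    p_o = accuracy(reference, predicted)  # observed agreement
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    # Agreement expected by chance, from each rater's marginal frequencies.
    p_e = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data, NOT taken from the study.
radiologist = ["0", "1", "2", "3", "4A", "2", "1", "0"]
model = ["0", "1", "2", "2", "4A", "2", "3", "0"]

print(f"accuracy = {accuracy(radiologist, model):.3f}")
print(f"kappa    = {cohen_kappa(radiologist, model):.3f}")
```

On this toy data the model matches 6 of 8 reference scores (accuracy 0.75), and chance agreement is 14/64, giving κ = 0.68. Kappa is lower than raw accuracy because it discounts matches that would be expected from the marginal score frequencies alone.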
