ChatGPT vs. Gemini: Comparative accuracy and efficiency in Lung-RADS score assignment from radiology reports

Clinical Imaging · Impact Factor 1.5 · JCR Q3, CAS Q4 (Medicine) · Category: Radiology, Nuclear Medicine & Medical Imaging · Pub Date: 2025-05-01 (Epub 2025-03-13) · DOI: 10.1016/j.clinimag.2025.110455
Ria Singh, Mohamed Hamouda, Jordan H. Chamberlin, Adrienn Tóth, James Munford, Matthew Silbergleit, Dhiraj Baruah, Jeremy R. Burt, Ismail M. Kabakus
Clinical Imaging, Volume 121, Article 110455. Full text: https://www.sciencedirect.com/science/article/pii/S0899707125000555
Citations: 0

Abstract

Objective

To evaluate the accuracy of large language models (LLMs) in assigning Lung-RADS scores from lung cancer screening low-dose computed tomography (LDCT) radiology reports.

Material and methods

A retrospective cross-sectional analysis was performed on 242 consecutive LDCT radiology reports generated by cardiothoracic fellowship-trained radiologists at a tertiary center. LLMs evaluated included ChatGPT-3.5, ChatGPT-4o, Google Gemini, and Google Gemini Advanced. Each LLM was used to assign Lung-RADS scores based on the findings section of each report. No domain-specific fine-tuning was applied. Accuracy was determined by comparing the LLM-assigned scores to radiologist-assigned scores. Efficiency was assessed by measuring response times for each LLM.
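The evaluation loop described above can be sketched as follows. This is a minimal illustration, not the authors' actual protocol: the prompt wording, the `query_llm` placeholder, and the sample scores are all assumptions.

```python
# Sketch of the study's evaluation design: each report's findings section is
# sent to an LLM with a zero-shot prompt (no domain-specific fine-tuning),
# and the returned Lung-RADS category is compared to the radiologist's.

PROMPT = (
    "Based on the following low-dose CT findings, assign a Lung-RADS "
    "category (0, 1, 2, 3, 4A, 4B, or 4X). Reply with the category only.\n\n"
)

def query_llm(findings: str) -> str:
    # Hypothetical stand-in for a chat-model API call; in the study this
    # would be ChatGPT-3.5, ChatGPT-4o, Gemini, or Gemini Advanced.
    raise NotImplementedError

def score_accuracy(llm_scores, radiologist_scores):
    """Fraction of reports where the LLM's category matches the radiologist's."""
    assert len(llm_scores) == len(radiologist_scores) > 0
    matches = sum(l == r for l, r in zip(llm_scores, radiologist_scores))
    return matches / len(radiologist_scores)

# Illustrative category lists only (not study data):
llm = ["2", "2", "4A", "1", "3"]
radiologist = ["2", "2", "4B", "1", "3"]
print(f"accuracy = {score_accuracy(llm, radiologist):.1%}")  # 4 of 5 match
```

Exact-match accuracy of this kind is what the reported percentages (e.g., 83.6% for ChatGPT-4o) measure; it treats every category mismatch equally, regardless of clinical distance between categories.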

Results

ChatGPT-4o achieved the highest accuracy (83.6%) in assigning Lung-RADS scores, with ChatGPT-3.5 reaching 70.1%. Gemini and Gemini Advanced had similar accuracy (70.9% and 65.1%, respectively). ChatGPT-3.5 had the fastest response time (median 4 s), while ChatGPT-4o was slower (median 10 s). Higher Lung-RADS categories were associated with marginally longer completion times. ChatGPT-4o demonstrated the greatest agreement with radiologists (κ = 0.836), although this fell short of previously reported human interobserver agreement.
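The agreement statistic reported here (κ = 0.836) is Cohen's kappa, a chance-corrected agreement measure. Assuming the unweighted form (the abstract does not specify weighting), it can be computed from paired category lists as below; the sample labels are illustrative, not study data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(rater_a) == len(rater_b) > 0
    n = len(rater_a)
    # Observed agreement: fraction of identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Illustrative Lung-RADS categories (not study data):
radiologist = ["1", "2", "2", "3", "4A", "2", "1", "2"]
llm         = ["1", "2", "2", "3", "4B", "2", "1", "2"]
print(round(cohens_kappa(radiologist, llm), 3))
```

Because kappa subtracts the agreement expected by chance, it can differ noticeably from raw accuracy when category frequencies are skewed, as they typically are in screening cohorts dominated by Lung-RADS 1 and 2.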

Conclusion

ChatGPT-4o outperformed ChatGPT-3.5, Gemini, and Gemini Advanced in Lung-RADS score assignment accuracy but did not reach the level of human experts. Despite promising results, further work is needed to integrate domain-specific training and ensure LLM reliability for clinical decision-making in lung cancer screening.
Source journal: Clinical Imaging (Medicine – Nuclear Medicine)
CiteScore: 4.60 · Self-citation rate: 0.00% · Annual articles: 265 · Time to review: 35 days
About the journal: The mission of Clinical Imaging is to publish, in a timely manner, the very best radiology research from the United States and around the world, with special attention to the impact of medical imaging on patient care. The journal's publications cover all imaging modalities, radiology issues related to patients, policy and practice improvements, and clinically oriented imaging physics and informatics. The journal is a valuable resource for practicing radiologists, radiologists-in-training, and other clinicians with an interest in imaging. Papers are carefully peer-reviewed and selected by experienced subject editors who are leading experts spanning the range of imaging sub-specialties, which include: Body Imaging; Breast Imaging; Cardiothoracic Imaging; Imaging Physics and Informatics; Molecular Imaging and Nuclear Medicine; Musculoskeletal and Emergency Imaging; Neuroradiology; Practice, Policy & Education; Pediatric Imaging; Vascular and Interventional Radiology.