Lung Cancer Staging Using Chest CT and FDG PET/CT Free-Text Reports: Comparison Among Three ChatGPT Large-Language Models and Six Human Readers of Varying Experience.

American Journal of Roentgenology | Pub Date: 2024-09-04 | DOI: 10.2214/AJR.24.31696 | IF 4.7 | Q1, Radiology, Nuclear Medicine & Medical Imaging
Jong Eun Lee, Ki-Seong Park, Yun-Hyeon Kim, Ho-Chun Song, Byunggeon Park, Yeon Joo Jeong

Abstract

Background: Although radiology reports are commonly used for lung cancer staging, this task can be challenging given radiologists' variable reporting styles as well as reports' potentially ambiguous and/or incomplete staging-related information. Objective: To compare the performance of ChatGPT large-language models (LLMs) and human readers of varying experience in lung cancer staging using chest CT and FDG PET/CT free-text reports. Methods: This retrospective study included 700 patients (mean age, 73.8±29.5 years; 509 male, 191 female) from four institutions in Korea who underwent chest CT or FDG PET/CT for initial staging of non-small cell lung cancer from January 2020 to December 2023. The examination reports used a free-text format, written exclusively in English or in mixed English and Korean. Two thoracic radiologists in consensus determined the overall stage group (IA, IB, IIA, IIB, IIIA, IIIB, IIIC, IVA, IVB) for each report using the AJCC 8th-edition staging system, establishing the reference standard. Three ChatGPT models (GPT-4o, GPT-4, GPT-3.5) determined an overall stage group for each report using a script-based application programming interface, zero-shot learning, and a prompt incorporating a staging system summary. Six human readers (two fellowship-trained radiologists with less experience than the radiologists who determined the reference standard, two fellows, and two residents) also independently determined overall stage groups. GPT-4o's overall accuracy for determining the correct stage among the nine groups was compared with that of the other LLMs and the human readers using McNemar tests.
Results: GPT-4o had an overall staging accuracy of 74.1%, significantly better than the accuracy of GPT-4 (70.1%, p=.02), GPT-3.5 (57.4%, p<.001), and resident 2 (65.7%, p<.001); significantly worse than the accuracy of fellowship-trained radiologist 1 (82.3%, p<.001) and fellowship-trained radiologist 2 (85.4%, p<.001); and not significantly different from the accuracy of fellow 1 (77.7%, p=.09), fellow 2 (75.6%, p=.53), and resident 1 (72.3%, p=.42). Conclusions: The best-performing model, GPT-4o, showed no significant difference in staging accuracy versus fellows, but significantly worse performance versus fellowship-trained radiologists. The findings do not support use of LLMs for lung cancer staging in place of expert healthcare professionals. Clinical Impact: The findings indicate the importance of domain expertise for performing complex specialized tasks such as cancer staging.
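The Methods describe a script-based pipeline: each free-text report is sent through an application programming interface with zero-shot learning and a prompt that embeds a staging-system summary. The sketch below illustrates what such a pipeline could look like using the OpenAI Python SDK's chat-completions call; the prompt wording, function names, and the way the AJCC summary is spliced in are illustrative assumptions, not the authors' actual script.

```python
# Hypothetical sketch of a zero-shot, prompt-based staging pipeline.
# The prompt text and helper names are assumptions for illustration only.

STAGE_GROUPS = ["IA", "IB", "IIA", "IIB", "IIIA", "IIIB", "IIIC", "IVA", "IVB"]

def build_staging_prompt(report_text: str, staging_summary: str) -> str:
    """Compose a single zero-shot prompt: task instruction, an AJCC
    8th-edition staging summary, the free-text report, and a constrained
    answer format (one of the nine overall stage groups)."""
    return (
        "You are assisting with non-small cell lung cancer staging.\n\n"
        f"AJCC 8th-edition staging summary:\n{staging_summary}\n\n"
        f"Radiology report:\n{report_text}\n\n"
        "Answer with exactly one overall stage group from: "
        + ", ".join(STAGE_GROUPS) + "."
    )

def stage_report(client, model: str, report_text: str, staging_summary: str) -> str:
    """Send one report through the chat-completions endpoint
    (e.g. model='gpt-4o') and return the model's stage answer."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": build_staging_prompt(report_text, staging_summary)}],
        temperature=0,  # deterministic decoding for a classification-style task
    )
    return resp.choices[0].message.content.strip()
```

Zero-shot here means no worked staging examples are placed in the prompt; the model relies only on the instruction and the embedded staging summary.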
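The pairwise accuracy comparisons above rely on McNemar tests, which consider only the discordant pairs: reports staged correctly by one reader but not the other. As a minimal stdlib sketch (the study would likely have used standard statistical software, and large samples are often tested with the chi-square form rather than this exact binomial form):

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact (binomial) McNemar test on discordant pairs.
    b = reports reader A staged correctly but reader B did not;
    c = the reverse. Under H0 (equal accuracy), each discordant
    report is equally likely to fall in b or c, so the two-sided
    p-value is a binomial tail probability at p=1/2."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence either way
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)  # doubling one tail can exceed 1
```

Concordant pairs (both readers right, or both wrong) cancel out of the test, which is why paired designs like this one can detect modest accuracy differences.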

Source journal: American Journal of Roentgenology — CiteScore: 12.80 | Self-citation rate: 4.00% | Articles per year: 920 | Time to review: 3 months
Journal description: Founded in 1907, the monthly American Journal of Roentgenology (AJR) is the world's longest continuously published general radiology journal. AJR is recognized as among the specialty's leading peer-reviewed journals and has a worldwide circulation of close to 25,000. The journal publishes clinically oriented articles across all radiology subspecialties, seeking relevance to radiologists' daily practice. The journal publishes hundreds of articles annually in a diverse range of formats, including original research, reviews, clinical perspectives, editorials, and other short reports. The journal engages its audience through a spectrum of social media and digital communication activities.