Large-Scale Validation of the Feasibility of GPT-4 as a Proofreading Tool for Head CT Reports.

IF 12.1 · Q1 (Radiology, Nuclear Medicine & Medical Imaging) · Radiology · Pub Date: 2025-01-01 · DOI: 10.1148/radiol.240701
Songsoo Kim, Donghyun Kim, Hyun Joo Shin, Seung Hyun Lee, Yeseul Kang, Sejin Jeong, Jaewoong Kim, Miran Han, Seong-Joon Lee, Joonho Kim, Jungyon Yum, Changho Han, Dukyong Yoon
Citations: 0

Abstract

Background: The increasing workload of radiologists can lead to burnout and errors in radiology reports. Large language models, such as OpenAI's GPT-4, hold promise as error revision tools for radiology.

Purpose: To test the feasibility of GPT-4 use by determining its error detection, reasoning, and revision performance on head CT reports with varying error types and to validate its clinical utility by comparison with human readers.

Materials and Methods: A total of 10 300 head CT reports were retrospectively extracted from the Medical Information Mart for Intensive Care III public dataset. In experiment 1, among the 300 unaltered reports and 300 versions with applied errors, GPT-4 optimization was initially conducted with 200 reports. The remaining 400 were used for evaluation of error type detection, reasoning, and revision, as well as the analysis of reports with undetected errors. The performance was also compared with that of human readers. In experiment 2, the detection performance of GPT-4 was validated on 10 000 unaltered reports that were deemed error-free by physicians, and an analysis of false-positive results was conducted. A permutation test was conducted to assess differences in performance.

Results: GPT-4 demonstrated commendable performance in error detection (sensitivity, 84% for interpretive errors and 89% for factual errors), reasoning, and revision. Compared with GPT-4, human readers had worse factual error detection sensitivity (0.33-0.69 vs 0.89; P = .008 for radiologist 4, P < .001 for others) and took longer to review (82-121 seconds vs 16 seconds, P < .001). In the 10 000 reports, GPT-4 flagged 96 errors, with a low positive predictive value of 0.05, yet 14% of the false-positive responses were potentially beneficial.

Conclusion: GPT-4 effectively detects, reasons about, and revises errors in radiology reports. While it shows excellent performance in identifying factual errors, its ability to prioritize clinically significant findings is limited. Recognizing its strengths and limitations, GPT-4 could serve as a feasible tool.

© RSNA, 2025. Supplemental material is available for this article. See also the editorial by Choi in this issue.
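The detection metrics reported in the Results can be sanity-checked with simple definitions of sensitivity and positive predictive value. The sketch below is illustrative only: the back-solved true-positive count (~5 of the 96 flags in experiment 2) is an inference from the reported PPV of 0.05, not a figure stated in the paper.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of actual errors that the tool detected."""
    return tp / (tp + fn)

def ppv(tp: int, fp: int) -> float:
    """Fraction of flagged reports that truly contained an error."""
    return tp / (tp + fp)

# Experiment 2: 96 reports flagged among 10 000 deemed error-free.
# A PPV of 0.05 implies roughly 5 of the 96 flags were genuine errors
# that the physicians had missed (hypothetical back-calculation).
flags = 96
true_positives = round(0.05 * flags)   # ≈ 5
false_positives = flags - true_positives
print(true_positives, round(ppv(true_positives, false_positives), 3))
```

This also illustrates why a low PPV on a screened, near-error-free corpus is expected: even a sensitive detector flags mostly false positives when the true error prevalence is very low.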

Source journal: Radiology (Medicine – Nuclear Medicine)
CiteScore: 35.20
Self-citation rate: 3.00%
Annual publications: 596
Review time: 3.6 months
About the journal: Published regularly since 1923 by the Radiological Society of North America (RSNA), Radiology has long been recognized as the authoritative reference for the most current, clinically relevant, and highest-quality research in the field of radiology. Each month the journal publishes approximately 240 pages of peer-reviewed original research, authoritative reviews, well-balanced commentary on significant articles, and expert opinion on new techniques and technologies. Radiology publishes cutting-edge and impactful imaging research articles in radiology and medical imaging in order to help improve human health.