ChatGPT yields low accuracy in determining LI-RADS scores based on free-text and structured radiology reports in German language

Philipp Fervers, Robert Hahnfeldt, J. Kottlors, A. Wagner, D. Maintz, D. Pinto dos Santos, Simon Lennartz, T. Persigehl
Frontiers in Radiology (journal article). Published 2024-07-05. DOI: 10.3389/fradi.2024.1390774 (https://doi.org/10.3389/fradi.2024.1390774)

Abstract

This study investigated the feasibility of using the large language model (LLM) ChatGPT to classify liver lesions according to the Liver Imaging Reporting and Data System (LI-RADS) based on MRI reports, and compared classification performance on structured vs. unstructured reports.

LI-RADS classifiable liver lesions were included from German-language structured and unstructured MRI reports, with reported size, location, and arterial phase contrast enhancement as minimum inclusion requirements. The findings sections of the reports were passed to ChatGPT (GPT-3.5), which was instructed to determine a LI-RADS score for each classifiable liver lesion. Ground truth was established by two radiologists in consensus. Agreement between ground truth and ChatGPT was assessed with Cohen's kappa. Test-retest reliability was assessed by passing a subset of n = 50 lesions to ChatGPT five times and computing the intraclass correlation coefficient (ICC).

205 MRIs from 150 patients were included. ChatGPT's accuracy in determining LI-RADS categories was poor (53% and 44% on unstructured and structured reports, respectively). Compared with structured reports, free-text reports showed higher agreement with the ground truth (κ = 0.51 vs. κ = 0.44), a lower mean absolute error in LI-RADS scores (0.5 ± 0.5 vs. 0.6 ± 0.7, p < 0.05), and higher test-retest reliability (ICC = 0.81 vs. 0.50), although structured reports contained the minimum required imaging features significantly more frequently (chi-square test, p < 0.05).

ChatGPT attained only low accuracy when asked to determine LI-RADS scores from liver imaging reports. The superior accuracy and consistency on free-text reports might relate to ChatGPT's training process. Our study indicates both the need to optimize LLMs for structured clinical data input and the potential of LLMs for creating machine-readable labels from large free-text radiological databases.
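As an illustration of the agreement metrics named in the abstract, the following is a minimal sketch of how accuracy, Cohen's kappa, and mean absolute error could be computed for a set of lesions. It is not the authors' code. The label lists are made-up values, the function names are hypothetical, LI-RADS categories are assumed to be encoded as integers 1 to 5, and the kappa shown is the unweighted form (the abstract does not state whether a weighted variant was used).

```python
from collections import Counter


def accuracy(truth, pred):
    """Fraction of lesions where the predicted category equals the ground truth."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def cohens_kappa(truth, pred):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(truth)
    p_o = accuracy(truth, pred)  # observed agreement
    truth_counts = Counter(truth)
    pred_counts = Counter(pred)
    # Agreement expected by chance from the two marginal distributions.
    p_e = sum(truth_counts[c] * pred_counts[c] for c in truth_counts) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical category for every lesion;
        # kappa is undefined, so report perfect agreement by convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def mean_absolute_error(truth, pred):
    """Average absolute distance between predicted and true category numbers."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)


# Made-up example data (assumption): LI-RADS categories encoded as integers.
ground_truth = [3, 4, 5, 5, 2, 4, 3, 5]  # hypothetical radiologist consensus
model_output = [3, 5, 5, 4, 2, 4, 4, 5]  # hypothetical LLM answers

acc = accuracy(ground_truth, model_output)
kappa = cohens_kappa(ground_truth, model_output)
mae = mean_absolute_error(ground_truth, model_output)

print(f"accuracy = {acc:.3f}")   # accuracy = 0.625
print(f"kappa    = {kappa:.3f}") # kappa    = 0.478
print(f"MAE      = {mae:.3f}")   # MAE      = 0.375
```

In this toy example the raw accuracy (0.625) is noticeably higher than kappa (about 0.478), because kappa discounts the agreement that would be expected by chance given how often each category is used.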