ChatGPT yields low accuracy in determining LI-RADS scores based on free-text and structured radiology reports in German language

Frontiers in Radiology, Pub Date: 2024-07-05, DOI: 10.3389/fradi.2024.1390774

Philipp Fervers, Robert Hahnfeldt, J. Kottlors, A. Wagner, D. Maintz, D. Pinto dos Santos, Simon Lennartz, T. Persigehl


Abstract

To investigate the feasibility of the large language model (LLM) ChatGPT for classifying liver lesions according to the Liver Imaging Reporting and Data System (LI-RADS) based on MRI reports, and to compare classification performance on structured vs. unstructured reports.

LI-RADS classifiable liver lesions were included from structured and unstructured MRI reports written in German, with reporting of size, location, and arterial phase contrast enhancement as minimum inclusion requirements. The findings sections of the reports were passed to ChatGPT (GPT-3.5), which was instructed to determine a LI-RADS score for each classifiable liver lesion. Ground truth was established by two radiologists in consensus. Agreement between ground truth and ChatGPT was assessed with Cohen's kappa. Test-retest reliability was assessed by passing a subset of n = 50 lesions to ChatGPT five times and computing the intraclass correlation coefficient (ICC).

205 MRIs from 150 patients were included. The accuracy of ChatGPT at determining LI-RADS categories was poor (53% and 44% on unstructured and structured reports, respectively). In free-text compared to structured reports, agreement with the ground truth was higher (κ = 0.51 vs. κ = 0.44), the mean absolute error in LI-RADS scores was lower (0.5 ± 0.5 vs. 0.6 ± 0.7, p < 0.05), and test-retest reliability was higher (ICC = 0.81 vs. 0.50), although structured reports contained the minimum required imaging features significantly more frequently (Chi-square test, p < 0.05).

ChatGPT attained only low accuracy when asked to determine LI-RADS scores from liver imaging reports. The superior accuracy and consistency on free-text reports might relate to ChatGPT's training process. Our study indicates both the need to optimize LLMs for structured clinical data input and the potential of LLMs for creating machine-readable labels from large free-text radiological databases.
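For readers unfamiliar with the agreement metrics named in the abstract, the following is a minimal sketch of how accuracy, Cohen's kappa, and mean absolute error can be computed for paired category labels. It is not the authors' analysis code. The label lists are made-up illustration data, the function names are hypothetical, LI-RADS categories are assumed to be encoded as plain integers, and the kappa is assumed to be unweighted (the abstract does not say which variant was used).

```python
from collections import Counter

def accuracy(truth, pred):
    """Fraction of lesions where the predicted category equals the reference."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)

def cohens_kappa(truth, pred):
    """Unweighted Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(truth)
    p_observed = accuracy(truth, pred)
    truth_counts = Counter(truth)
    pred_counts = Counter(pred)
    # Chance agreement: probability that both raters pick the same category
    # if each assigned categories independently at their own marginal rates.
    p_expected = sum(truth_counts[c] * pred_counts[c] for c in truth_counts) / (n * n)
    # Note: undefined (division by zero) if p_expected == 1, i.e. both raters
    # always use one and the same category.
    return (p_observed - p_expected) / (1 - p_expected)

def mean_absolute_error(truth, pred):
    """Average absolute distance between predicted and reference category numbers."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)

# Hypothetical example data (NOT from the study): integer-coded categories.
reference = [3, 4, 5, 5, 3, 4, 5, 3]   # e.g. radiologist consensus
predicted = [3, 4, 4, 5, 4, 4, 5, 5]   # e.g. model output

print(f"accuracy = {accuracy(reference, predicted):.3f}")
print(f"kappa    = {cohens_kappa(reference, predicted):.3f}")
print(f"MAE      = {mean_absolute_error(reference, predicted):.3f}")
```

Kappa is lower than raw accuracy because it discounts the agreement that two raters would reach by chance alone, which is why the abstract reports both figures.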


