Assessing the reliability of ChatGPT4 in the appropriateness of radiology referrals

Marco Parillo, Federica Vaccarino, Daniele Vertulli, Gloria Perillo, Bruno Beomonte Zobel, Carlo Augusto Mallio
{"title":"评估 ChatGPT4 在放射科转诊适当性方面的可靠性","authors":"Marco Parillo ,&nbsp;Federica Vaccarino ,&nbsp;Daniele Vertulli ,&nbsp;Gloria Perillo ,&nbsp;Bruno Beomonte Zobel ,&nbsp;Carlo Augusto Mallio","doi":"10.1016/j.rcro.2024.100155","DOIUrl":null,"url":null,"abstract":"<div><h3>Purpose</h3><p>To investigate the reliability of ChatGPT in grading imaging requests using the Reason for exam Imaging Reporting and Data System (RI-RADS).</p></div><div><h3>Method</h3><p>In this single-center retrospective study, a total of 450 imaging referrals were included. Two human readers independently scored all requests according to RI-RADS. We created a customized RI-RADS GPT where the requests were copied and pasted as inputs, getting as an output the RI-RADS score along with the evaluation of its three subcategories. Pearson's chi-squared test was used to assess whether the distributions of data assigned by the radiologist and ChatGPT differed significantly. Inter-rater reliability for both the overall RI-RADS score and its three subcategories was assessed using Cohen's kappa (κ).</p></div><div><h3>Results</h3><p>RI-RADS D was the most prevalent grade assigned by humans (54% of cases), while ChatGPT more frequently assigned the RI-RADS C (33% of cases). In 2% of cases, ChatGPT assigned the wrong RI-RADS grade, based on the ratings given to the subcategories. The distributions of the RI-RADS grade and the subcategories differed statistically significantly between the radiologist and ChatGPT, apart from RI-RADS grades C and X. The reliability between the radiologist and ChatGPT in assigning RI-RADS score was very low (κ: 0.20), while the agreement between the two human readers was almost perfect (κ: 0.96).</p></div><div><h3>Conclusions</h3><p>ChatGPT may not be reliable for independently scoring the radiology exam requests according to RI-RADS and its subcategories. Furthermore, the low number of complete imaging referrals highlights the need for improved processes to ensure the quality of radiology requests.</p></div>","PeriodicalId":101248,"journal":{"name":"The Royal College of Radiologists Open","volume":"2 ","pages":"Article 100155"},"PeriodicalIF":0.0000,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2773066224000068/pdfft?md5=a64e9eb96e6fe951627a494b801f534c&pid=1-s2.0-S2773066224000068-main.pdf","citationCount":"0","resultStr":"{\"title\":\"Assessing the reliability of ChatGPT4 in the appropriateness of radiology referrals\",\"authors\":\"Marco Parillo ,&nbsp;Federica Vaccarino ,&nbsp;Daniele Vertulli ,&nbsp;Gloria Perillo ,&nbsp;Bruno Beomonte Zobel ,&nbsp;Carlo Augusto Mallio\",\"doi\":\"10.1016/j.rcro.2024.100155\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Purpose</h3><p>To investigate the reliability of ChatGPT in grading imaging requests using the Reason for exam Imaging Reporting and Data System (RI-RADS).</p></div><div><h3>Method</h3><p>In this single-center retrospective study, a total of 450 imaging referrals were included. Two human readers independently scored all requests according to RI-RADS. We created a customized RI-RADS GPT where the requests were copied and pasted as inputs, getting as an output the RI-RADS score along with the evaluation of its three subcategories. Pearson's chi-squared test was used to assess whether the distributions of data assigned by the radiologist and ChatGPT differed significantly. 
Inter-rater reliability for both the overall RI-RADS score and its three subcategories was assessed using Cohen's kappa (κ).</p></div><div><h3>Results</h3><p>RI-RADS D was the most prevalent grade assigned by humans (54% of cases), while ChatGPT more frequently assigned the RI-RADS C (33% of cases). In 2% of cases, ChatGPT assigned the wrong RI-RADS grade, based on the ratings given to the subcategories. The distributions of the RI-RADS grade and the subcategories differed statistically significantly between the radiologist and ChatGPT, apart from RI-RADS grades C and X. The reliability between the radiologist and ChatGPT in assigning RI-RADS score was very low (κ: 0.20), while the agreement between the two human readers was almost perfect (κ: 0.96).</p></div><div><h3>Conclusions</h3><p>ChatGPT may not be reliable for independently scoring the radiology exam requests according to RI-RADS and its subcategories. Furthermore, the low number of complete imaging referrals highlights the need for improved processes to ensure the quality of radiology requests.</p></div>\",\"PeriodicalId\":101248,\"journal\":{\"name\":\"The Royal College of Radiologists Open\",\"volume\":\"2 \",\"pages\":\"Article 100155\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S2773066224000068/pdfft?md5=a64e9eb96e6fe951627a494b801f534c&pid=1-s2.0-S2773066224000068-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"The Royal College of Radiologists Open\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2773066224000068\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Royal College of Radiologists Open","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2773066224000068","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract


Purpose

To investigate the reliability of ChatGPT in grading imaging requests using the Reason for exam Imaging Reporting and Data System (RI-RADS).

Method

In this single-center retrospective study, a total of 450 imaging referrals were included. Two human readers independently scored all requests according to RI-RADS. We created a customized RI-RADS GPT into which each request was copied and pasted as input, and which returned the RI-RADS score along with an evaluation of its three subcategories. Pearson's chi-squared test was used to assess whether the distributions of grades assigned by the radiologist and by ChatGPT differed significantly. Inter-rater reliability for both the overall RI-RADS score and its three subcategories was assessed using Cohen's kappa (κ).
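The statistical comparison described in the Method can be reproduced with standard Python tooling. The sketch below is purely illustrative (it is not the authors' analysis code) and assumes two equally long lists of RI-RADS grades, one per rater, for the same set of referrals; it uses scipy.stats.chi2_contingency for Pearson's chi-squared test and sklearn.metrics.cohen_kappa_score for Cohen's kappa.

```python
# Illustrative sketch (not the authors' code): comparing the grade
# distributions of two raters with Pearson's chi-squared test and
# measuring their agreement with Cohen's kappa.
from collections import Counter

from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

# Hypothetical RI-RADS grades for the same eight referrals.
radiologist_grades = ["D", "D", "C", "B", "D", "X", "C", "A"]
chatgpt_grades = ["C", "D", "C", "C", "B", "X", "D", "A"]

# Build a 2 x K table of grade counts (one row per rater) and test
# whether the two grade distributions differ.
labels = ["A", "B", "C", "D", "X"]
rad_counts = [Counter(radiologist_grades)[g] for g in labels]
gpt_counts = [Counter(chatgpt_grades)[g] for g in labels]
chi2, p_value, dof, _ = chi2_contingency([rad_counts, gpt_counts])
print(f"chi-squared = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")

# Chance-corrected agreement between the paired ratings.
kappa = cohen_kappa_score(radiologist_grades, chatgpt_grades)
print(f"Cohen's kappa = {kappa:.2f}")
```

Cohen's κ is used rather than raw percentage agreement because it corrects for the agreement expected by chance given each rater's grade frequencies.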

Results

RI-RADS D was the most prevalent grade assigned by the human readers (54% of cases), while ChatGPT most frequently assigned RI-RADS C (33% of cases). In 2% of cases, ChatGPT assigned a RI-RADS grade that was inconsistent with the ratings it gave to the subcategories. The distributions of the RI-RADS grades and subcategories differed significantly between the radiologist and ChatGPT, apart from RI-RADS grades C and X. Agreement between the radiologist and ChatGPT in assigning the RI-RADS score was very low (κ: 0.20), while agreement between the two human readers was almost perfect (κ: 0.96).

Conclusions

ChatGPT may not be reliable for independently scoring radiology exam requests according to RI-RADS and its subcategories. Furthermore, the low number of complete imaging referrals highlights the need for improved processes to ensure the quality of radiology requests.
