Assessing the reliability of ChatGPT4 in the appropriateness of radiology referrals
Marco Parillo, Federica Vaccarino, Daniele Vertulli, Gloria Perillo, Bruno Beomonte Zobel, Carlo Augusto Mallio
The Royal College of Radiologists Open, Volume 2, Article 100155 (2024). DOI: 10.1016/j.rcro.2024.100155
Abstract
Purpose
To investigate the reliability of ChatGPT in grading imaging requests using the Reason for exam Imaging Reporting and Data System (RI-RADS).
Method
In this single-center retrospective study, a total of 450 imaging referrals were included. Two human readers independently scored all requests according to RI-RADS. We created a customized RI-RADS GPT into which each request was copied and pasted as input, returning as output the overall RI-RADS grade together with an evaluation of its three subcategories. Pearson's chi-squared test was used to assess whether the distributions of grades assigned by the radiologist and by ChatGPT differed significantly. Inter-rater reliability for both the overall RI-RADS score and its three subcategories was assessed using Cohen's kappa (κ).
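Since the abstract does not include the analysis code, the following is a minimal sketch of how the two statistics described above could be computed in Python with scipy and scikit-learn, assuming each rater's grades are available as simple per-request label lists. The data shown are hypothetical, and the contingency-table construction for the chi-squared test is an assumption rather than the authors' implementation.

```python
# Hedged sketch (not the authors' code): comparing two raters' RI-RADS grades
# with Cohen's kappa and a Pearson chi-squared test.
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

GRADES = ["A", "B", "C", "D", "X"]  # RI-RADS grade labels

# Hypothetical example data: one grade per imaging request, per rater.
radiologist = ["D", "D", "C", "A", "D", "X", "B", "D"]
chatgpt     = ["C", "D", "C", "B", "C", "X", "C", "D"]

# Inter-rater reliability: unweighted Cohen's kappa on the categorical grades.
kappa = cohen_kappa_score(radiologist, chatgpt, labels=GRADES)

# Pearson chi-squared test on the grade frequency distributions of the two
# raters (one possible reading of the comparison described in the Method).
counts = np.array([
    [radiologist.count(g) for g in GRADES],
    [chatgpt.count(g) for g in GRADES],
])
# Drop grades never assigned by either rater to avoid all-zero columns.
counts = counts[:, counts.sum(axis=0) > 0]
chi2, p_value, dof, _ = chi2_contingency(counts)

print(f"Cohen's kappa: {kappa:.2f}")
print(f"Chi-squared: {chi2:.2f}, p = {p_value:.3f}, dof = {dof}")
```

The kappa here is unweighted; for ordinal grades a weighted variant could also be considered, but the abstract does not specify which form was used.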
Results
RI-RADS D was the grade most frequently assigned by the human readers (54% of cases), whereas ChatGPT most frequently assigned RI-RADS C (33% of cases). In 2% of cases, ChatGPT assigned an overall RI-RADS grade that was inconsistent with the ratings it gave to the subcategories. The distributions of the RI-RADS grades and of the subcategories differed statistically significantly between the radiologist and ChatGPT, apart from RI-RADS grades C and X. The reliability between the radiologist and ChatGPT in assigning the RI-RADS score was very low (κ: 0.20), whereas agreement between the two human readers was almost perfect (κ: 0.96).
Conclusions
ChatGPT may not be reliable for independently scoring radiology exam requests according to RI-RADS and its subcategories. Furthermore, the low number of complete imaging referrals highlights the need for improved processes to ensure the quality of radiology requests.