Purpose
To investigate the reliability of ChatGPT in grading imaging requests using the Reason for exam Imaging Reporting and Data System (RI-RADS).
Method
In this single-center retrospective study, a total of 450 imaging referrals were included. Two human readers independently scored all requests according to RI-RADS. We created a customized RI-RADS GPT into which each request was pasted as input, returning the overall RI-RADS grade together with an evaluation of its three subcategories. Pearson's chi-squared test was used to assess whether the distributions of grades assigned by the radiologist and by ChatGPT differed significantly. Inter-rater reliability for both the overall RI-RADS grade and its three subcategories was assessed using Cohen's kappa (κ).
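The statistical comparison can be illustrated with the minimal Python sketch below, which assumes the paired grades are available as two lists of labels and uses scipy and scikit-learn; it is not the study's actual analysis code, and the grade values shown are hypothetical.

```python
# Illustrative sketch (not the study's analysis code): comparing RI-RADS grade
# distributions with Pearson's chi-squared test and measuring inter-rater
# agreement with Cohen's kappa. The example labels below are hypothetical.
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

# Hypothetical overall RI-RADS grades assigned to the same referrals.
radiologist = ["D", "D", "C", "B", "D", "X", "C", "D", "A", "B"]
chatgpt     = ["C", "D", "C", "C", "B", "X", "C", "D", "B", "B"]

# Pearson's chi-squared test on the rater-by-grade contingency table,
# testing whether the two grade distributions differ.
counts = pd.DataFrame({
    "radiologist": pd.Series(radiologist).value_counts(),
    "chatgpt": pd.Series(chatgpt).value_counts(),
}).fillna(0)
chi2, p_value, dof, expected = chi2_contingency(counts.T)
print(f"chi-squared = {chi2:.2f}, p = {p_value:.3f}")

# Cohen's kappa for agreement on the overall grade; the same call would be
# repeated for each of the three RI-RADS subcategories.
kappa = cohen_kappa_score(radiologist, chatgpt)
print(f"Cohen's kappa = {kappa:.2f}")
```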
Results
RI-RADS D was the most prevalent grade assigned by the human readers (54% of cases), whereas ChatGPT most frequently assigned RI-RADS C (33% of cases). In 2% of cases, ChatGPT assigned an incorrect RI-RADS grade given the ratings it assigned to the subcategories. The distributions of the overall RI-RADS grade and of the subcategories differed significantly between the radiologist and ChatGPT, except for RI-RADS grades C and X. Agreement between the radiologist and ChatGPT on the overall RI-RADS grade was very low (κ = 0.20), whereas agreement between the two human readers was almost perfect (κ = 0.96).
Conclusions
ChatGPT may not be reliable for independently scoring radiology exam requests according to RI-RADS and its subcategories. Furthermore, the low number of complete imaging referrals highlights the need for improved processes to ensure the quality of radiology requests.