This study developed a generative pre-trained transformer (GPT)-based rater to assess communication skills using the Gap-Kalamazoo Communication Skills Assessment Form (GKCSAF) and examined its inter-rater reliability and concurrent validity. The GPT rater assessed 80 therapist-patient interaction transcripts that had previously been assessed by human raters. For inter-rater reliability at the total-score level, the GPT rater's assessments showed acceptable differences from human ratings (mean absolute error % [MAE%] = 12.2%-21.0%) but low intraclass correlation coefficients (ICCs) with them (0.00-0.35), which may reflect limited score variability. At the domain level, only four of the nine domains showed acceptable differences (MAE% ≤ 30.3%), and all nine domains showed poor agreement (weighted κ ≤ 0.38). For concurrent validity, the GPT rater's assessments likewise showed acceptable differences but low ICC values relative to the averaged human scores, at both the total-score level (MAE% = 10.8%-11.5%; ICC = 0.12-0.36) and the domain level (MAE% = 14.0%-30.3%; ICC = 0.00-0.37). Overall, the GPT rater may serve as a supplementary tool for providing total scores in low-stakes assessments of communication skills. Its performance at the domain level appears limited, underscoring the need for caution when interpreting domain scores and for further refinement before use in high-stakes or detailed assessment contexts.
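For readers unfamiliar with the agreement statistics reported above, the following is a minimal sketch, assuming simulated paired GPT and human ratings, of how MAE%, ICC, and quadratic-weighted κ can be computed in Python. The simulated data, the 1-5 domain scale and 9-45 total range, the ICC(2,1) model, the MAE% normalization by score range, and the use of quadratic weights are illustrative assumptions and not the study's actual analysis code.

```python
# Minimal sketch of the agreement metrics named in the abstract
# (MAE%, ICC, quadratic-weighted kappa) for paired GPT vs. human ratings.
# All data and scale assumptions below are illustrative, not from the study.
import numpy as np
import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
n_transcripts = 80

# Assumed GKCSAF scoring: nine domains rated 1-5, so totals range 9-45.
human_total = rng.integers(9, 46, size=n_transcripts)
gpt_total = rng.integers(9, 46, size=n_transcripts)
human_domain = rng.integers(1, 6, size=n_transcripts)  # one example domain
gpt_domain = rng.integers(1, 6, size=n_transcripts)

# MAE%: mean absolute error normalized by the possible total-score range.
score_range = 45 - 9
mae_pct = np.mean(np.abs(human_total - gpt_total)) / score_range * 100

# ICC: two-way random-effects, single-rater, absolute agreement (ICC2),
# computed from a long-format table of (transcript, rater, score).
long = pd.DataFrame({
    "transcript": np.tile(np.arange(n_transcripts), 2),
    "rater": ["human"] * n_transcripts + ["gpt"] * n_transcripts,
    "score": np.concatenate([human_total, gpt_total]),
})
icc_table = pg.intraclass_corr(data=long, targets="transcript",
                               raters="rater", ratings="score")
icc2 = icc_table.loc[icc_table["Type"] == "ICC2", "ICC"].item()

# Weighted kappa: quadratic weights for ordinal domain-level scores.
kappa = cohen_kappa_score(human_domain, gpt_domain, weights="quadratic")

print(f"MAE% = {mae_pct:.1f}%  ICC(2,1) = {icc2:.2f}  weighted kappa = {kappa:.2f}")
```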
