Purpose: Assessments in medical education play a central role in evaluating trainees' progress and eventual competence. Generative artificial intelligence is finding an increasing role in clinical care and medical education. The objective of this study was to evaluate the ability of the large language model ChatGPT to generate examination questions that discriminate among graduating urology residents.
Materials and methods: Graduating urology residents representing all Canadian training programs gather yearly for a mock examination that simulates their upcoming board certification examination. The examination consists of a written multiple-choice question (MCQ) component and an oral objective structured clinical examination. In 2023, ChatGPT Version 4 was used to generate 20 MCQs that were added to the written component. ChatGPT was asked to use Campbell-Walsh Urology, American Urological Association (AUA) guidelines, and Canadian Urological Association guidelines as resources. Psychometric analysis of the ChatGPT MCQs was conducted. The MCQs were also reviewed by 3 faculty for face validity and to ascertain whether they came from a valid source.
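For reference, the two item statistics reported below follow their conventional psychometric definitions (the formulas here are standard in the literature, not reproduced from the study itself): the discrimination index is \(D = p_U - p_L\), the difference in an item's proportion correct between upper- and lower-scoring examinee groups (commonly the top and bottom 27%, though the exact grouping used in this analysis is an assumption); the point-biserial correlation is \(r_{pb} = \frac{M_1 - M_0}{s_T}\sqrt{pq}\), where \(M_1\) and \(M_0\) are the mean total scores of examinees answering the item correctly and incorrectly, \(s_T\) is the standard deviation of total scores, and \(p = 1 - q\) is the item's proportion correct.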
Results: The mean score of the 35 examination takers on the ChatGPT MCQs was 60.7% vs 61.1% for the overall examination. Twenty-five percent of ChatGPT MCQs showed a discrimination index > 0.3, the threshold for questions that properly discriminate between high and low examination performers. Twenty-five percent of ChatGPT MCQs showed a point-biserial correlation > 0.2, which is considered a high correlation with overall performance on the examination. The faculty assessment found that ChatGPT MCQs often provided incomplete information in the stem, offered multiple potentially correct answers, and were sometimes not rooted in the literature. Thirty-five percent of the MCQs generated by ChatGPT provided wrong answers to their stems.
Conclusions: Despite apparently similar performance on the ChatGPT MCQs and the overall examination, ChatGPT MCQs tend not to be highly discriminating. Poorly phrased questions and artificial intelligence hallucinations remain an ever-present risk. ChatGPT-generated questions should be carefully vetted for quality before their use on assessments in urology training examinations.