Artificial Intelligence as a Discriminator of Competence in Urological Training: Are We There?
Naji J Touma, Ruchit Patel, Thomas Skinner, Michael Leveridge
The Journal of Urology, pp. 504-511. Published April 1, 2025 (Epub December 9, 2024). DOI: 10.1097/JU.0000000000004357
Abstract
Purpose: Assessments in medical education play a central role in evaluating trainees' progress and eventual competence. Generative artificial intelligence is finding an increasing role in clinical care and medical education. The objective of this study was to evaluate the ability of the large language model ChatGPT to generate examination questions that are discriminating in the evaluation of graduating urology residents.
Materials and methods: Graduating urology residents representing all Canadian training programs gather yearly for a mock examination that simulates their upcoming board certification examination. The examination consists of a written multiple-choice question (MCQ) examination and an oral objective structured clinical examination. In 2023, ChatGPT Version 4 was used to generate 20 MCQs that were added to the written component. ChatGPT was asked to use Campbell-Walsh Urology, AUA guidelines, and Canadian Urological Association guidelines as resources. Psychometric analysis of the ChatGPT MCQs was conducted. The MCQs were also reviewed by 3 faculty members for face validity and to ascertain whether they came from a valid source.
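To make the generation step concrete, the sketch below shows how a comparable prompt could be issued programmatically. This is an illustration only: the study used ChatGPT 4 interactively, and the exact prompt wording, the use of the OpenAI Python SDK, and the requested output format are assumptions rather than the authors' protocol.

```python
# Hypothetical sketch of the MCQ-generation step; not the authors' procedure.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Generate 20 board-style multiple-choice questions for graduating urology "
    "residents. Base each question on Campbell-Walsh Urology, AUA guidelines, "
    "or Canadian Urological Association guidelines, cite the source, provide "
    "4 answer options with a single best answer, and include a brief rationale."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```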
Results: The mean score of the 35 examination takers on the ChatGPT MCQs was 60.7% vs 61.1% for the overall examination. Twenty-five percent of the ChatGPT MCQs showed a discrimination index > 0.3, the threshold for questions that properly discriminate between high and low examination performers. Twenty-five percent of the ChatGPT MCQs showed a point biserial > 0.2, which is considered a high correlation with overall performance on the examination. The faculty assessment found that ChatGPT MCQs often provided incomplete information in the stem, offered multiple potentially correct answers, and were sometimes not rooted in the literature. Thirty-five percent of the MCQs generated by ChatGPT provided wrong answers to their stems.
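For readers less familiar with the two item statistics cited above, the following minimal sketch shows how a discrimination index (upper-group minus lower-group proportion correct) and a point-biserial item-total correlation are conventionally computed. The function names and the simulated 35-examinee data are hypothetical and are not drawn from the study.

```python
import numpy as np

def discrimination_index(item_correct, total_scores, group_frac=0.27):
    """Proportion correct in the top-scoring group minus proportion correct
    in the bottom-scoring group (groups defined by overall examination score).
    A value > 0.3 is a common threshold for a well-discriminating item."""
    item_correct = np.asarray(item_correct, dtype=float)
    total_scores = np.asarray(total_scores, dtype=float)
    k = max(1, int(round(group_frac * len(total_scores))))
    order = np.argsort(total_scores)            # ascending by overall score
    low, high = order[:k], order[-k:]
    return item_correct[high].mean() - item_correct[low].mean()

def point_biserial(item_correct, total_scores):
    """Point-biserial correlation between a dichotomous item (0/1) and the
    overall examination score; > 0.2 is often read as a strong item-total link."""
    item_correct = np.asarray(item_correct, dtype=float)
    total_scores = np.asarray(total_scores, dtype=float)
    p = item_correct.mean()                     # proportion answering correctly
    m1 = total_scores[item_correct == 1].mean() # mean total score, item correct
    m0 = total_scores[item_correct == 0].mean() # mean total score, item wrong
    return (m1 - m0) / total_scores.std() * np.sqrt(p * (1 - p))

# Hypothetical data: 35 examinees' overall % scores and one item's 0/1 results.
rng = np.random.default_rng(0)
total = rng.normal(61, 8, 35)
item = (total + rng.normal(0, 10, 35) > 60).astype(int)
print(discrimination_index(item, total), point_biserial(item, total))
```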
Conclusions: Despite seemingly similar performance on the ChatGPT MCQs and the overall examination, ChatGPT MCQs tend not to be highly discriminating. Poorly phrased questions and the potential for artificial intelligence hallucinations are ever present. ChatGPT-generated questions should be carefully vetted for quality before they are used in urology training examinations and assessments.
About the Journal:
The Official Journal of the American Urological Association (AUA), and the most widely read and highly cited journal in the field, The Journal of Urology® brings solid coverage of the clinically relevant content needed to stay at the forefront of the dynamic field of urology. This premier journal presents investigative studies on critical areas of research and practice, survey articles providing short condensations of the best and most important urology literature worldwide, and practice-oriented reports on significant clinical observations.