Artificial Intelligence as a Discriminator of Competence in Urological Training: Are We There?

IF 5.9 | CAS Region 2 (Medicine) | Q1 (Urology & Nephrology) | Journal of Urology | Pub Date: 2025-04-01 | Epub Date: 2024-12-09 | DOI: 10.1097/JU.0000000000004357
Naji J Touma, Ruchit Patel, Thomas Skinner, Michael Leveridge
{"title":"Artificial Intelligence as a Discriminator of Competence in Urological Training: Are We There?","authors":"Naji J Touma, Ruchit Patel, Thomas Skinner, Michael Leveridge","doi":"10.1097/JU.0000000000004357","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Assessments in medical education play a central role in evaluating trainees' progress and eventual competence. Generative artificial intelligence is finding an increasing role in clinical care and medical education. The objective of this study was to evaluate the ability of the large language model ChatGPT to generate examination questions that are discriminating in the evaluation of graduating urology residents.</p><p><strong>Materials and methods: </strong>Graduating urology residents representing all Canadian training programs gather yearly for a mock examination that simulates their upcoming board certification examination. The examination consists of a written multiple-choice question (MCQ) examination and an oral objective structured clinical examination. In 2023, ChatGPT Version 4 was used to generate 20 MCQs that were added to the written component. ChatGPT was asked to use Campbell-Walsh Urology, AUA, and Canadian Urological Association guidelines as resources. Psychometric analysis of the ChatGPT MCQs was conducted. The MCQs were also researched by 3 faculty for face validity and to ascertain whether they came from a valid source.</p><p><strong>Results: </strong>The mean score of the 35 examination takers on the ChatGPT MCQs was 60.7% vs 61.1% for the overall examination. Twenty-five of ChatGPT MCQs showed a discrimination index > 0.3, the threshold for questions that properly discriminate between high and low examination performers. Twenty-five percent of ChatGPT MCQs showed a point biserial > 0.2, which is considered a high correlation with overall performance on the examination. The assessment by faculty found that ChatGPT MCQs often provided incomplete information in the stem, provided multiple potentially correct answers, and were sometimes not rooted in the literature. Thirty-five percent of the MCQs generated by ChatGPT provided wrong answers to stems.</p><p><strong>Conclusions: </strong>Despite what seems to be similar performance on ChatGPT MCQs and the overall examination, ChatGPT MCQs tend not to be highly discriminating. Poorly phrased questions with potential for artificial intelligence hallucinations are ever present. Careful vetting for quality of ChatGPT questions should be undertaken before their use on assessments in urology training examinations.</p>","PeriodicalId":17471,"journal":{"name":"Journal of Urology","volume":" ","pages":"504-511"},"PeriodicalIF":5.9000,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Urology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1097/JU.0000000000004357","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/12/9 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"UROLOGY & NEPHROLOGY","Score":null,"Total":0}
Citations: 0

Abstract

Purpose: Assessments in medical education play a central role in evaluating trainees' progress and eventual competence. Generative artificial intelligence is finding an increasing role in clinical care and medical education. The objective of this study was to evaluate the ability of the large language model ChatGPT to generate examination questions that are discriminating in the evaluation of graduating urology residents.

Materials and methods: Graduating urology residents representing all Canadian training programs gather yearly for a mock examination that simulates their upcoming board certification examination. The examination consists of a written multiple-choice question (MCQ) component and an oral objective structured clinical examination. In 2023, ChatGPT Version 4 was used to generate 20 MCQs that were added to the written component. ChatGPT was asked to use Campbell-Walsh Urology as well as AUA and Canadian Urological Association guidelines as resources. Psychometric analysis of the ChatGPT MCQs was conducted. The MCQs were also reviewed by 3 faculty members to assess face validity and to ascertain whether they came from a valid source.
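For context, the discrimination index and point biserial reported in the Results below are standard psychometric item statistics. Their textbook definitions from classical test theory are given here for reference; the paper's abstract does not spell out the formulas or the group-cutoff convention, so those details are assumptions:

```latex
D = p_{\mathrm{upper}} - p_{\mathrm{lower}},
\qquad
r_{pb} = \frac{\bar{X}_1 - \bar{X}_0}{s_X}\,\sqrt{p\,q}
```

Here p_upper and p_lower are the proportions of the top- and bottom-scoring examinee groups (conventionally the upper and lower 27% by total score) answering the item correctly; X̄₁ and X̄₀ are the mean total scores of examinees who answered the item correctly and incorrectly; s_X is the standard deviation of total scores; and p = 1 − q is the item's overall proportion correct.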

Results: The mean score of the 35 examination takers on the ChatGPT MCQs was 60.7% vs 61.1% for the overall examination. Twenty-five percent of ChatGPT MCQs showed a discrimination index > 0.3, the threshold for questions that properly discriminate between high and low examination performers. Twenty-five percent of ChatGPT MCQs showed a point biserial > 0.2, which is considered a high correlation with overall performance on the examination. The faculty assessment found that ChatGPT MCQs often provided incomplete information in the stem, offered multiple potentially correct answers, and were sometimes not rooted in the literature. Thirty-five percent of the MCQs generated by ChatGPT provided wrong answers to their stems.
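To make the thresholds above concrete, the sketch below computes both statistics from a 0/1 response matrix using the standard definitions given earlier. This is illustrative code, not the authors' analysis pipeline; the 27% tail cutoff and the simulated data are assumptions for demonstration only.

```python
import numpy as np

def item_statistics(responses: np.ndarray, item: int, tail: float = 0.27):
    """Return (discrimination_index, point_biserial) for one 0/1-scored item.

    responses: array of shape (n_examinees, n_items) with entries 0 or 1.
    tail: fraction of examinees forming the upper/lower groups (27% is a
          common convention; the paper does not state which cutoff was used).
    """
    totals = responses.sum(axis=1)
    order = np.argsort(totals)               # examinees sorted low -> high
    n_tail = max(1, int(round(tail * len(totals))))

    # Discrimination index D: P(correct | top group) - P(correct | bottom group).
    p_lower = responses[order[:n_tail], item].mean()
    p_upper = responses[order[-n_tail:], item].mean()
    discrimination = p_upper - p_lower

    # Point biserial: Pearson correlation between the item score and the total
    # score on the remaining items (excluding the item itself avoids a
    # spurious self-correlation).
    rest = totals - responses[:, item]
    point_biserial = np.corrcoef(responses[:, item], rest)[0, 1]
    return discrimination, point_biserial

# Toy usage on simulated data (35 examinees, 20 items, mirroring the study's
# dimensions; the values themselves are random, not study data).
rng = np.random.default_rng(0)
ability = rng.normal(size=35)
difficulty = rng.normal(size=20)
prob_correct = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
data = (rng.random((35, 20)) < prob_correct).astype(int)

d, r_pb = item_statistics(data, item=0)
print(f"discrimination index = {d:.2f}, point biserial = {r_pb:.2f}")
```

Under the study's criteria, an item would count as properly discriminating if d > 0.3 and as highly correlated with overall performance if r_pb > 0.2.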

Conclusions: Despite apparently similar performance on the ChatGPT MCQs and the overall examination, ChatGPT MCQs tend not to be highly discriminating. Poorly phrased questions, with the ever-present potential for artificial intelligence hallucinations, remain a concern. ChatGPT-generated questions should be carefully vetted for quality before use in urology training examinations.

Source journal: Journal of Urology (Medicine: Urology & Nephrology)
CiteScore: 11.50
Self-citation rate: 7.60%
Publication volume: 3746
Review turnaround: 2-3 weeks
About the journal: The Official Journal of the American Urological Association (AUA), and the most widely read and highly cited journal in the field, The Journal of Urology® brings solid coverage of the clinically relevant content needed to stay at the forefront of the dynamic field of urology. This premier journal presents investigative studies on critical areas of research and practice, survey articles providing short condensations of the best and most important urology literature worldwide, and practice-oriented reports on significant clinical observations.
Latest articles in this journal:
- Urologic Oncology: Bladder, Penis, and Urethral Cancer and Basic Principles of Oncology.
- Intraoperative Tranexamic Acid in Radical Cystectomy: Impact on Bleeding, Thromboembolism, and Survival Outcomes.
- Spontaneous Resolution of Primary Obstructive Megaureter: Risk Stratification and Prediction Based on Early Sonographic Factors.
- Validation of Prognostic Models for Renal Cell Carcinoma Recurrence, Cancer-Specific Mortality, and All-Cause Mortality.
- Development and Validation of the Length, Segment, and Etiology Anterior Urethral Stricture Disease Staging System Using Longitudinal Urethroplasty Outcomes Data From the Trauma and Urologic Reconstructive Network of Surgeons.