AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination.

IF 3.2 | Zone 2 (Medicine) | JCR Q1, Education & Educational Research | BMC Medical Education | Pub Date: 2025-02-08 | DOI: 10.1186/s12909-025-06796-6

Alex Kk Law, Jerome So, Chun Tat Lui, Yu Fai Choi, Koon Ho Cheung, Kevin Kei-Ching Hung, Colin Alexander Graham

BMC Medical Education, 25(1), 208. Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11806894/pdf/

Abstract

Background: The creation of high-quality multiple-choice questions (MCQs) is essential for medical education assessments but is resource-intensive and time-consuming when done by human experts. Large language models (LLMs) such as ChatGPT-4o offer a promising alternative, but their efficacy remains unclear, particularly in high-stakes exams.

Objective: This study aimed to evaluate the quality and psychometric properties of ChatGPT-4o-generated MCQs compared to human-created MCQs in a high-stakes medical licensing exam.

Methods: A prospective cohort study was conducted among medical doctors preparing for the Primary Examination on Emergency Medicine (PEEM) organised by the Hong Kong College of Emergency Medicine in August 2024. Participants attempted two sets of 100 MCQs: one AI-generated and one human-generated. Expert reviewers assessed MCQs for factual correctness, relevance, difficulty, alignment with Bloom's taxonomy (remember, understand, apply and analyse), and item-writing flaws. Psychometric analyses were performed, including difficulty and discrimination indices and KR-20 reliability. Candidate performance and time efficiency were also evaluated.
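
The psychometric measures named in the Methods can be illustrated with a minimal sketch. The abstract does not say how the authors computed them, so the choices below are assumptions: the difficulty index is taken as the proportion of candidates answering an item correctly, the discrimination index uses the common upper-lower group method with a 27% cut-off, and KR-20 uses the population variance of total scores. The function names and the toy response matrix are hypothetical and are not study data.

```python
from statistics import pvariance


def difficulty_index(responses):
    """Proportion of candidates answering each item correctly.

    `responses` is a list of candidate rows; each row holds 0/1 per item.
    A higher value means an easier item.
    """
    n_candidates = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_candidates
            for j in range(n_items)]


def discrimination_index(responses, fraction=0.27):
    """Upper-lower group discrimination index per item.

    Assumption: groups are the top and bottom `fraction` of candidates
    ranked by total score (27% is a conventional choice, not taken
    from the paper).
    """
    ranked = sorted(responses, key=sum, reverse=True)
    group_size = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:group_size], ranked[-group_size:]
    n_items = len(responses[0])
    return [(sum(r[j] for r in upper) - sum(r[j] for r in lower)) / group_size
            for j in range(n_items)]


def kr20(responses):
    """Kuder-Richardson 20 reliability for dichotomous (0/1) items.

    Assumption: population variance of total scores is used; some
    texts use the sample variance instead.
    """
    n_items = len(responses[0])
    p = difficulty_index(responses)
    total_variance = pvariance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("KR-20 is undefined when all total scores are equal")
    item_variance_sum = sum(pj * (1 - pj) for pj in p)
    return (n_items / (n_items - 1)) * (1 - item_variance_sum / total_variance)


# Toy data (invented for illustration): 5 candidates x 4 items.
toy_responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

if __name__ == "__main__":
    print("difficulty:", difficulty_index(toy_responses))
    print("discrimination:", discrimination_index(toy_responses))
    print("KR-20:", kr20(toy_responses))
```

With only 24 participants, as in this study, the upper and lower groups would contain about six candidates each, so per-item discrimination values from such a sketch would be noisy.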

Results: Among 24 participants, AI-generated MCQs were easier (mean difficulty index = 0.78 ± 0.22 vs. 0.69 ± 0.23, p < 0.01) but showed similar discrimination indices to human MCQs (mean = 0.22 ± 0.23 vs. 0.26 ± 0.26). Agreement was moderate (ICC = 0.62, p = 0.01, 95% CI: 0.12-0.84). Expert reviews identified more factual inaccuracies (6% vs. 4%), irrelevance (6% vs. 0%), and inappropriate difficulty levels (14% vs. 1%) in AI MCQs. AI questions primarily tested lower-order cognitive skills, while human MCQs better assessed higher-order skills (χ² = 14.27, p = 0.003). AI significantly reduced time spent on question generation (24.5 vs. 96 person-hours).
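
The reported time saving can be put in relative terms with a short calculation. This is only arithmetic on the two person-hour figures quoted in the Results; the function name is hypothetical, and the abstract does not break down what each figure includes.

```python
def time_saving(ai_hours, human_hours):
    """Return absolute saving, relative reduction, and speed-up factor."""
    saved = human_hours - ai_hours
    relative_reduction = saved / human_hours
    speedup = human_hours / ai_hours
    return saved, relative_reduction, speedup


# Person-hours for question generation as reported in the abstract.
saved, relative_reduction, speedup = time_saving(ai_hours=24.5, human_hours=96.0)

if __name__ == "__main__":
    print(f"saved: {saved:.1f} person-hours")
    print(f"relative reduction: {relative_reduction:.1%}")
    print(f"speed-up: {speedup:.2f}x")
```

On these figures, AI-assisted generation took roughly a quarter of the time needed for human authoring.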

Conclusion: ChatGPT-4o demonstrates the potential for efficiently generating MCQs but lacks the depth needed for complex assessments. Human review remains essential to ensure quality. Combining AI efficiency with expert oversight could optimise question creation for high-stakes exams, offering a scalable model for medical education that balances time efficiency and content quality.

Source journal: BMC Medical Education (category: Education, Scientific Disciplines)

CiteScore: 4.90
Self-citation rate: 11.10%
Articles published: 795
Review time: 6 months

About the journal: BMC Medical Education is an open access journal publishing original peer-reviewed research articles in relation to the training of healthcare professionals, including undergraduate, postgraduate, and continuing education. The journal has a special focus on curriculum development, evaluations of performance, assessment of training needs and evidence-based medicine.