
Latest publications from the Journal of Educational Evaluation for Health Professions

Performance of large language models in medical licensing examinations: a systematic review and meta-analysis.
IF 3.7 Q1 EDUCATION, SCIENTIFIC DISCIPLINES Pub Date : 2025-01-01 Epub Date: 2025-11-18 DOI: 10.3352/jeehp.2025.22.36
Haniyeh Nouri, Abdollah Mahdavi, Ali Abedi, Alireza Mohammadnia, Mahnaz Hamedan, Masoud Amanzadeh

Purpose: This study systematically evaluates and compares the performance of large language models (LLMs) in answering medical licensing examination questions. By conducting subgroup analyses based on language, question format, and model type, this meta-analysis aims to provide a comprehensive overview of LLM capabilities in medical education and clinical decision-making.

Methods: This systematic review, registered in PROSPERO and following PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, searched MEDLINE (PubMed), Scopus, and Web of Science for relevant articles published up to February 1, 2025. The search strategy included Medical Subject Headings (MeSH) terms and keywords related to ("ChatGPT" OR "GPT" OR "LLM variants") AND ("medical licensing exam*" OR "medical exam*" OR "medical education" OR "radiology exam*"). Eligible studies evaluated LLM accuracy on medical licensing examination questions. Pooled accuracy was estimated using a random-effects model, with subgroup analyses by LLM type, language, and question format. Publication bias was assessed using Egger's regression test.
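
The abstract does not name the statistical software, so the following is a minimal Python sketch, using hypothetical per-study counts, of how a DerSimonian-Laird random-effects pooled accuracy, the I² heterogeneity statistic, and Egger's regression test can be computed.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical per-study data: correct answers and total questions.
correct = np.array([180, 150, 210, 120])
total = np.array([250, 200, 280, 190])

p = correct / total                        # per-study accuracy
var = p * (1 - p) / total                  # binomial variance of each proportion
w = 1 / var                                # inverse-variance (fixed-effect) weights

# DerSimonian-Laird estimate of between-study variance (tau^2)
p_fixed = np.sum(w * p) / np.sum(w)
Q = np.sum(w * (p - p_fixed) ** 2)         # Cochran's Q
df = len(p) - 1
C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - df) / C)

# Random-effects pooled accuracy with a 95% confidence interval
w_re = 1 / (var + tau2)
p_re = np.sum(w_re * p) / np.sum(w_re)
se_re = np.sqrt(1 / np.sum(w_re))
ci = (p_re - 1.96 * se_re, p_re + 1.96 * se_re)

i2 = max(0.0, (Q - df) / Q) * 100          # I^2 statistic (%)

# Egger's test: regress the standardized effect on precision; a nonzero
# intercept suggests small-study (publication) asymmetry.
snd, precision = p / np.sqrt(var), 1 / np.sqrt(var)
egger = sm.OLS(snd, sm.add_constant(precision)).fit()
print(f"pooled={p_re:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f}), "
      f"I2={i2:.1f}%, Egger intercept P={egger.pvalues[0]:.3f}")
```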

Results: This systematic review identified 2,404 studies. After duplicates were removed and irrelevant articles excluded through title and abstract screening, 36 studies remained for inclusion after full-text review. The pooled accuracy was 72% (95% confidence interval, 70.0% to 75.0%) with high heterogeneity (I²=99%, P<0.001). Among LLMs, GPT-4 achieved the highest accuracy (81%), followed by Bing (79%), Claude (74%), Gemini/Bard (70%), and GPT-3.5 (60%) (P=0.001). Performance differences across languages (range, 62% in Polish to 77% in German) were not statistically significant (P=0.170).

Conclusion: LLMs, particularly GPT-4, can match or exceed medical students' examination performance and may serve as supportive educational tools. However, due to variability and the risk of errors, they should be used cautiously as complements rather than replacements for traditional learning methods.

{"title":"Performance of large language models in medical licensing examinations: a systematic review and meta-analysis.","authors":"Haniyeh Nouri, Abdollah Mahdavi, Ali Abedi, Alireza Mohammadnia, Mahnaz Hamedan, Masoud Amanzadeh","doi":"10.3352/jeehp.2025.22.36","DOIUrl":"10.3352/jeehp.2025.22.36","url":null,"abstract":"<p><strong>Purpose: </strong>This study systematically evaluates and compares the performance of large language models (LLMs) in answering medical licensing examination questions. By conducting subgroup analyses based on language, question format, and model type, this meta-analysis aims to provide a comprehensive overview of LLM capabilities in medical education and clinical decision-making.</p><p><strong>Methods: </strong>This systematic review, registered in PROSPERO and following PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, searched MEDLINE (PubMed), Scopus, and Web of Science for relevant articles published up to February 1, 2025. The search strategy included Medical Subject Headings (MeSH) terms and keywords related to (\"ChatGPT\" OR \"GPT\" OR \"LLM variants\") AND (\"medical licensing exam*\" OR \"medical exam*\" OR \"medical education\" OR \"radiology exam*\"). Eligible studies evaluated LLM accuracy on medical licensing examination questions. Pooled accuracy was estimated using a random-effects model, with subgroup analyses by LLM type, language, and question format. Publication bias was assessed using Egger's regression test.</p><p><strong>Results: </strong>This systematic review identified 2,404 studies. After removing duplicates and excluding irrelevant articles through title and abstract screening, 36 studies were included after full-text review. The pooled accuracy was 72% (95% confidence interval, 70.0% to 75.0%) with high heterogeneity (I2=99%, P<0.001). Among LLMs, GPT-4 achieved the highest accuracy (81%), followed by Bing (79%), Claude (74%), Gemini/Bard (70%), and GPT-3.5 (60%) (P=0.001). Performance differences across languages (range, 62% in Polish to 77% in German) were not statistically significant (P=0.170).</p><p><strong>Conclusion: </strong>LLMs, particularly GPT-4, can match or exceed medical students' examination performance and may serve as supportive educational tools. However, due to variability and the risk of errors, they should be used cautiously as complements rather than replacements for traditional learning methods.</p>","PeriodicalId":46098,"journal":{"name":"Journal of Educational Evaluation for Health Professions","volume":"22 ","pages":"36"},"PeriodicalIF":3.7,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145542995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Comparing generative artificial intelligence platforms and nursing student performance on a women's health nursing examination in Korea: a Rasch model approach.
IF 3.7 Q1 EDUCATION, SCIENTIFIC DISCIPLINES Pub Date : 2025-01-01 Epub Date: 2025-09-05 DOI: 10.3352/jeehp.2025.22.23
Eun Jeong Ko, Tae Kyung Lee, Geum Hee Jeong

Purpose: This psychometric study aimed to compare the ability parameter estimates of generative artificial intelligence (AI) platforms with those of nursing students on a 50-item women's health nursing examination at Hallym University, Korea, using the Rasch model. It also sought to estimate item difficulty parameters and evaluate AI performance across varying difficulty levels.

Methods: The exam, consisting of 39 multiple-choice items and 11 true/false items, was administered to 111 fourth-year nursing students in June 2023. In December 2024, 6 generative AI platforms (GPT-4o, ChatGPT Free, Claude.ai, Clova X, Mistral.ai, Google Gemini) completed the same items. The responses were analyzed using the Rasch model to estimate the ability and difficulty parameters. Unidimensionality was verified by the Dimensionality Evaluation to Enumerate Contributing Traits (DETECT), and analyses were conducted using the R packages irtQ and TAM.
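
The Rasch analysis above was run in R (irtQ, TAM). As a language-neutral illustration only, the sketch below estimates a single examinee's ability under the Rasch model by Newton-Raphson maximum likelihood, given fixed item difficulties; all numbers are hypothetical.

```python
import numpy as np

def rasch_ability(responses, difficulties, n_iter=20):
    """MLE of one examinee's ability (theta, in logits) under the Rasch
    model, with item difficulties treated as known and fixed."""
    theta = 0.0
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(theta - difficulties)))  # P(correct) per item
        grad = np.sum(responses - p)        # d logL / d theta
        hess = -np.sum(p * (1 - p))         # d^2 logL / d theta^2
        theta -= grad / hess                # Newton-Raphson update
    return theta

# Hypothetical 10-item response pattern (1=correct) and difficulties (logits)
responses = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])
difficulties = np.array([-3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0])
print(f"theta = {rasch_ability(responses, difficulties):.2f} logits")
```

Note that a perfect or zero score has no finite maximum-likelihood estimate, so real implementations (such as TAM) handle those cases separately.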

Results: The items satisfied unidimensionality (DETECT=-0.16). Item difficulty parameter estimates ranged from -3.87 to 1.96 logits (mean=-0.61), with a mean difficulty index of 0.79. Examinees' ability parameter estimates ranged from -0.71 to 3.14 logits (mean=1.17). GPT-4o, ChatGPT Free, and Claude.ai outperformed the median student ability (1.09 logits), scoring 2.68, 2.34, and 2.34, respectively, while Clova X, Mistral.ai, and Google Gemini exhibited lower scores (0.20, -0.12, 0.80). The test information curve peaked below θ=0, indicating suitability for examinees with low to average ability.

Conclusion: Advanced generative AI platforms approximated the performance of high-performing students, but outcomes varied. The Rasch model effectively evaluated AI competency, supporting its potential utility for future AI performance assessments in nursing education.

{"title":"Comparing generative artificial intelligence platforms and nursing student performance on a women's health nursing examination in Korea: a Rasch model approach.","authors":"Eun Jeong Ko, Tae Kyung Lee, Geum Hee Jeong","doi":"10.3352/jeehp.2025.22.23","DOIUrl":"10.3352/jeehp.2025.22.23","url":null,"abstract":"<p><strong>Purpose: </strong>This psychometric study aimed to compare the ability parameter estimates of generative artificial intelligence (AI) platforms with those of nursing students on a 50-item women's health nursing examination at Hallym University, Korea, using the Rasch model. It also sought to estimate item difficulty parameters and evaluate AI performance across varying difficulty levels.</p><p><strong>Methods: </strong>The exam, consisting of 39 multiple-choice items and 11 true/false items, was administered to 111 fourth-year nursing students in June 2023. In December 2024, 6 generative AI platforms (GPT-4o, ChatGPT Free, Claude.ai, Clova X, Mistral.ai, Google Gemini) completed the same items. The responses were analyzed using the Rasch model to estimate the ability and difficulty parameters. Unidimensionality was verified by the Dimensionality Evaluation to Enumerate Contributing Traits (DETECT), and analyses were conducted using the R packages irtQ and TAM.</p><p><strong>Results: </strong>The items satisfied unidimensionality (DETECT=-0.16). Item difficulty parameter estimates ranged from -3.87 to 1.96 logits (mean=-0.61), with a mean difficulty index of 0.79. Examinees' ability parameter estimates ranged from -0.71 to 3.14 logits (mean=1.17). GPT-4o, ChatGPT Free, and Claude.ai outperformed the median student ability (1.09 logits), scoring 2.68, 2.34, and 2.34, respectively, while Clova X, Mistral.ai, and Google Gemini exhibited lower scores (0.20, -0.12, 0.80). The test information curve peaked below θ=0, indicating suitability for examinees with low to average ability.</p><p><strong>Conclusion: </strong>Advanced generative AI platforms approximated the performance of high-performing students, but outcomes varied. The Rasch model effectively evaluated AI competency, supporting its potential utility for future AI performance assessments in nursing education.</p>","PeriodicalId":46098,"journal":{"name":"Journal of Educational Evaluation for Health Professions","volume":"22 ","pages":"23"},"PeriodicalIF":3.7,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12770907/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145151345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Decline in attrition rates in United States pediatric residency and fellowship programs, 2007-2020: a repeated cross-sectional study.
IF 3.7 Q1 EDUCATION, SCIENTIFIC DISCIPLINES Pub Date : 2025-01-01 Epub Date: 2025-09-05 DOI: 10.3352/jeehp.2025.22.24
Emma Omoruyi, Greg Russell, Kimberly Montez

Purpose: Declining fill rates in US pediatric residency and subspecialty programs require improved trainee retention. Attrition, defined as transfers, withdrawals, dismissals, unsuccessful completions, or deaths, disrupts program function and impacts the pediatric workforce pipeline. This study aimed to evaluate attrition trends among pediatric residents and fellows in Accreditation Council for Graduate Medical Education (ACGME)-accredited programs from 2007 to 2020.

Methods: This repeated cross-sectional study analyzed publicly available ACGME Data Resource Book records. Attrition rates and 95% confidence intervals (CIs) were calculated overall and by subspecialty. Logistic regression assessed temporal changes; odds ratios (ORs) compared 2020 to 2007.
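
As a minimal sketch of the OR calculation, the code below fits a logistic regression of attrition on a 2020-versus-2007 indicator; the cohort sizes come from the abstract, but the attrition-event counts are hypothetical placeholders.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical event counts: 145 of 8,145 residents (2007) and
# 80 of 9,419 residents (2020) experiencing attrition.
y07 = np.r_[np.ones(145), np.zeros(8000)]
y20 = np.r_[np.ones(80), np.zeros(9339)]
outcome = np.r_[y07, y20]
year = np.r_[np.zeros(len(y07)), np.ones(len(y20))]  # 0=2007, 1=2020

fit = sm.Logit(outcome, sm.add_constant(year)).fit(disp=0)
odds_ratio = np.exp(fit.params[1])           # OR for 2020 vs. 2007
ci_low, ci_high = np.exp(fit.conf_int()[1])  # 95% CI on the OR scale
print(f"OR={odds_ratio:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```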

Results: From 2007 to 2020, pediatric residents increased from 8,145 to 9,419 and fellows from 2,875 to 4,279. Aggregate annual resident attrition averaged 1.71% (range, 0.93%-2.64%), and fellow attrition ranged from 12.39% to 30.87%. Transfer rates declined from 18.05 to 5.20 per 1,000 trainees (P<0.0001), withdrawals from 5.65 to 2.76 (P=0.030), and dismissals from 3.14 in 2010 to 1.27 in 2020 (P=0.0068). Odds of unsuccessful completion significantly decreased in categorical pediatrics (OR, 0.41; 95% CI, 0.29-0.58), pediatric cardiology (OR, 0.08; 95% CI, 0.01-0.64), pediatric critical care (OR, 0.14; 95% CI, 0.06-0.35), and neonatal-perinatal medicine (OR, 0.46; 95% CI, 0.20-1.08).

Conclusion: Although attrition has improved, premature trainee loss can still disrupt program operations and threaten workforce development. Attrition may reflect educational environment quality, support structures, or selection processes. Greater data transparency is needed to understand demographic trends and inform equitable retention strategies, ultimately strengthening training programs and sustaining the United States pediatric workforce.

{"title":"Decline in attrition rates in United States pediatric residency and fellowship programs, 2007-2020: a repeated cross-sectional study.","authors":"Emma Omoruyi, Greg Russell, Kimberly Montez","doi":"10.3352/jeehp.2025.22.24","DOIUrl":"10.3352/jeehp.2025.22.24","url":null,"abstract":"<p><strong>Purpose: </strong>Declining fill rates in US pediatric residency and subspecialty programs requires trainee retention. Attrition, defined as transfers, withdrawals, dismissals, unsuccessful completions, or deaths, disrupts program function and impacts the pediatric workforce pipeline. It aims to evaluate attrition trends among pediatric residents and fellows in Accreditation Council for Graduate Medical Education (ACGME)-accredited programs from 2007 to 2020.</p><p><strong>Methods: </strong>This repeated cross-sectional study analyzed publicly available ACGME Data Resource Book records. Attrition rates and 95% confidence intervals (CIs) were calculated overall and by subspecialty. Logistic regression assessed temporal changes; odds ratios (ORs) compared 2020 to 2007.</p><p><strong>Results: </strong>From 2007-2020, pediatric residents increased from 8,145 to 9,419 and fellows from 2,875 to 4,279. Aggregate annual resident attrition averaged 1.71% (range, 0.93%-2.64%), and fellow attrition ranged from 12.39%-30.87%. Transfer rates declined from 18.05 to 5.20 per 1,000 trainees (P<0.0001), withdrawals from 5.65 to 2.76 (P=0.030), and dismissals from 3.14 in 2010 to 1.27 in 2020 (P=0.0068). Odds of unsuccessful completion significantly decreased in categorical pediatrics (OR, 0.41; 95% CI, 0.29-0.58), pediatric cardiology (OR, 0.08; 95% CI, 0.01-0.64), pediatric critical care (OR, 0.14; 95% CI, 0.06-0.35), and neonatal-perinatal medicine (OR, 0.46; 95% CI, 0.20-1.08).</p><p><strong>Conclusion: </strong>Although attrition has improved, premature trainee loss can still disrupt program operations and threaten workforce development. Attrition may reflect educational environment quality, support structures, or selection processes. Greater data transparency is needed to understand demographic trends and inform equitable retention strategies, ultimately strengthening training programs and sustaining the United States pediatric workforce.</p>","PeriodicalId":46098,"journal":{"name":"Journal of Educational Evaluation for Health Professions","volume":"22 ","pages":"24"},"PeriodicalIF":3.7,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12676133/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145151426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Empirical effect of the Dr LEE Jong-wook Fellowship Program to empower sustainable change for the health workforce in Tanzania: a mixed-methods study
IF 9.3 Q1 EDUCATION, SCIENTIFIC DISCIPLINES Pub Date : 2025-01-01 Epub Date: 2025-01-20 DOI: 10.3352/jeehp.2025.22.6
Masoud Dauda, Swabaha Aidarus Yusuph, Harouni Yasini, Issa Mmbaga, Perpetua Mwambinngu, Hansol Park, Gyeongbae Seo, Kyoung Kyun Oh

Purpose: This study evaluated the Dr LEE Jong-wook Fellowship Program’s impact on Tanzania’s health workforce, focusing on relevance, effectiveness, efficiency, impact, and sustainability in addressing healthcare gaps.

Methods: A mixed-methods research design was employed. Data were collected from 97 out of 140 alumni through an online survey, 35 in-depth interviews, and one focus group discussion. The study was conducted from November to December 2023 and included alumni from 2009 to 2022. Measurement instruments included structured questionnaires for quantitative data and semi-structured guides for qualitative data. Quantitative analysis involved descriptive and inferential statistics (Spearman’s rank correlation, non-parametric tests) using Python ver. 3.11.0 and Stata ver. 14.0. Thematic analysis was employed to analyze qualitative data using NVivo ver. 12.0.
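
Since the quantitative analysis was run in Python, a Spearman rank correlation between two evaluation dimensions can be computed as in the sketch below; the scores are simulated stand-ins, not the study's data.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)

# Simulated alumni ratings (0-100) on two correlated evaluation dimensions
effectiveness = rng.normal(86, 11, size=97).clip(0, 100)
impact = (0.75 * effectiveness + rng.normal(20, 7, size=97)).clip(0, 100)

rho, p_value = spearmanr(effectiveness, impact)
print(f"Spearman rho={rho:.3f}, P={p_value:.4f}")
```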

Results: Findings indicated high relevance (mean=91.6, standard deviation [SD]=8.6), effectiveness (mean=86.1, SD=11.2), efficiency (mean=82.7, SD=10.2), and impact (mean=87.7, SD=9.9), with improved skills, confidence, and institutional service quality. However, sustainability had a lower score (mean=58.0, SD=11.1), reflecting challenges in follow-up support and resource allocation. Effectiveness strongly correlated with impact (ρ=0.746, P<0.001). The qualitative findings revealed that participants valued tailored training but highlighted barriers, such as language challenges and insufficient practical components. Alumni-led initiatives contributed to knowledge sharing, but limited resources constrained sustainability.

Conclusion: The Fellowship Program enhanced Tanzania’s health workforce capacity, but it requires localized curricula and strengthened alumni networks for sustainability. These findings provide actionable insights for improving similar programs globally, confirming the hypothesis that tailored training positively influences workforce and institutional outcomes.

{"title":"Empirical effect of the Dr LEE Jong-wook Fellowship Program to empower sustainable change for the health workforce in Tanzania: a mixed-methods study","authors":"Masoud Dauda, Swabaha Aidarus Yusuph, Harouni Yasini, Issa Mmbaga, Perpetua Mwambinngu, Hansol Park, Gyeongbae Seo, Kyoung Kyun Oh","doi":"10.3352/jeehp.2025.22.6","DOIUrl":"10.3352/jeehp.2025.22.6","url":null,"abstract":"<p><strong>Purpose: </strong>This study evaluated the Dr LEE Jong-wook Fellowship Program’s impact on Tanzania’s health workforce, focusing on relevance, effectiveness, efficiency, impact, and sustainability in addressing healthcare gaps.</p><p><strong>Methods: </strong>A mixed-methods research design was employed. Data were collected from 97 out of 140 alumni through an online survey, 35 in-depth interviews, and one focus group discussion. The study was conducted from November to December 2023 and included alumni from 2009 to 2022. Measurement instruments included structured questionnaires for quantitative data and semi-structured guides for qualitative data. Quantitative analysis involved descriptive and inferential statistics (Spearman’s rank correlation, non-parametric tests) using Python ver. 3.11.0 and Stata ver. 14.0. Thematic analysis was employed to analyze qualitative data using NVivo ver. 12.0.</p><p><strong>Results: </strong>Findings indicated high relevance (mean=91.6, standard deviation [SD]=8.6), effectiveness (mean=86.1, SD=11.2), efficiency (mean=82.7, SD=10.2), and impact (mean=87.7, SD=9.9), with improved skills, confidence, and institutional service quality. However, sustainability had a lower score (mean=58.0, SD=11.1), reflecting challenges in follow-up support and resource allocation. Effectiveness strongly correlated with impact (ρ=0.746, P<0.001). The qualitative findings revealed that participants valued tailored training but highlighted barriers, such as language challenges and insufficient practical components. Alumni-led initiatives contributed to knowledge sharing, but limited resources constrained sustainability.</p><p><strong>Conclusion: </strong>The Fellowship Program enhanced Tanzania’s health workforce capacity, but it requires localized curricula and strengthened alumni networks for sustainability. These findings provide actionable insights for improving similar programs globally, confirming the hypothesis that tailored training positively influences workforce and institutional outcomes.</p>","PeriodicalId":46098,"journal":{"name":"Journal of Educational Evaluation for Health Professions","volume":"22 ","pages":"6"},"PeriodicalIF":9.3,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12003955/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143013022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Empathy and tolerance of ambiguity in medical students and doctors participating in art-based observational training at the Rijksmuseum in Amsterdam, the Netherlands: a before-and-after study
IF 9.3 Q1 EDUCATION, SCIENTIFIC DISCIPLINES Pub Date : 2025-01-01 Epub Date: 2025-01-14 DOI: 10.3352/jeehp.2025.22.3
Stella Anna Bult, Thomas van Gulik

Purpose: This research presents an experimental study using validated questionnaires to quantitatively assess the outcomes of art-based observational training in medical students, residents, and specialists. The study tested the hypothesis that art-based observational training would lead to measurable effects on judgement skills (tolerance of ambiguity) and empathy in medical students and doctors.

Methods: An experimental cohort study with pre- and post-intervention assessments was conducted using validated questionnaires and qualitative evaluation forms to examine the outcomes of art-based observational training in medical students and doctors. Between December 2023 and June 2024, 15 art courses were conducted in the Rijksmuseum in Amsterdam. Participants were assessed on empathy using the Jefferson Scale of Empathy (JSE) and tolerance of ambiguity using the Tolerance of Ambiguity in Medical Students and Doctors (TAMSAD) scale.

Results: In total, 91 participants were included; 29 participants completed the JSE and 62 completed the TAMSAD scale. The results showed statistically significant post-test increases in mean JSE and TAMSAD scores (3.71 points on the JSE, which ranges from 20 to 140, and 1.86 points on the TAMSAD, which ranges from 0 to 100). The qualitative findings were predominantly positive.
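
The abstract does not name the significance test used for the pre/post comparison; a paired t-test is one conventional choice for such data, sketched below on simulated JSE scores.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)

# Simulated paired JSE scores (scale range 20-140) before and after training
pre = rng.normal(110, 10, size=29).clip(20, 140)
post = (pre + rng.normal(3.71, 5, size=29)).clip(20, 140)

t_stat, p_value = ttest_rel(post, pre)
print(f"mean change={np.mean(post - pre):.2f}, t={t_stat:.2f}, P={p_value:.4f}")
```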

Conclusion: The results suggest that incorporating art-based observational training in medical education improves empathy and tolerance of ambiguity. This study highlights the importance of art-based observational training in medical education in the professional development of medical students and doctors.

{"title":"Empathy and tolerance of ambiguity in medical students and doctors participating in art-based observational training at the Rijksmuseum in Amsterdam, the Netherlands: a before-and-after study","authors":"Stella Anna Bult, Thomas van Gulik","doi":"10.3352/jeehp.2025.22.3","DOIUrl":"10.3352/jeehp.2025.22.3","url":null,"abstract":"<p><strong>Purpose: </strong>This research presents an experimental study using validated questionnaires to quantitatively assess the outcomes of art-based observational training in medical students, residents, and specialists. The study tested the hypothesis that art-based observational training would lead to measurable effects on judgement skills (tolerance of ambiguity) and empathy in medical students and doctors.</p><p><strong>Methods: </strong>An experimental cohort study with pre- and post-intervention assessments was conducted using validated questionnaires and qualitative evaluation forms to examine the outcomes of art-based observational training in medical students and doctors. Between December 2023 and June 2024, 15 art courses were conducted in the Rijksmuseum in Amsterdam. Participants were assessed on empathy using the Jefferson Scale of Empathy (JSE) and tolerance of ambiguity using the Tolerance of Ambiguity in Medical Students and Doctors (TAMSAD) scale.</p><p><strong>Results: </strong>In total, 91 participants were included; 29 participants completed the JSE and 62 completed the TAMSAD scales. The results showed statistically significant post-test increases for mean JSE and TAMSAD scores (3.71 points for the JSE, ranging from 20 to 140, and 1.86 points for the TAMSAD, ranging from 0 to 100). The qualitative findings were predominantly positive.</p><p><strong>Conclusion: </strong>The results suggest that incorporating art-based observational training in medical education improves empathy and tolerance of ambiguity. This study highlights the importance of art-based observational training in medical education in the professional development of medical students and doctors.</p>","PeriodicalId":46098,"journal":{"name":"Journal of Educational Evaluation for Health Professions","volume":"22 ","pages":"3"},"PeriodicalIF":9.3,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11880821/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142980319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Comparison between GPT-4 and human raters in grading pharmacy students' exam responses in Malaysia: a cross-sectional study.
IF 3.7 Q1 EDUCATION, SCIENTIFIC DISCIPLINES Pub Date : 2025-01-01 Epub Date: 2025-07-28 DOI: 10.3352/jeehp.2025.22.20
Wuan Shuen Yap, Pui San Saw, Li Ling Yeap, Shaun Wen Huey Lee, Wei Jin Wong, Ronald Fook Seng Lee

Purpose: Manual grading is time-consuming and prone to inconsistencies, prompting the exploration of generative artificial intelligence tools such as GPT-4 to enhance efficiency and reliability. This study investigated GPT-4's potential in grading pharmacy students' exam responses, focusing on the impact of optimized prompts. Specifically, it evaluated the alignment between GPT-4 and human raters, assessed GPT-4's consistency over time, and determined its error rates in grading pharmacy students' exam responses.

Methods: We conducted a comparative study using past exam responses graded by university-trained raters and by GPT-4. Responses were randomized before evaluation by GPT-4, accessed via a Plus account between April and September 2024. Prompt optimization was performed on 16 responses, followed by evaluation of 3 prompt delivery methods. We then applied the optimized approach across 4 item types. Intraclass correlation coefficients and error analyses were used to assess consistency and agreement between GPT-4 and human ratings.
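
The abstract does not state which ICC variant was used; the sketch below computes one common choice, ICC(2,1) (two-way random effects, absolute agreement, single rater), from its ANOVA mean squares, applied to hypothetical human and GPT-4 scores.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1) for an (n targets) x (k raters) score matrix."""
    n, k = ratings.shape
    grand = ratings.mean()
    ss_rows = k * np.sum((ratings.mean(axis=1) - grand) ** 2)
    ss_cols = n * np.sum((ratings.mean(axis=0) - grand) ** 2)
    ss_error = np.sum((ratings - grand) ** 2) - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                  # mean square for targets
    msc = ss_cols / (k - 1)                  # mean square for raters
    mse = ss_error / ((n - 1) * (k - 1))     # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical marks for 8 exam responses: column 0 = human, column 1 = GPT-4
ratings = np.array([[7, 8], [5, 5], [9, 8], [6, 7],
                    [4, 4], [8, 8], [7, 6], [5, 6]], dtype=float)
print(f"ICC(2,1) = {icc_2_1(ratings):.3f}")
```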

Results: GPT-4's ratings aligned reasonably well with human raters, demonstrating moderate to excellent reliability (intraclass correlation coefficient=0.617-0.933), depending on item type and the optimized prompt. When stratified by grade bands, GPT-4 was less consistent in marking high-scoring responses (Z=-5.71 to 4.62, P<0.001). Overall, despite achieving substantial alignment with human raters in many cases, discrepancies across item types and a tendency to commit basic errors necessitate continued educator involvement to ensure grading accuracy.

Conclusion: With optimized prompts, GPT-4 shows promise as a supportive tool for grading pharmacy students' exam responses, particularly for objective tasks. However, its limitations-including errors and variability in grading high-scoring responses-require ongoing human oversight. Future research should explore advanced generative artificial intelligence models and broader assessment formats to further enhance grading reliability.

{"title":"Comparison between GPT-4 and human raters in grading pharmacy students' exam responses in Malaysia: a cross-sectional study.","authors":"Wuan Shuen Yap, Pui San Saw, Li Ling Yeap, Shaun Wen Huey Lee, Wei Jin Wong, Ronald Fook Seng Lee","doi":"10.3352/jeehp.2025.22.20","DOIUrl":"https://doi.org/10.3352/jeehp.2025.22.20","url":null,"abstract":"<p><strong>Purpose: </strong>Manual grading is time-consuming and prone to inconsistencies, prompting the exploration of generative artificial intelligence tools such as GPT-4 to enhance efficiency and reliability. This study investigated GPT-4's potential in grading pharmacy students' exam responses, focusing on the impact of optimized prompts. Specifically, it evaluated the alignment between GPT-4 and human raters, assessed GPT-4's consistency over time, and determined its error rates in grading pharmacy students' exam responses.</p><p><strong>Methods: </strong>We conducted a comparative study using past exam responses graded by university-trained raters and by GPT-4. Responses were randomized before evaluation by GPT-4, accessed via a Plus account between April and September 2024. Prompt optimization was performed on 16 responses, followed by evaluation of 3 prompt delivery methods. We then applied the optimized approach across 4 item types. Intraclass correlation coefficients and error analyses were used to assess consistency and agreement between GPT-4 and human ratings.</p><p><strong>Results: </strong>GPT-4's ratings aligned reasonably well with human raters, demonstrating moderate to excellent reliability (intraclass correlation coefficient=0.617-0.933), depending on item type and the optimized prompt. When stratified by grade bands, GPT-4 was less consistent in marking high-scoring responses (Z=-5.71-4.62, P<0.001). Overall, despite achieving substantial alignment with human raters in many cases, discrepancies across item types and a tendency to commit basic errors necessitate continued educator involvement to ensure grading accuracy.</p><p><strong>Conclusion: </strong>With optimized prompts, GPT-4 shows promise as a supportive tool for grading pharmacy students' exam responses, particularly for objective tasks. However, its limitations-including errors and variability in grading high-scoring responses-require ongoing human oversight. Future research should explore advanced generative artificial intelligence models and broader assessment formats to further enhance grading reliability.</p>","PeriodicalId":46098,"journal":{"name":"Journal of Educational Evaluation for Health Professions","volume":"22 ","pages":"20"},"PeriodicalIF":3.7,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145151398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Longitudinal relationships between Korean medical students' academic performance in medical knowledge and clinical performance examinations: a retrospective longitudinal study.
IF 3.7 Q1 EDUCATION, SCIENTIFIC DISCIPLINES Pub Date : 2025-01-01 Epub Date: 2025-06-10 DOI: 10.3352/jeehp.2025.22.18
Yulim Kang, Hae Won Kim

Purpose: This study investigated the longitudinal relationships between performance on 3 examinations assessing medical knowledge and clinical skills among Korean medical students in the clinical phase. This study addressed the stability of each examination score and the interrelationships among examinations over time.

Methods: A retrospective longitudinal study was conducted at Yonsei University College of Medicine in Korea with a cohort of 112 medical students over 2 years. The students were in their third year in 2022 and progressed to the fourth year in 2023. We obtained comprehensive clinical science examination (CCSE) and progress test (PT) scores 3 times (T1-T3), and clinical performance examination (CPX) scores twice (T1 and T2). Autoregressive cross-lagged models were fitted to analyze their relationships.
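
Autoregressive cross-lagged models are usually fitted as structural equation models; as a simplified stand-in, each cross-lagged path can be approximated by a standardized regression of a T2 score on both T1 scores. The sketch below uses simulated data, not the study's.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 112

def z(x):
    """Standardize so coefficients are comparable to standardized betas."""
    return (x - x.mean()) / x.std()

# Simulated scores at two time points
ccse_t1 = rng.normal(size=n)
pt_t1 = 0.5 * ccse_t1 + rng.normal(scale=0.9, size=n)
pt_t2 = 0.45 * ccse_t1 + 0.4 * pt_t1 + rng.normal(scale=0.7, size=n)

# Cross-lagged path: does CCSE at T1 predict PT at T2, controlling for
# the autoregressive effect of PT at T1?
X = sm.add_constant(np.column_stack([z(pt_t1), z(ccse_t1)]))
fit = sm.OLS(z(pt_t2), X).fit()
print(fit.params)  # last coefficient approximates the CCSE(T1) -> PT(T2) path
```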

Results: For each of the 3 examinations, the score at 1 time point predicted the subsequent score. Regarding cross-lagged effects, the CCSE at T1 predicted PT at T2 (β=0.472, P<0.001) and CCSE at T2 predicted PT at T3 (β=0.527, P<0.001). The CPX at T1 predicted the CCSE at T2 (β=0.163, P=0.006), and the CPX at T2 predicted the CCSE at T3 (β=0.154, P=0.006). The PT at T1 predicted the CPX at T2 (β=0.273, P=0.006).

Conclusion: The study identified each examination's stability and the complexity of the longitudinal relationships between them. These findings may help predict medical students' performance on subsequent examinations, potentially informing the provision of necessary student support.

{"title":"Longitudinal relationships between Korean medical students' academic performance in medical knowledge and clinical performance examinations: a retrospective longitudinal study.","authors":"Yulim Kang, Hae Won Kim","doi":"10.3352/jeehp.2025.22.18","DOIUrl":"10.3352/jeehp.2025.22.18","url":null,"abstract":"<p><strong>Purpose: </strong>This study investigated the longitudinal relationships between performance on 3 examinations assessing medical knowledge and clinical skills among Korean medical students in the clinical phase. This study addressed the stability of each examination score and the interrelationships among examinations over time.</p><p><strong>Methods: </strong>A retrospective longitudinal study was conducted at Yonsei University College of Medicine in Korea with a cohort of 112 medical students over 2 years. The students were in their third year in 2022 and progressed to the fourth year in 2023. We obtained comprehensive clinical science examination (CCSE) and progress test (PT) scores 3 times (T1-T3), and clinical performance examination (CPX) scores twice (T1 and T2). Autoregressive cross-lagged models were fitted to analyze their relationships.</p><p><strong>Results: </strong>For each of the 3 examinations, the score at 1 time point predicted the subsequent score. Regarding cross-lagged effects, the CCSE at T1 predicted PT at T2 (β=0.472, P<0.001) and CCSE at T2 predicted PT at T3 (β=0.527, P<0.001). The CPX at T1 predicted the CCSE at T2 (β=0.163, P=0.006), and the CPX at T2 predicted the CCSE at T3 (β=0.154, P=0.006). The PT at T1 predicted the CPX at T2 (β=0.273, P=0.006).</p><p><strong>Conclusion: </strong>The study identified each examination's stability and the complexity of the longitudinal relationships between them. These findings may help predict medical students' performance on subsequent examinations, potentially informing the provision of necessary student support.</p>","PeriodicalId":46098,"journal":{"name":"Journal of Educational Evaluation for Health Professions","volume":"22 ","pages":"18"},"PeriodicalIF":3.7,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12365683/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144267588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Validity of the formative physical therapy Student and Clinical Instructor Performance Instrument in the United States: a quasi-experimental, time-series study.
IF 3.7 Q1 EDUCATION, SCIENTIFIC DISCIPLINES Pub Date : 2025-01-01 Epub Date: 2025-09-26 DOI: 10.3352/jeehp.2025.22.26
Sean Gallivan, Jamie Bayliss

Purpose: The aim of this study was to assess the validity of the Student and Clinical Instructor Performance Instrument (SCIPAI), a novel formative tool used in physical therapist education to assess student and clinical instructor (CI) performance throughout clinical education experiences (CEEs). The researchers hypothesized that the SCIPAI would demonstrate concurrent, predictive, and construct validity while offering additional contemporary validity evidence.

Methods: This quasi-experimental, time-series study had 811 student-CI pairs complete 2 SCIPAIs, one before and one after the CEE midpoint, and an endpoint Clinical Performance Instrument (CPI), across beginning-to-terminal CEEs over a 1-year period. Spearman rank correlation analyses used final SCIPAI and CPI like-item scores to assess concurrent validity, and earlier SCIPAI and final CPI like-item scores to assess predictive validity. Construct validity was assessed via progression of student and CI performance scores within CEEs using Wilcoxon signed-rank testing. No randomization/grouping of subjects occurred.
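
A minimal sketch of the Wilcoxon signed-rank comparison of paired within-CEE ratings, on simulated 5-point SCIPAI scores.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(7)

# Simulated paired ratings on SCIPAI 1 and SCIPAI 4 (1-5 scale)
scipai_1 = rng.integers(2, 5, size=50)
scipai_4 = np.clip(scipai_1 + rng.integers(0, 2, size=50), 1, 5)

# Zero differences are dropped by default (zero_method="wilcox")
stat, p_value = wilcoxon(scipai_4, scipai_1)
print(f"W={stat:.1f}, P={p_value:.4f}")
```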

Results: Moderate correlation existed between like final SCIPAI and CPI items (P<0.005) and between some like items of earlier SCIPAIs and final CPIs (P<0.005). Student performance scores demonstrated progress from SCIPAIs 1 to 4 within CEEs (P<0.005). Although more CIs showed progression than regression in performance from SCIPAI 1 to SCIPAI 4, the larger magnitude of the performance decreases produced an aggregate decline in CI performance ratings (P<0.005).

Conclusion: The SCIPAI demonstrates concurrent, predictive, and construct validity when used by students and CIs to rate student performance at regular points throughout clinical education experiences.

{"title":"Validity of the formative physical therapy Student and Clinical Instructor Performance Instrument in the United States: a quasi-experimental, time-series study.","authors":"Sean Gallivan, Jamie Bayliss","doi":"10.3352/jeehp.2025.22.26","DOIUrl":"10.3352/jeehp.2025.22.26","url":null,"abstract":"<p><strong>Purpose: </strong>The aim of this study was to assess the validity of the Student and Clinical Instructor Performance Instrument (SCIPAI), a novel formative tool used in physical therapist education to assess student and clinical instructor (CI) performance throughout clinical education experiences (CEEs). The researchers hypothesized that the SCIPAI would demonstrate concurrent, predictive, and construct validity while offering additional contemporary validity evidence.</p><p><strong>Methods: </strong>This quasi-experimental, time-series study had 811 student-CI pairs complete 2 SCIPAIs before after CEE midpoint, and an endpoint Clinical Performance Instrument (CPI) during beginning to terminal CEEs in a 1-year period. Spearman rank correlation analyses used final SCIPAI and CPI like-item scores to assess concurrent validity; and earlier SCIPAI and final CPI like-item scores to assess predictive validity. Construct validity was assessed via progression of student and CI performance scores within CEEs using Wilcoxon signed-rank testing. No randomization/grouping of subjects occurred.</p><p><strong>Results: </strong>Moderate correlation existed between like final SCIPAI and CPI items (P<0.005) and between some like items of earlier SCIPAIs and final CPIs (P<0.005). Student performance scores demonstrated progress from SCIPAIs 1 to 4 within CEEs (P<0.005). While a greater number of CIs demonstrated progression rather than regression in performance from SCIPAI 1 to SCIPAI 4, the greater magnitude of decreases in CI performance contributed to an aggregate ratings decrease of CI performance (P<0.005).</p><p><strong>Conclusion: </strong>The SCIPAI demonstrates concurrent, predictive, and construct validity when used by students and CIs to rate student performance at regular points throughout clinical education experiences.</p>","PeriodicalId":46098,"journal":{"name":"Journal of Educational Evaluation for Health Professions","volume":"22 ","pages":"26"},"PeriodicalIF":3.7,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12688320/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145150958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Correlation between task-based checklists and global rating scores in undergraduate objective structured clinical examinations in Saudi Arabia: a 1-year comparative study.
IF 3.7 Q1 EDUCATION, SCIENTIFIC DISCIPLINES Pub Date : 2025-01-01 Epub Date: 2025-06-19 DOI: 10.3352/jeehp.2025.22.19
Uzma Khan, Yasir Naseem Khan

Purpose: This study investigated the correlation between task-based checklist scores and global rating scores (GRS) in objective structured clinical examinations (OSCEs) for fourth-year undergraduate medical students and aimed to determine whether both methods can be reliably used in a standard setting.

Methods: A comparative observational study was conducted at Al Rayan College of Medicine, Saudi Arabia, involving 93 fourth-year students during the 2023-2024 academic year. OSCEs from 2 General Practice courses were analyzed, each comprising 10 stations assessing clinical competencies. Students were scored using both task-specific checklists and a holistic 5-point GRS. Reliability was evaluated using Cronbach's α, and the relationship between the 2 scoring methods was assessed using the coefficient of determination (R²). Ethical approval and informed consent were obtained.
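
Cronbach's α has a simple closed form, so it can be computed directly from the student-by-station score matrix, as in the sketch below (hypothetical scores).

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (examinees x items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item/station
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical OSCE scores (0-10) for 6 students across 4 stations
scores = np.array([[8, 7, 9, 8],
                   [5, 6, 5, 6],
                   [9, 9, 8, 9],
                   [6, 5, 7, 6],
                   [7, 8, 7, 7],
                   [4, 5, 4, 5]])
print(f"alpha = {cronbach_alpha(scores):.3f}")
```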

Results: The mean OSCE score was 76.7 in Course 1 (Cronbach's α=0.85) and 73.0 in Course 2 (Cronbach's α=0.81). R² values varied by station and competency. Strong correlations were observed in procedural and management skills (R² up to 0.87), while weaker correlations appeared in history-taking stations (R² as low as 0.35). The variability across stations highlighted the context-dependence of alignment between checklist and GRS methods.

Conclusion: Both checklists and GRS exhibit reliable psychometric properties. Their combined use improves validity in OSCE scoring, but station-specific application is recommended. Checklists may anchor pass/fail decisions, while GRS may assist in assessing borderline performance. This hybrid model increases fairness and reflects clinical authenticity in competency-based assessment.

{"title":"Correlation between task-based checklists and global rating scores in undergraduate objective structured clinical examinations in Saudi Arabia: a 1-year comparative study.","authors":"Uzma Khan, Yasir Naseem Khan","doi":"10.3352/jeehp.2025.22.19","DOIUrl":"10.3352/jeehp.2025.22.19","url":null,"abstract":"<p><strong>Purpose: </strong>This study investigated the correlation between task-based checklist scores and global rating scores (GRS) in objective structured clinical examinations (OSCEs) for fourth-year undergraduate medical students and aimed to determine whether both methods can be reliably used in a standard setting.</p><p><strong>Methods: </strong>A comparative observational study was conducted at Al Rayan College of Medicine, Saudi Arabia, involving 93 fourth-year students during the 2023-2024 academic year. OSCEs from 2 General Practice courses were analyzed, each comprising 10 stations assessing clinical competencies. Students were scored using both task-specific checklists and holistic 5-point GRS. Reliability was evaluated using Cronbach's α, and the relationship between the 2 scoring methods was assessed using the coefficient of determination (R2). Ethical approval and informed consent were obtained.</p><p><strong>Results: </strong>The mean OSCE score was 76.7 in Course 1 (Cronbach's α=0.85) and 73.0 in Course 2 (Cronbach's α=0.81). R2 values varied by station and competency. Strong correlations were observed in procedural and management skills (R2 up to 0.87), while weaker correlations appeared in history-taking stations (R2 as low as 0.35). The variability across stations highlighted the context-dependence of alignment between checklist and GRS methods.</p><p><strong>Conclusion: </strong>Both checklists and GRS exhibit reliable psychometric properties. Their combined use improves validity in OSCE scoring, but station-specific application is recommended. Checklists may anchor pass/fail decisions, while GRS may assist in assessing borderline performance. This hybrid model increases fairness and reflects clinical authenticity in competency-based assessment.</p>","PeriodicalId":46098,"journal":{"name":"Journal of Educational Evaluation for Health Professions","volume":"22 ","pages":"19"},"PeriodicalIF":3.7,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12365684/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144776471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Leveraging feedback mechanisms to improve the quality of objective structured clinical examinations in Singapore: an exploratory action research study.
IF 3.7 Q1 EDUCATION, SCIENTIFIC DISCIPLINES Pub Date : 2025-01-01 Epub Date: 2025-09-30 DOI: 10.3352/jeehp.2025.22.28
Han Ting Jillian Yeo, Dujeepa Dasharatha Samarasekera, Michael Dean

Purpose: Variability in examiner scoring threatens the fairness and reliability of objective structured clinical examinations (OSCEs). While examiner standardization exists, there is currently no structured, psychometric-informed, individualized feedback mechanism for examiners. This study explored the feasibility and perceived value of such a mechanism using an action research approach to co-design and iteratively refine examiner feedback reports.

Methods: Two exploratory cycles were conducted between November 2023 and June 2024 with phase 4 OSCE examiners at the Yong Loo Lin School of Medicine. In cycle 1, psychometric analyses of examiner scoring for a phase 4 OSCE informed the design of individualized reports, which were evaluated through interviews. Revisions were made to the format of the report and implemented in cycle 2, where examiner responses were again collected. Data were analyzed thematically, supported by reflective logs and field notes.

Results: Nine examiners participated in cycle 1 and 7 in cycle 2. In cycle 1, examiners highlighted challenges in interpreting complex terminology, leading to report refinements such as glossaries and visual graphs. In cycle 2, examiners demonstrated greater confidence in applying feedback, requested longitudinal reports, and shifted from initial resistance to reflective engagement. Across cycles, the reports improved credibility, neutrality, and examiner self-regulation.

Conclusion: This exploratory study suggests that psychometric-informed feedback reports can facilitate examiner reflection and transparency in OSCEs. While the findings highlight feasibility and examiner acceptance, longitudinal delivery of feedback, collection of quantitative outcome data, and larger samples are needed to establish whether such reports improve scoring consistency and assessment fairness.

{"title":"Leveraging feedback mechanisms to improve the quality of objective structured clinical examinations in Singapore: an exploratory action research study.","authors":"Han Ting Jillian Yeo, Dujeepa Dasharatha Samarasekera, Michael Dean","doi":"10.3352/jeehp.2025.22.28","DOIUrl":"10.3352/jeehp.2025.22.28","url":null,"abstract":"<p><strong>Purpose: </strong>Variability in examiner scoring threatens the fairness and reliability of objective structured clinical examinations (OSCEs). While examiner standardization exists, there is currently no structured, psychometric-informed, individualized feedback mechanism for examiners. This study explored the feasibility and perceived value of such a mechanism using an action research approach to co-design and iteratively refine examiner feedback reports.</p><p><strong>Methods: </strong>Two exploratory cycles were conducted between November 2023 and June 2024 with phase 4 OSCE examiners at the Yong Loo Lin School of Medicine. In cycle 1, psychometric analyses of examiner scoring for a phase 4 OSCE informed the design of individualized reports, which were evaluated through interviews. Revisions were made to the format of the report and implemented in cycle 2, where examiner responses were again collected. Data were analyzed thematically, supported by reflective logs and field notes.</p><p><strong>Results: </strong>Nine examiners participated in cycle 1 and 7 in cycle 2. In cycle 1, examiners highlighted challenges in interpreting complex terminology, leading to report refinements such as glossaries and visual graphs. In cycle 2, examiners demonstrated greater confidence in applying feedback, requested longitudinal reports, and shifted from initial resistance to reflective engagement. Across cycles, the reports improved credibility, neutrality, and examiner self-regulation.</p><p><strong>Conclusion: </strong>This exploratory study suggests that psychometric-informed feedback reports can facilitate examiner reflection and transparency in OSCEs. While the findings highlight feasibility and examiner acceptance, longitudinal delivery of feedback, collection of quantitative outcome data, and larger samples are needed to establish whether such reports improve scoring consistency and assessment fairness.</p>","PeriodicalId":46098,"journal":{"name":"Journal of Educational Evaluation for Health Professions","volume":"22 ","pages":"28"},"PeriodicalIF":3.7,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12768547/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145193038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0