Purpose: The present study aimed to compare the performance of seven AI chatbots (ChatGPT 3.5, 4, 4o, Gemini, Claude 3, Copilot, and Llama) with that of U.S. prosthodontic residents on the National Prosthodontic Resident Examination (NPRE) from 2011 to 2023.
Materials and methods: NPRE exam files and answer keys were obtained from the American College of Prosthodontists (ACP). AI-generated responses were collected using standardized prompts that requested only letter answer choices. Each chatbot's first response was recorded and compared to the official exam key. Performance was evaluated as the percentage of correct responses, and results were compared with the average scores of human participants. Statistical analysis included repeated measures ANOVA and Bonferroni post-hoc comparisons.
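For illustration only, the sketch below shows one way the described analysis could be reproduced: scoring letter-choice responses against an answer key, then running a repeated measures ANOVA with Bonferroni-corrected pairwise comparisons. It is not the authors' actual pipeline; the scores are random placeholders and the column names and scoring helper are hypothetical.

```python
# Minimal sketch, assuming per-exam percentage-correct scores for each group.
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multitest import multipletests


def percent_correct(responses, answer_key):
    """Score a list of letter answers against the official exam key."""
    correct = sum(r == k for r, k in zip(responses, answer_key))
    return 100.0 * correct / len(answer_key)


# Placeholder data: one percentage-correct value per group per exam year.
rng = np.random.default_rng(0)
years = list(range(2011, 2024))
groups = ["Residents", "ChatGPT 3.5", "ChatGPT 4", "ChatGPT 4o",
          "Gemini", "Claude 3", "Copilot", "Llama"]
long_df = pd.DataFrame(
    [{"year": y, "group": g, "score": rng.uniform(50, 90)}
     for y in years for g in groups]
)

# Repeated measures ANOVA: exam year acts as the "subject", group as the
# within-subject factor (every group answered every exam).
anova = AnovaRM(long_df, depvar="score", subject="year", within=["group"]).fit()
print(anova.anova_table)

# Post-hoc pairwise paired t-tests, Bonferroni-corrected.
pairs = list(combinations(groups, 2))
pvals = []
for a, b in pairs:
    sa = long_df.loc[long_df["group"] == a].sort_values("year")["score"].values
    sb = long_df.loc[long_df["group"] == b].sort_values("year")["score"].values
    pvals.append(ttest_rel(sa, sb).pvalue)
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
for (a, b), p, sig in zip(pairs, p_adj, reject):
    print(f"{a} vs {b}: adjusted p = {p:.3f}{' *' if sig else ''}")
```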
Results: Statistically significant differences were found among the groups (p<0.001). ChatGPT 3.5 achieved the highest scores, followed by Claude 3 and Llama, while Copilot performed significantly worse than the residents. The remaining chatbots performed at levels comparable to the human participants, although variability across AI models was notable.
Discussion: The findings suggest that AI chatbots, particularly ChatGPT 3.5, can perform at or above the level of prosthodontic residents on standardized examinations. However, differences in chatbot capabilities and training data, together with inconsistent performance, raise concerns about reliability. While AI has potential as a supplementary learning tool, its use should be guided by critical evaluation.
Conclusions: AI chatbots demonstrate adequate performance on specialized academic exams, but variability across models highlights the need for further assessment before integration into dental education and professional practice.