Introduction: Artificial intelligence (AI), particularly large language models (LLMs), is transforming healthcare education and clinical decision-making. While models such as ChatGPT and Claude have demonstrated utility in medical contexts, their performance in dental diagnostics remains underexplored, and emerging platforms such as Manus have yet to be evaluated.
Objective: To compare the diagnostic accuracy and consistency of ChatGPT, Claude, and Manus using authentic, case-based dental scenarios.
Methods: A set of 117 multiple-choice questions based on validated clinical dental vignettes spanning various specialities was administered to each model under standardised conditions at two separate time points. Responses were scored against expert-validated answer keys. Inter-rater reliability was assessed using Cohen's kappa, and statistical comparisons were made using the chi-square, McNemar, and t-tests.
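The analysis workflow described above can be illustrated with a minimal sketch. This is not the authors' analysis code; the binary correct/incorrect vectors below are hypothetical placeholders, and the library calls (scikit-learn's cohen_kappa_score, statsmodels' mcnemar) are assumed substitutes for whatever software the study actually used.

```python
# Illustrative sketch: intra-model consistency (Cohen's kappa) and
# paired accuracy comparison (McNemar's test) on hypothetical data.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_items = 117  # number of case-based questions

# Hypothetical per-question correctness (1 = correct, 0 = incorrect).
model_a_run1 = rng.integers(0, 2, n_items)
model_a_run2 = rng.integers(0, 2, n_items)
model_b_run2 = rng.integers(0, 2, n_items)

# Intra-model consistency between the two testing time points.
kappa = cohen_kappa_score(model_a_run1, model_a_run2)
print(f"Cohen's kappa (model A, run 1 vs run 2): {kappa:.3f}")

# McNemar's test for paired accuracy difference between two models
# on the same questions: 2x2 table of agreement/disagreement counts.
table = np.array([
    [np.sum((model_a_run2 == 1) & (model_b_run2 == 1)),
     np.sum((model_a_run2 == 1) & (model_b_run2 == 0))],
    [np.sum((model_a_run2 == 0) & (model_b_run2 == 1)),
     np.sum((model_a_run2 == 0) & (model_b_run2 == 0))],
])
result = mcnemar(table, exact=True)
print(f"McNemar p-value: {result.pvalue:.3f}")
```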
Results: Claude and Manus consistently outperformed ChatGPT across both testing phases. In the second round, Claude and Manus achieved a diagnostic accuracy of 92.3%, compared to ChatGPT's 76.9%. Claude and Manus also demonstrated higher intra-model consistency (Cohen's kappa = 0.714 and 0.782, respectively) than ChatGPT (kappa = 0.560). Although the numerical trends favoured Claude and Manus, pairwise differences in accuracy did not reach statistical significance.
Conclusion: Claude and Manus demonstrated numerically higher diagnostic performance and greater response stability compared with ChatGPT; however, these differences did not reach statistical significance and should therefore be interpreted cautiously. This variability across models highlights the need for larger-scale evaluations. These findings underscore the importance of considering both accuracy and consistency when selecting AI tools for integration into dental practice and curricula.
