Artificial intelligence (AI) chatbots powered by large language models (LLMs) have become increasingly common educational tools in healthcare. Although the use of LLMs for guidance on emergency dental trauma is gaining popularity, it is crucial to assess their reliability. This study aimed to compare the reliability of different LLMs in answering questions related to dental trauma.
In a cross-sectional observational study conducted in October 2024, 30 questions (10 multiple-choice, 10 fill-in-the-blank, and 10 dichotomous) based on the International Association of Dental Traumatology guidelines were posed over nine consecutive days to five LLMs: ChatGPT 4, ChatGPT 3.5, Copilot Free version (Copilot F), Copilot Pro (Copilot P), and Google Gemini. The responses (1,350 in total) were recorded in binary format and analyzed using Pearson's chi-square and Fisher's exact tests to assess correctness and consistency (p < 0.05).
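The statistical comparison can be illustrated with a minimal sketch, assuming Python with SciPy: correct/incorrect counts per model are cross-tabulated and compared with Pearson's chi-square test, with Fisher's exact test used for 2x2 comparisons. The model names and counts below are hypothetical placeholders, not the study's data.

```python
# A minimal sketch (not from the study) of the kind of analysis described
# above: comparing correct/incorrect counts across models with Pearson's
# chi-square test, and a 2x2 comparison with Fisher's exact test.
# All model names and counts are illustrative placeholders.
from scipy.stats import chi2_contingency, fisher_exact

# Rows: models; columns: [correct, incorrect] over 30 questions each.
observed = {
    "Model A": [24, 6],
    "Model B": [21, 9],
    "Model C": [15, 15],
}

# Overall comparison of correct-response rates across models.
table = [observed[m] for m in observed]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.3f}")

# Pairwise 2x2 comparison where expected cell counts may be small,
# using Fisher's exact test.
odds_ratio, p_pair = fisher_exact([observed["Model A"], observed["Model C"]])
print(f"Fisher's exact: odds ratio = {odds_ratio:.2f}, p = {p_pair:.3f}")
```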
The answers provided by the LLMs to questions repeated on consecutive days showed a high degree of repeatability. Although there was no statistically significant difference in correct-answer rates among the LLMs (p > 0.05), the models ranked as follows by success rate: ChatGPT 3.5 (76.7%) > Copilot P (73.3%) > Copilot F (70%) > ChatGPT 4 (63.3%) > Gemini (46.7%). ChatGPT 3.5, ChatGPT 4, and Gemini achieved significantly higher correct-response rates on multiple-choice and fill-in-the-blank questions than on dichotomous (true/false) questions (p < 0.05). Conversely, the Copilot models did not exhibit significant differences across question types. Notably, the explanations provided by Copilot and Gemini were often inaccurate, and Copilot's cited references had low evidential value.
While LLMs show potential as adjunct educational tools in dental traumatology, their variable accuracy and inclusion of unreliable references call for careful integration strategies.


