Background
The term "artificial intelligence," first introduced in the 1950s, has reached a turning point following recent advances. As chatbot capabilities continue to evolve, it becomes important to ensure that the information they provide remains reliable and accurate.
Objectives
To assess the reliability, accuracy, and readability of the responses provided by three chatbots (ChatGPT 4.0, Google Gemini 1.5, and Claude 3.5 Sonnet) to inquiries about orthodontic-periodontal (ortho-perio) treatment procedures.
Methods
Sixteen questions regarding treatments related to orthodontics and periodontics were presented to the three chatbots, and their responses were collected. Two orthodontists and one periodontist assessed the answers using multiple instruments: the modified DISCERN tool for reliability, a Likert scale and the Accuracy of Information Index (AOI) for accuracy, and the Flesch Reading Ease Score (FRES) and Flesch-Kincaid Grade Level (FKGL) for readability.
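For reference, the two readability indices can be restated under the conventional Flesch and Flesch-Kincaid definitions (the abstract itself does not reproduce the formulas):

$$\mathrm{FRES} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)$$

$$\mathrm{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59$$

Higher FRES values indicate easier text, with scores of roughly 30-50 conventionally interpreted as college-level, difficult material; FKGL maps the same word- and sentence-length statistics onto a U.S. school grade level.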
Results
On the Likert scale, the highest average score was given to ChatGPT (4.9), followed by Claude (4.8) and Gemini (4.5); the scores of ChatGPT and Claude were each statistically significantly higher than those of Gemini (p = 0.003 and p = 0.015, respectively). On the Accuracy of Information Index (AOI), ChatGPT had the highest average score (8.8), followed by Claude (8.8) and Gemini (8.1); the scores of ChatGPT and Claude were statistically significantly higher than those of Gemini (p = 0.003 and p = 0.005, respectively). On the modified DISCERN tool, the highest average score was observed for Gemini (34.8), and both ChatGPT and Claude differed statistically significantly from Gemini (p = 0.001). According to the FRES and FKGL, all three chatbots produced college-level texts that were difficult to read.
Conclusions
Regarding questions about ortho-perio treatment procedures, Gemini demonstrated the greatest reliability, whereas ChatGPT and Claude received higher accuracy scores than Gemini.