Background: Large Language Models (LLMs) are transforming clinical decision-making by offering rapid, context-aware access to evidence-based knowledge. However, their efficacy in pediatric dentistry remains underexplored, especially across multiple LLM platforms.
Objective: To comparatively evaluate the clinical quality, readability, and originality of responses generated by nine contemporary LLMs to pediatric dental queries.
Material and methods: A cross-sectional study assessed the performance of ChatGPT-3.5, ChatGPT-4o, Gemini 2.0, Gemini 2.5, Claude 3.5 Haiku, Claude 3.7 Sonnet, Grok-3, Grok-3 Mini, and DeepSeek-V3. Twenty pediatric dental questions were posed in one-shot queries to each LLM. Responses were evaluated by ten pediatric dental experts using the Modified Global Quality Scale (MGQS), Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), and Turnitin Similarity Index. ANOVA and Cohen's Kappa were used for statistical analysis.
Results: ChatGPT-4o demonstrated the highest overall MGQS (4.28 ± 0.24), followed by ChatGPT-3.5 (3.45 ± 0.27), while DeepSeek-V3 scored lowest (2.18 ± 0.19). Across topics, ChatGPT-4o consistently outperformed the other models in all subdomains. FRES and FKGL scores indicated moderate readability, with the Claude models exhibiting the highest linguistic complexity. Turnitin analysis revealed low-to-moderate similarity across models. Inter-rater agreement was substantial (κ = 0.78).
Conclusions: Among the evaluated LLMs, ChatGPT-4o exhibited superior performance in clinical relevance, coherence, and originality, suggesting its potential utility as an adjunct in pediatric dental decision-making. Nonetheless, variability across models underscores the need for critical appraisal and cautious integration into clinical workflows.
Key words: Artificial intelligence, Clinical decision support, Health communication, Large language models, Natural language processing.
