Large Language Models in Dental Licensing Examinations: Systematic Review and Meta-Analysis

IF 3.7 | Zone 3 (Medicine) | Q1 Dentistry, Oral Surgery & Medicine | International Dental Journal | Pub Date: 2025-02-01 | Epub Date: 2024-11-12 | DOI: 10.1016/j.identj.2024.10.014

Mingxin Liu, Tsuyoshi Okuhara, Wenbo Huang, Atsushi Ogihara, Hikari Sophia Nagao, Hiroko Okada, Takahiro Kiuchi

International Dental Journal, volume 75, issue 1, pages 213-222. Journal Article. https://www.sciencedirect.com/science/article/pii/S0020653924015685

Citation count: 0

Abstract

Introduction and aims

This systematic review and meta-analysis evaluates the performance of various large language models (LLMs) on dental licensing examinations worldwide. The aim is to assess the accuracy of these models across linguistic and geographical contexts and thereby inform their potential application in dental education and diagnostics.

Methods

Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, we conducted a comprehensive search of PubMed, Web of Science, and Scopus for studies published from 1 January 2022 to 1 May 2024. Two authors independently screened the literature against the inclusion and exclusion criteria, extracted data, and assessed study quality with the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool. We conducted qualitative and quantitative analyses to evaluate the performance of LLMs.

Results

Eleven studies met the inclusion criteria, encompassing dental licensing examinations from eight countries. GPT-3.5, GPT-4, and Bard achieved integrated accuracy rates of 54%, 72%, and 56%, respectively. GPT-4 outperformed GPT-3.5 and Bard, passing more than half of the dental licensing examinations. Subgroup analyses and meta-regression showed that GPT-3.5 performed significantly better in English-speaking countries. GPT-4’s performance, however, remained consistent across different regions.
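
The "integrated accuracy rates" above are pooled estimates across studies. The abstract does not state which pooling model the authors used, so the following is only a minimal sketch under stated assumptions: a DerSimonian-Laird random-effects model applied to logit-transformed proportions. The function name `pool_accuracy` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def pool_accuracy(studies):
    """Pool per-study accuracies with a DerSimonian-Laird random-effects model.

    studies: list of (correct, total) pairs, one per examination.
    Returns the pooled accuracy as a proportion in (0, 1).
    This is an assumed method, not necessarily the one used in the paper.
    """
    # Logit-transform each proportion. A 0.5 continuity correction is always
    # applied so the transform stays finite for 0% or 100% accuracy.
    y, v = [], []
    for correct, total in studies:
        hits = correct + 0.5
        misses = total - correct + 0.5
        y.append(math.log(hits / misses))
        v.append(1.0 / hits + 1.0 / misses)

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    w_fixed = [1.0 / vi for vi in v]
    sum_w = sum(w_fixed)
    y_fixed = sum(wi * yi for wi, yi in zip(w_fixed, y)) / sum_w
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w_fixed, y))

    # DerSimonian-Laird between-study variance, truncated at zero.
    k = len(studies)
    c_term = sum_w - sum(wi ** 2 for wi in w_fixed) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c_term) if k > 1 and c_term > 0 else 0.0

    # Random-effects weights, pooled logit, and back-transformation.
    w_random = [1.0 / (vi + tau2) for vi in v]
    pooled_logit = sum(wi * yi for wi, yi in zip(w_random, y)) / sum(w_random)
    return 1.0 / (1.0 + math.exp(-pooled_logit))


# Hypothetical counts for illustration only (not data from the review).
hypothetical = [(60, 100), (70, 100), (50, 100)]
print(f"pooled accuracy: {pool_accuracy(hypothetical):.3f}")
```

A random-effects model is a common choice when examinations differ in language, country, and question format, because it allows the true accuracy to vary between studies instead of assuming a single shared value.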

Conclusion

LLMs, particularly GPT-4, show potential in dental education and diagnostics, yet their accuracy remains below the threshold required for clinical application. Insufficient dentistry-specific training data has limited LLMs’ accuracy, and the reliance on image-based diagnostics presents a further challenge. As a result, their accuracy on dental licensing examinations is lower than on medical licensing examinations. Moreover, LLMs provide more detailed explanations for incorrect answers than for correct ones. Overall, current LLMs are not yet suitable for use in dental education and clinical diagnosis.

Source journal

International Dental Journal (Medicine: Dentistry and Oral Surgery)

CiteScore: 4.80

Self-citation rate: 6.10%

Articles published: 159

Review time: 63 days

Journal description: The International Dental Journal features peer-reviewed, scientific articles relevant to international oral health issues, as well as practical, informative articles aimed at clinicians.