Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential
Authors: Reema Mahmoud, Amir Shuster, Shlomi Kleinman, Shimrit Arbel, Clariel Ianculovici, Oren Peleg
Journal: Journal of Oral and Maxillofacial Surgery (impact factor 2.3; JCR Q2, Dentistry, Oral Surgery & Medicine)
Publication type: Journal Article
Published: 2024-11-19
DOI: 10.1016/j.joms.2024.11.007 (https://doi.org/10.1016/j.joms.2024.11.007)
Citations: 0
Abstract
Background: While artificial intelligence has significantly impacted medicine, the application of large language models (LLMs) in oral and maxillofacial surgery (OMS) remains underexplored.
Purpose: This study aimed to measure and compare the accuracy of 4 leading LLMs on OMS board examination questions and to identify specific areas for improvement.
Study design, setting, and sample: An in-silico cross-sectional study was conducted to evaluate 4 artificial intelligence chatbots on 714 OMS board examination questions.
Predictor variable: The predictor variable was the LLM used: LLM 1 (Generative Pretrained Transformer 4o [GPT-4o], OpenAI, San Francisco, CA), LLM 2 (Generative Pretrained Transformer 3.5 [GPT-3.5], OpenAI, San Francisco, CA), LLM 3 (Gemini, Google, Mountain View, CA), and LLM 4 (Copilot, Microsoft, Redmond, WA).
Main outcome variables: The primary outcome variable was accuracy, defined as the percentage of correct answers provided by each LLM. Secondary outcomes included the LLMs' ability to correct errors on subsequent attempts and their performance across 11 specific OMS subject domains: medicine and anesthesia, dentoalveolar and implant surgery, maxillofacial trauma, maxillofacial infections, maxillofacial pathology, salivary glands, oncology, maxillofacial reconstruction, temporomandibular joint anatomy and pathology, craniofacial and clefts, and orthognathic surgery.
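The accuracy metric defined above (percentage of correct answers, overall and per subject domain) can be sketched as follows. This is a minimal illustration, not the authors' scoring code: the record layout, the helper names `accuracy` and `accuracy_by_domain`, and the sample answers are all hypothetical.

```python
from collections import defaultdict


def accuracy(graded):
    """Return the percentage of correct answers in a list of booleans."""
    if not graded:
        raise ValueError("no graded answers")
    return 100.0 * sum(graded) / len(graded)


def accuracy_by_domain(records):
    """Group (domain, is_correct) pairs and score each domain separately."""
    buckets = defaultdict(list)
    for domain, is_correct in records:
        buckets[domain].append(is_correct)
    return {domain: accuracy(flags) for domain, flags in buckets.items()}


# Hypothetical graded answers for one chatbot (NOT data from the study).
records = [
    ("maxillofacial trauma", True),
    ("maxillofacial trauma", False),
    ("salivary glands", True),
    ("salivary glands", True),
]

overall = accuracy([ok for _, ok in records])
per_domain = accuracy_by_domain(records)
```

Scoring each domain separately is what exposes the kind of between-domain variability that the secondary outcomes are meant to capture.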
Covariates: No additional covariates were considered.
Analyses: Statistical analysis included one-way ANOVA and post hoc Tukey honest significant difference (HSD) to compare performance across chatbots. χ² tests were used to assess response consistency and error correction, with statistical significance set at P < .05.
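For readers unfamiliar with the test statistics named above, the sketch below computes a one-way ANOVA F statistic and a Pearson χ² statistic for a 2×2 table using only the Python standard library. It is an illustration under stated assumptions, not the authors' analysis: the function names and all numbers are hypothetical, the χ² form shown omits any continuity correction, and the abstract does not say which software or table layout was used.

```python
def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA over lists of scores."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical accuracy scores (percent) for three chatbots; NOT study data.
chatbot_a = [80.0, 85.0, 90.0]
chatbot_b = [60.0, 65.0, 70.0]
chatbot_c = [62.0, 67.0, 72.0]
f_stat = one_way_anova_f([chatbot_a, chatbot_b, chatbot_c])

# Hypothetical counts of corrected vs. uncorrected errors for two chatbots.
chi2_stat = chi_square_2x2(10, 20, 20, 10)
```

Turning these statistics into P values requires the F and χ² distributions. In practice a statistics package would be used; for example, SciPy provides `scipy.stats.f_oneway`, `scipy.stats.tukey_hsd`, and `scipy.stats.chi2_contingency`.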
Results: LLM 1 achieved the highest accuracy with an average score of 83.69%, statistically significantly outperforming LLM 3 (66.85%, P = .002), LLM 2 (64.83%, P = .001), and LLM 4 (62.18%, P < .001). Across the 11 OMS subject domains, LLM 1 consistently had the highest accuracy rates. LLM 1 also corrected 98.2% of errors, while LLM 2 corrected 93.44%, both statistically significantly higher than LLM 4 (29.26%) and LLM 3 (70.71%) (P < .001).
Conclusion and relevance: LLM 1 (GPT-4o) significantly outperformed other models in both accuracy and error correction, indicating its strong potential as a tool for enhancing OMS education. However, the variability in performance across different domains highlights the need for ongoing refinement and continued evaluation to integrate these LLMs more effectively into the OMS field.
About the journal:
This monthly journal offers comprehensive coverage of new techniques, important developments and innovative ideas in oral and maxillofacial surgery. Practice-applicable articles help develop the methods used to handle dentoalveolar surgery, facial injuries and deformities, TMJ disorders, oral cancer, jaw reconstruction, anesthesia and analgesia. The journal also includes specifics on new instruments and diagnostic equipment and modern therapeutic drugs and devices. Journal of Oral and Maxillofacial Surgery is recommended for first or priority subscription by the Dental Section of the Medical Library Association.