Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential.

IF 2.3 3区 医学 Q2 DENTISTRY, ORAL SURGERY & MEDICINE Journal of Oral and Maxillofacial Surgery Pub Date : 2024-11-19 DOI:10.1016/j.joms.2024.11.007
Reema Mahmoud, Amir Shuster, Shlomi Kleinman, Shimrit Arbel, Clariel Ianculovici, Oren Peleg
{"title":"Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential.","authors":"Reema Mahmoud, Amir Shuster, Shlomi Kleinman, Shimrit Arbel, Clariel Ianculovici, Oren Peleg","doi":"10.1016/j.joms.2024.11.007","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>While artificial intelligence has significantly impacted medicine, the application of large language models (LLMs) in oral and maxillofacial surgery (OMS) remains underexplored.</p><p><strong>Purpose: </strong>This study aimed to measure and compare the accuracy of 4 leading LLMs on OMS board examination questions and to identify specific areas for improvement.</p><p><strong>Study design, setting, and sample: </strong>An in-silico cross-sectional study was conducted to evaluate 4 artificial intelligence chatbots on 714 OMS board examination questions.</p><p><strong>Predictor variable: </strong>The predictor variable was the LLM used - LLM 1 (Generative Pretrained Transformer 4o [GPT-4o], OpenAI, San Francisco, CA), LLM 2 (Generative Pretrained Transformer 3.5 [GPT-3.5], OpenAI, San Francisco, CA), LLM 3 (Gemini, Google, Mountain View, CA), and LLM 4 (Copilot, Microsoft, Redmond, WA).</p><p><strong>Main outcome variables: </strong>The primary outcome variable was accuracy, defined as the percentage of correct answers provided by each LLM. Secondary outcomes included the LLMs' ability to correct errors on subsequent attempts and their performance across 11 specific OMS subject domains: medicine and anesthesia, dentoalveolar and implant surgery, maxillofacial trauma, maxillofacial infections, maxillofacial pathology, salivary glands, oncology, maxillofacial reconstruction, temporomandibular joint anatomy and pathology, craniofacial and clefts, and orthognathic surgery.</p><p><strong>Covariates: </strong>No additional covariates were considered.</p><p><strong>Analyses: </strong>Statistical analysis included one-way ANOVA and post hoc Tukey honest significant difference (HSD) to compare performance across chatbots. χ<sup>2</sup> tests were used to assess response consistency and error correction, with statistical significance set at P < .05.</p><p><strong>Results: </strong>LLM 1 achieved the highest accuracy with an average score of 83.69%, statistically significantly outperforming LLM 3 (66.85%, P = .002), LLM 2 (64.83%, P = .001), and LLM 4 (62.18%, P < .001). Across the 11 OMS subject domains, LLM 1 consistently had the highest accuracy rates. LLM 1 also corrected 98.2% of errors, while LLM 2 corrected 93.44%, both statistically significantly higher than LLM 4 (29.26%) and LLM 3 (70.71%) (P < .001).</p><p><strong>Conclusion and relevance: </strong>LLM 1 (GPT-4o) significantly outperformed other models in both accuracy and error correction, indicating its strong potential as a tool for enhancing OMS education. However, the variability in performance across different domains highlights the need for ongoing refinement and continued evaluation to integrate these LLMs more effectively into the OMS field.</p>","PeriodicalId":16612,"journal":{"name":"Journal of Oral and Maxillofacial Surgery","volume":" ","pages":""},"PeriodicalIF":2.3000,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Oral and Maxillofacial Surgery","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.joms.2024.11.007","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
引用次数: 0

Abstract

Background: While artificial intelligence has significantly impacted medicine, the application of large language models (LLMs) in oral and maxillofacial surgery (OMS) remains underexplored.

Purpose: This study aimed to measure and compare the accuracy of 4 leading LLMs on OMS board examination questions and to identify specific areas for improvement.

Study design, setting, and sample: An in-silico cross-sectional study was conducted to evaluate 4 artificial intelligence chatbots on 714 OMS board examination questions.

Predictor variable: The predictor variable was the LLM used - LLM 1 (Generative Pretrained Transformer 4o [GPT-4o], OpenAI, San Francisco, CA), LLM 2 (Generative Pretrained Transformer 3.5 [GPT-3.5], OpenAI, San Francisco, CA), LLM 3 (Gemini, Google, Mountain View, CA), and LLM 4 (Copilot, Microsoft, Redmond, WA).

Main outcome variables: The primary outcome variable was accuracy, defined as the percentage of correct answers provided by each LLM. Secondary outcomes included the LLMs' ability to correct errors on subsequent attempts and their performance across 11 specific OMS subject domains: medicine and anesthesia, dentoalveolar and implant surgery, maxillofacial trauma, maxillofacial infections, maxillofacial pathology, salivary glands, oncology, maxillofacial reconstruction, temporomandibular joint anatomy and pathology, craniofacial and clefts, and orthognathic surgery.

Covariates: No additional covariates were considered.

Analyses: Statistical analysis included one-way ANOVA and post hoc Tukey honest significant difference (HSD) to compare performance across chatbots. χ2 tests were used to assess response consistency and error correction, with statistical significance set at P < .05.

Results: LLM 1 achieved the highest accuracy with an average score of 83.69%, statistically significantly outperforming LLM 3 (66.85%, P = .002), LLM 2 (64.83%, P = .001), and LLM 4 (62.18%, P < .001). Across the 11 OMS subject domains, LLM 1 consistently had the highest accuracy rates. LLM 1 also corrected 98.2% of errors, while LLM 2 corrected 93.44%, both statistically significantly higher than LLM 4 (29.26%) and LLM 3 (70.71%) (P < .001).

Conclusion and relevance: LLM 1 (GPT-4o) significantly outperformed other models in both accuracy and error correction, indicating its strong potential as a tool for enhancing OMS education. However, the variability in performance across different domains highlights the need for ongoing refinement and continued evaluation to integrate these LLMs more effectively into the OMS field.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
求助全文
约1分钟内获得全文 去求助
来源期刊
Journal of Oral and Maxillofacial Surgery
Journal of Oral and Maxillofacial Surgery 医学-牙科与口腔外科
CiteScore
4.00
自引率
5.30%
发文量
0
审稿时长
41 days
期刊介绍: This monthly journal offers comprehensive coverage of new techniques, important developments and innovative ideas in oral and maxillofacial surgery. Practice-applicable articles help develop the methods used to handle dentoalveolar surgery, facial injuries and deformities, TMJ disorders, oral cancer, jaw reconstruction, anesthesia and analgesia. The journal also includes specifics on new instruments and diagnostic equipment and modern therapeutic drugs and devices. Journal of Oral and Maxillofacial Surgery is recommended for first or priority subscription by the Dental Section of the Medical Library Association.
期刊最新文献
Does Varying Platelet-Rich Fibrin Centri̇fugati̇on Protocols Enhance New Bone Formati̇on in Extracti̇on Site? Fluorescence Visualization-Guided Surgery Improves Local Control for Mandibular Squamous Cell Carcinoma. Do Postoperative Surgeon Phone Calls Improve Outcomes Following Mandibular Fracture Repair? Geographic Trends in the Oral and Maxillofacial Surgery Residency Match. What is the Minimal Perceptible Change for the Dimensional Alteration of Facial Structures in the Frontal View?
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1