Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential.

Journal of Oral and Maxillofacial Surgery (IF 2.3; Zone 3, Medicine; JCR Q2, Dentistry, Oral Surgery & Medicine). Pub Date: 2024-11-19. DOI: 10.1016/j.joms.2024.11.007
Reema Mahmoud, Amir Shuster, Shlomi Kleinman, Shimrit Arbel, Clariel Ianculovici, Oren Peleg
Citations: 0

Abstract

Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential.

Background: While artificial intelligence has significantly impacted medicine, the application of large language models (LLMs) in oral and maxillofacial surgery (OMS) remains underexplored.

Purpose: This study aimed to measure and compare the accuracy of 4 leading LLMs on OMS board examination questions and to identify specific areas for improvement.

Study design, setting, and sample: An in-silico cross-sectional study was conducted to evaluate 4 artificial intelligence chatbots on 714 OMS board examination questions.

Predictor variable: The predictor variable was the LLM used: LLM 1 (Generative Pretrained Transformer 4o [GPT-4o], OpenAI, San Francisco, CA), LLM 2 (Generative Pretrained Transformer 3.5 [GPT-3.5], OpenAI, San Francisco, CA), LLM 3 (Gemini, Google, Mountain View, CA), and LLM 4 (Copilot, Microsoft, Redmond, WA).

Main outcome variables: The primary outcome variable was accuracy, defined as the percentage of correct answers provided by each LLM. Secondary outcomes included the LLMs' ability to correct errors on subsequent attempts and their performance across 11 specific OMS subject domains: medicine and anesthesia, dentoalveolar and implant surgery, maxillofacial trauma, maxillofacial infections, maxillofacial pathology, salivary glands, oncology, maxillofacial reconstruction, temporomandibular joint anatomy and pathology, craniofacial and clefts, and orthognathic surgery.
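
The accuracy definition above (percentage of correct answers, overall and per subject domain) can be sketched as follows. This is a minimal illustration, not the authors' scoring procedure; the function names and the toy graded answers are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def accuracy_by_domain(graded):
    """Return {domain: percent correct} for one model.

    `graded` is an iterable of (domain, is_correct) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for domain, is_correct in graded:
        totals[domain] += 1
        correct[domain] += int(is_correct)
    return {d: 100.0 * correct[d] / totals[d] for d in totals}


def overall_accuracy(graded):
    """Return the percentage of correct answers over all questions."""
    graded = list(graded)
    return 100.0 * sum(int(ok) for _, ok in graded) / len(graded)


# Hypothetical toy data: four graded answers from one imaginary model.
toy = [
    ("maxillofacial trauma", True),
    ("maxillofacial trauma", False),
    ("oncology", True),
    ("oncology", True),
]
per_domain = accuracy_by_domain(toy)
overall = overall_accuracy(toy)
print(per_domain, overall)
```

Keeping the per-domain and overall figures separate matters because a mean of domain percentages and a pooled percentage over all questions can differ when domains contain different numbers of questions.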

Covariates: No additional covariates were considered.

Analyses: Statistical analysis included one-way ANOVA and post hoc Tukey honest significant difference (HSD) to compare performance across chatbots. χ² tests were used to assess response consistency and error correction, with statistical significance set at P < .05.
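
As a rough illustration of the two test statistics named above, the sketch below computes a one-way ANOVA F statistic and a Pearson χ² statistic using only the Python standard library. It is an assumption-laden sketch, not the authors' analysis: the helper names and all numbers are hypothetical, and converting the statistics to P values (and running the Tukey HSD step) would require a statistics package.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = mean(x for g in groups for x in g)
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def chi_square_stat(table):
    """Return (chi2, dof) for a contingency table given as rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical per-domain accuracy scores (%) for three imaginary models.
scores = [[80, 85, 90], [60, 65, 70], [55, 65, 75]]
f_stat, df_b, df_w = one_way_anova_f(scores)

# Hypothetical counts: rows are models, columns are corrected / not corrected.
chi2, dof = chi_square_stat([[30, 10], [10, 30]])
print(f_stat, df_b, df_w, chi2, dof)
```

A large F indicates that differences between model means are large relative to the spread within each model; the χ² statistic plays the same role for counts such as corrected versus uncorrected errors.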

Results: LLM 1 achieved the highest accuracy with an average score of 83.69%, statistically significantly outperforming LLM 3 (66.85%, P = .002), LLM 2 (64.83%, P = .001), and LLM 4 (62.18%, P < .001). Across the 11 OMS subject domains, LLM 1 consistently had the highest accuracy rates. LLM 1 also corrected 98.2% of errors, while LLM 2 corrected 93.44%, both statistically significantly higher than LLM 4 (29.26%) and LLM 3 (70.71%) (P < .001).

Conclusion and relevance: LLM 1 (GPT-4o) significantly outperformed other models in both accuracy and error correction, indicating its strong potential as a tool for enhancing OMS education. However, the variability in performance across different domains highlights the need for ongoing refinement and continued evaluation to integrate these LLMs more effectively into the OMS field.

Source journal
Journal of Oral and Maxillofacial Surgery (Medicine: Dentistry and Oral Surgery)
CiteScore: 4.00
Self-citation rate: 5.30%
Articles published: 0
Review time: 41 days
Journal description: This monthly journal offers comprehensive coverage of new techniques, important developments and innovative ideas in oral and maxillofacial surgery. Practice-applicable articles help develop the methods used to handle dentoalveolar surgery, facial injuries and deformities, TMJ disorders, oral cancer, jaw reconstruction, anesthesia and analgesia. The journal also includes specifics on new instruments and diagnostic equipment and modern therapeutic drugs and devices. Journal of Oral and Maxillofacial Surgery is recommended for first or priority subscription by the Dental Section of the Medical Library Association.