How well do large language model-based chatbots perform in oral and maxillofacial radiology?

Dentomaxillofacial Radiology. IF 2.9; Zone 2 (Medicine); JCR Q1 (Dentistry, Oral Surgery & Medicine). Pub Date: 2024-09-01. DOI: 10.1093/dmfr/twae021

Hui Jeong, Sang-Sun Han, Youngjae Yu, Saejin Kim, Kug Jin Jeon

Pages: 390-395. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11358622/pdf/


Abstract

Objectives: This study evaluated the performance of four large language model (LLM)-based chatbots by comparing their test results with those of dental students on an oral and maxillofacial radiology examination.

Methods: ChatGPT, ChatGPT Plus, Bard, and Bing Chat were tested on 52 questions from regular dental college examinations. These questions were categorized into three educational content areas: basic knowledge, imaging and equipment, and image interpretation. They were also classified as multiple-choice questions (MCQs) and short-answer questions (SAQs). The accuracy rates of the chatbots were compared with the performance of students, and further analysis was conducted based on the educational content and question type.
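The grouped comparison described in the Methods can be sketched as a simple tally. This is a minimal illustration only, assuming each answer is graded as simply correct or incorrect; the record layout, the function name `accuracy_by`, and the sample rows are hypothetical and are not data from the study.

```python
from collections import defaultdict

# Hypothetical graded records: (content_area, question_type, is_correct).
# These rows are made up for illustration; they are NOT the study's data.
SAMPLE_RECORDS = [
    ("basic knowledge", "MCQ", True),
    ("basic knowledge", "SAQ", True),
    ("imaging and equipment", "MCQ", False),
    ("imaging and equipment", "SAQ", True),
    ("image interpretation", "MCQ", False),
    ("image interpretation", "SAQ", False),
]


def accuracy_by(records, key_index):
    """Return {group: accuracy in percent}, grouped by the chosen field."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[key_index]
        totals[group] += 1
        correct[group] += int(record[2])
    return {g: round(100.0 * correct[g] / totals[g], 1) for g in totals}


by_area = accuracy_by(SAMPLE_RECORDS, 0)  # grouped by educational content area
by_type = accuracy_by(SAMPLE_RECORDS, 1)  # grouped by question type (MCQ/SAQ)
```

The same records feed both breakdowns, so one grading pass supports the analysis by content area and by question type.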

Results: The students' overall accuracy rate was 81.2%, while that of the chatbots varied: 50.0% for ChatGPT, 65.4% for ChatGPT Plus, 50.0% for Bard, and 63.5% for Bing Chat. ChatGPT Plus achieved a higher accuracy rate for basic knowledge than the students (93.8% vs. 78.7%). However, all chatbots performed poorly in image interpretation, with accuracy rates below 35.0%. All chatbots scored less than 60.0% on MCQs, but performed better on SAQs.
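As a rough consistency check on the overall figures, the reported percentages can be mapped back to whole numbers of correct answers out of 52. This sketch assumes every question carries equal weight and is scored as right or wrong, which the abstract does not state; the names `REPORTED` and `implied_correct` are illustrative.

```python
TOTAL_QUESTIONS = 52  # number of questions, from the Methods

# Overall accuracy rates (percent) as reported in the Results.
REPORTED = {"ChatGPT": 50.0, "ChatGPT Plus": 65.4, "Bard": 50.0, "Bing Chat": 63.5}
STUDENT_ACCURACY = 81.2


def implied_correct(percent, total=TOTAL_QUESTIONS):
    """Nearest whole number of correct answers matching a reported percentage."""
    return round(percent * total / 100)


for name, pct in REPORTED.items():
    n = implied_correct(pct)
    recomputed = round(100 * n / TOTAL_QUESTIONS, 1)  # should reproduce pct
    gap = round(STUDENT_ACCURACY - pct, 1)            # percentage points
    print(f"{name}: {n}/{TOTAL_QUESTIONS} = {recomputed}% ({gap} points below students)")
```

Under these assumptions, the reported rates correspond to 26, 34, 26, and 33 correct answers for ChatGPT, ChatGPT Plus, Bard, and Bing Chat, respectively. The students' 81.2% is an average over many examinees, so it has no single whole-number equivalent.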

Conclusions: The performance of chatbots in oral and maxillofacial radiology was unsatisfactory. Further training using specific, relevant data derived solely from reliable sources is required. Additionally, the validity of these chatbots' responses must be meticulously verified.


Advances in knowledge: This study is the first to evaluate the knowledge of four chatbots in the field of oral and maxillofacial radiology. Given the chatbots' unsatisfactory performance, we recommend further training of all chatbots in this field.


Source journal: Dentomaxillofacial Radiology (Medicine: Nuclear Medicine)

CiteScore: 5.60

Self-citation rate: 9.10%

Review time: 4-8 weeks

Journal description: Dentomaxillofacial Radiology (DMFR) is the journal of the International Association of Dentomaxillofacial Radiology (IADMFR) and covers the closely related fields of oral radiology and head and neck imaging. Established in 1972, DMFR is a key resource that keeps dentists, radiologists, clinicians, and scientists with an interest in head and neck imaging abreast of important research and developments in oral and maxillofacial radiology. The DMFR editorial board features a panel of international experts, including Editor-in-Chief Professor Ralf Schulze, who provide expertise and guidance in shaping the content and direction of the journal.

Quick Facts:
- 2015 Impact Factor: 1.919
- Receipt to first decision: average of 3 weeks
- Acceptance to online publication: average of 3 weeks
- Open access option
- ISSN: 0250-832X
- eISSN: 1476-542X