How well do large language model-based chatbots perform in oral and maxillofacial radiology?

IF 2.9; Zone 2 (Medicine); JCR Q1, Dentistry, Oral Surgery & Medicine; Dentomaxillofacial Radiology; Pub Date: 2024-09-01; DOI: 10.1093/dmfr/twae021
Hui Jeong, Sang-Sun Han, Youngjae Yu, Saejin Kim, Kug Jin Jeon
PDF (PubMed Central): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11358622/pdf/
Citations: 0

Abstract

Objectives: This study evaluated the performance of four large language model (LLM)-based chatbots by comparing their test results with those of dental students on an oral and maxillofacial radiology examination.

Methods: ChatGPT, ChatGPT Plus, Bard, and Bing Chat were tested on 52 questions from regular dental college examinations. These questions were categorized into three educational content areas: basic knowledge, imaging and equipment, and image interpretation. They were also classified as multiple-choice questions (MCQs) and short-answer questions (SAQs). The accuracy rates of the chatbots were compared with the performance of students, and further analysis was conducted based on the educational content and question type.

Results: The students' overall accuracy rate was 81.2%, while that of the chatbots varied: 50.0% for ChatGPT, 65.4% for ChatGPT Plus, 50.0% for Bard, and 63.5% for Bing Chat. ChatGPT Plus achieved a higher accuracy rate for basic knowledge than the students (93.8% vs. 78.7%). However, all chatbots performed poorly in image interpretation, with accuracy rates below 35.0%. All chatbots scored less than 60.0% on MCQs, but performed better on SAQs.

Conclusions: The performance of chatbots in oral and maxillofacial radiology was unsatisfactory. Further training using specific, relevant data derived solely from reliable sources is required. Additionally, the validity of these chatbots' responses must be meticulously verified.

Advances in knowledge: This study is the first in the field of oral and maxillofacial radiology to evaluate the knowledge level of four chatbots. Given the chatbots' unsatisfactory performance, we recommend further training in this field for all of them.
Source journal
CiteScore: 5.60
Self-citation rate: 9.10%
Articles published: 65
Review time: 4-8 weeks
Journal description: Dentomaxillofacial Radiology (DMFR) is the journal of the International Association of Dentomaxillofacial Radiology (IADMFR) and covers the closely related fields of oral radiology and head and neck imaging. Established in 1972, DMFR is a key resource that keeps dentists, radiologists, clinicians, and scientists with an interest in head and neck imaging abreast of important research and developments in oral and maxillofacial radiology. The DMFR editorial board features a panel of international experts, including Editor-in-Chief Professor Ralf Schulze, who provide their expertise and guidance in shaping the content and direction of the journal.
Quick Facts:
- 2015 Impact Factor: 1.919
- Receipt to first decision: average of 3 weeks
- Acceptance to online publication: average of 3 weeks
- Open access option
- ISSN: 0250-832X
- eISSN: 1476-542X
Latest articles from this journal
- A novel method for measuring the direction and angle of Central ray and predicting rotation center via panorama phantom.
- Automatic classification and segmentation of multiclass jaw lesions in cone-beam CT using deep learning.
- An artificial intelligence grading system of apical periodontitis in cone-beam computed tomography data.
- In vitro accuracy of ultra-low dose cone-beam CT for detection of proximal caries.
- Hypervigilance to pain and sleep quality are confounding variables in the infrared thermography examination of the temporomandibular joint and temporal and masseter muscles.