Performance of large language models in the National Dental Licensing Examination in China: a comparative analysis of ChatGPT, GPT-4, and New Bing

International Journal of Computerized Dentistry · Published 2024-12-09 · DOI: 10.3290/j.ijcd.b5870240
Impact factor 1.8 · Zone 4 (Medicine) · JCR Q2, Dentistry, Oral Surgery & Medicine
Ziyang Hu, Zhe Xu, Ping Shi, Dandan Zhang, Qu Yue, Jiexia Zhang, Xin Lei, Zitong Lin
Volume 27, Issue 4, pages 401-411
Citations: 0

Abstract


Aim: The objective of the present study was to investigate the clinical understanding and reasoning abilities of three large language models (LLMs), namely ChatGPT, GPT-4, and New Bing, by evaluating their performance on China's National Dental Licensing Examination (NDLE).

Materials and methods: Questions from the 2020 to 2022 NDLE were selected according to subject weightings. Standardized prompts were used to constrain the LLMs' output and obtain more precise answers. The performance of each model in each subject category and overall was analyzed using McNemar's test.
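McNemar's test fits this design because every model answers the same questions: each question yields a pair of correct/incorrect outcomes, and only the discordant pairs (one model right, the other wrong) bear on whether two models differ. The abstract does not say whether the exact binomial form or the chi-square approximation (with or without continuity correction) was used, so this minimal Python sketch implements both. The function names and the ten toy grades are hypothetical illustrations, not the study's code or data.

```python
import math
from typing import Sequence


def discordant_counts(a: Sequence[bool], b: Sequence[bool]) -> tuple[int, int]:
    """Count paired disagreements between two models graded on the same questions.

    Returns (n10, n01): n10 = questions model A got right and model B got wrong,
    n01 = the reverse. Concordant pairs do not enter McNemar's statistic.
    """
    if len(a) != len(b):
        raise ValueError("both models must be graded on the same questions")
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if y and not x)
    return n10, n01


def mcnemar_exact(n10: int, n01: int) -> float:
    """Two-sided exact McNemar p-value: binomial test of n10 vs n01 with p = 0.5."""
    n = n10 + n01
    if n == 0:
        return 1.0
    k = min(n10, n01)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def mcnemar_chi2(n10: int, n01: int, correction: bool = True) -> tuple[float, float]:
    """Asymptotic McNemar statistic (1 df) and p-value, optionally continuity-corrected."""
    n = n10 + n01
    if n == 0:
        return 0.0, 1.0
    diff = abs(n10 - n01)
    if correction:
        diff = max(diff - 1, 0)
    stat = diff ** 2 / n
    # Chi-square survival function with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p


if __name__ == "__main__":
    # Hypothetical grades for two models on ten shared questions (NOT study data).
    model_a = [True, True, False, True, True, False, True, True, True, False]
    model_b = [True, False, False, False, True, False, False, True, False, False]
    n10, n01 = discordant_counts(model_a, model_b)
    print(n10, n01)                   # 4 0
    print(mcnemar_exact(n10, n01))    # 0.125
    stat, p = mcnemar_chi2(n10, n01)
    print(f"{stat:.2f} {p:.4f}")      # 2.25 0.1336
```

Questions that both models answer correctly, or both miss, carry no information about which model is better, which is why only `n10` and `n01` appear in either statistic.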

Results: ChatGPT, GPT-4, and New Bing scored 42.6% (138/324), 63.0% (204/324), and 72.5% (235/324), respectively. New Bing's performance differed significantly from that of both ChatGPT and GPT-4. GPT-4 and New Bing outperformed ChatGPT in every subject, and New Bing surpassed GPT-4 in most subjects.
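The reported percentages follow directly from the raw counts over the 324 questions, which this short Python check reproduces. The hypothetical helper `discordant_range` adds one caveat: the totals fix the difference between the two discordant counts (n10 - n01 = k_a - k_b) but not their sum, so the McNemar comparisons cannot be recomputed from these totals without the per-question results.

```python
TOTAL_QUESTIONS = 324

# Correct answers per model, as reported in the abstract.
CORRECT = {"ChatGPT": 138, "GPT-4": 204, "New Bing": 235}


def score_line(model: str, k: int, n: int = TOTAL_QUESTIONS) -> str:
    """Format a model's accuracy as 'name: xx.x% (k/n)'."""
    return f"{model}: {100 * k / n:.1f}% ({k}/{n})"


def discordant_range(k_a: int, k_b: int, n: int) -> tuple[int, int]:
    """Feasible range of n10 + n01 for two models given only their totals.

    n10 = questions A got right and B got wrong; n01 = the reverse.
    The totals fix n10 - n01 = k_a - k_b but not how many questions are discordant.
    """
    # Fewest discordant questions: every disagreement favours the better model.
    lo = abs(k_a - k_b)
    # Most discordant questions: n10 <= min(k_a, n - k_b), n01 <= min(k_b, n - k_a).
    # The two caps differ by exactly k_a - k_b, so both can be reached together.
    hi = min(k_a, n - k_b) + min(k_b, n - k_a)
    return lo, hi


if __name__ == "__main__":
    for model, k in CORRECT.items():
        print(score_line(model, k))
    # ChatGPT: 42.6% (138/324)
    # GPT-4: 63.0% (204/324)
    # New Bing: 72.5% (235/324)

    print(discordant_range(CORRECT["New Bing"], CORRECT["GPT-4"], TOTAL_QUESTIONS))
    # (31, 209)
```

For New Bing versus GPT-4, the totals are consistent with anywhere from 31 to 209 discordant questions, so the reported significance rests on per-question agreement data that the abstract does not include.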

Conclusion: GPT-4 and New Bing showed promising capabilities on the NDLE. However, their performance in specific subjects, such as prosthodontics and oral and maxillofacial surgery, requires improvement. This gap can be attributed to limited dental training data and the inherent complexity of these subjects.

Source journal
International Journal of Computerized Dentistry
Subject area: Dentistry (miscellaneous)
CiteScore: 2.90
Self-citation rate: 0.00%
Articles published: 49
Journal description: This journal explores the many innovations in the emerging field of computerized dentistry and how to integrate them into clinical practice. Most of the journal is devoted to the science of computer-assisted dentistry, with research articles and clinical reports on all aspects of computer-based diagnostic and therapeutic applications, with special emphasis on CAD/CAM and image-processing systems. Articles also address the use of computer-based communication to support patient care, assess the quality of care, and enhance clinical decision-making. The journal is published in a bilingual format, with each issue offering three types of articles: science-based, application-based, and national society reports.