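Comparative Analysis of the Response Accuracies of Large Language Models in the Korean National Dental Hygienist Examination Across Korean and English Questions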

International Journal of Dental Hygiene | IF 1.4 | Zone 4 (Medicine) | JCR Q3, Dentistry, Oral Surgery & Medicine | Pub Date: 2024-10-16 | DOI: 10.1111/idh.12848

Eun Sun Song, Seung-Pyo Lee

Volume 23, issue 2, pages 267-276. Publication type: Journal Article.

Citations: 0

Abstract


Comparative Analysis of the Response Accuracies of Large Language Models in the Korean National Dental Hygienist Examination Across Korean and English Questions

Introduction

Large language models such as Gemini, GPT-3.5, and GPT-4 have demonstrated significant potential in the medical field. Their performance in medical licensing examinations globally has highlighted their capabilities in understanding and processing specialized medical knowledge. This study aimed to evaluate and compare the performance of Gemini, GPT-3.5, and GPT-4 in the Korean National Dental Hygienist Examination. The accuracy of answering the examination questions in both Korean and English was assessed.

Methods

This study used a dataset comprising questions from the Korean National Dental Hygienist Examination over 5 years (2019–2023). A two-way analysis of variance (ANOVA) test was employed to investigate the impacts of model type and language on the accuracy of the responses. Questions were input into each model under standardized conditions, and responses were classified as correct or incorrect based on predefined criteria.
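The two-way ANOVA named in the Methods can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes a balanced design in which each model-by-language cell holds the same number of replicate accuracy values (for example, one per examination year, which the abstract does not confirm). The function name `two_way_anova`, the generic factor labels, and the toy numbers are all hypothetical.

```python
from itertools import product
from statistics import mean


def two_way_anova(cells):
    """Balanced two-way ANOVA with interaction (illustrative sketch).

    `cells` maps (level_a, level_b) -> list of replicate observations.
    Every combination must be present with the same number (>= 2) of
    replicates. Returns sums of squares, degrees of freedom, mean squares
    and F ratios; p-values are not computed here.
    """
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    keys = list(product(a_levels, b_levels))
    na, nb = len(a_levels), len(b_levels)
    if na < 2 or nb < 2:
        raise ValueError("each factor needs at least 2 levels")
    if any(k not in cells for k in keys):
        raise ValueError("every factor combination needs observations")
    n = len(cells[keys[0]])
    if n < 2 or any(len(cells[k]) != n for k in keys):
        raise ValueError("design must be balanced with >= 2 replicates per cell")

    # Grand mean, marginal means for each factor, and per-cell means.
    grand = mean(x for k in keys for x in cells[k])
    a_mean = {a: mean(x for b in b_levels for x in cells[(a, b)]) for a in a_levels}
    b_mean = {b: mean(x for a in a_levels for x in cells[(a, b)]) for b in b_levels}
    cell_mean = {k: mean(cells[k]) for k in keys}

    # Partition the total variation into main effects, interaction and error.
    ss = {
        "factor_a": nb * n * sum((a_mean[a] - grand) ** 2 for a in a_levels),
        "factor_b": na * n * sum((b_mean[b] - grand) ** 2 for b in b_levels),
        "interaction": n * sum(
            (cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
            for a, b in keys
        ),
        "error": sum((x - cell_mean[k]) ** 2 for k in keys for x in cells[k]),
    }
    df = {
        "factor_a": na - 1,
        "factor_b": nb - 1,
        "interaction": (na - 1) * (nb - 1),
        "error": na * nb * (n - 1),
    }
    table = {name: {"ss": ss[name], "df": df[name], "ms": ss[name] / df[name]}
             for name in ss}
    for name in ("factor_a", "factor_b", "interaction"):
        table[name]["F"] = table[name]["ms"] / table["error"]["ms"]
    return table


if __name__ == "__main__":
    # Toy placeholder numbers only; these are NOT accuracies from the study.
    toy = {
        ("model_A", "ko"): [1.0, 3.0], ("model_A", "en"): [2.0, 4.0],
        ("model_B", "ko"): [5.0, 7.0], ("model_B", "en"): [6.0, 8.0],
    }
    for name, row in two_way_anova(toy).items():
        print(name, row)
```

Here `factor_a` would correspond to model type and `factor_b` to language. Turning each F ratio into a p-value needs the F distribution, which a statistics library can supply; in practice a packaged ANOVA routine is the safer choice, especially for unbalanced data.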

Results

GPT-4 consistently outperformed the other models, achieving the highest accuracy rates across both language versions annually. In particular, it showed superior performance in English, suggesting advancements in its training algorithms for language processing. However, all models demonstrated variable accuracies in subjects with localized characteristics, such as health and medical law.

Conclusions

These findings indicate that GPT-4 holds significant promise for application in medical education and standardized testing, especially in English. However, the variability in performance across different subjects and languages underscores the need for ongoing improvements and the inclusion of more diverse and localized training datasets to enhance the models' effectiveness in multilingual and multicultural contexts.


Source journal

International Journal of Dental Hygiene (Dentistry, Oral Surgery & Medicine)

CiteScore

4.00

Self-citation rate

8.30%

Articles published

Review time

>12 weeks

About the journal: International Journal of Dental Hygiene is the official scientific peer-reviewed journal of the International Federation of Dental Hygienists (IFDH). The journal brings the latest scientific news, high quality commissioned reviews as well as clinical, professional and educational developmental and legislative news to the profession world-wide. Thus, it acts as a forum for exchange of relevant information and enhancement of the profession with the purpose of promoting oral health for patients and communities. The aim of the International Journal of Dental Hygiene is to provide a forum for exchange of scientific knowledge in the field of oral health and dental hygiene. A further aim is to support and facilitate the application of new knowledge into clinical practice. The journal welcomes original research, reviews and case reports as well as clinical, professional, educational and legislative news to the profession world-wide.