Comparison of ChatGPT-4, Copilot, Bard and Gemini Ultra on an Otolaryngology Question Bank.
Rashi Ramchandani, Eddie Guo, Michael Mostowy, Jason Kreutz, Nick Sahlollbey, Michele M Carr, Janet Chung, Lisa Caulley
Clinical Otolaryngology. Journal Article. Published 2025-03-13. DOI: 10.1111/coa.14302 (https://doi.org/10.1111/coa.14302)
Abstract
Objective: To compare the performance of Google Bard, Microsoft Copilot, GPT-4 with vision (GPT-4) and Gemini Ultra on the OTO Chautauqua, a student-created, faculty-reviewed otolaryngology question bank.
Study design: Comparative performance evaluation of different LLMs.
Setting: N/A.
Participants: N/A.
Methods: Large language models (LLMs) are being extensively tested in medical education. However, their accuracy and effectiveness remain understudied, particularly in otolaryngology. This study involved inputting 350 single-best-answer multiple-choice questions, including 18 image-based questions, into four LLMs. Questions were sourced from six independent question banks related to (a) rhinology, (b) head and neck oncology, (c) endocrinology, (d) general otolaryngology, (e) paediatrics, (f) otology, (g) facial plastics, reconstruction and (h) trauma. LLMs were instructed to provide the reasoning for their answers, the length of which was recorded.
Results: Aggregate and subgroup analysis revealed that Gemini (79.8%) outperformed the other LLMs in accuracy, followed by GPT-4 (71.1%), Copilot (68.0%) and Bard (65.1%). The LLMs had significantly different average response lengths, with Bard (x̄ = 1685.24) being the longest and no difference between GPT-4 (x̄ = 827.34) and Copilot (x̄ = 904.12). Gemini's longer responses (x̄ = 1291.68) included explanatory images and links. Gemini and GPT-4 correctly answered image-based questions (n = 18), unlike Copilot and Bard, highlighting their adaptability and multimodal capabilities.
Conclusion: Gemini outperformed the other LLMs in terms of accuracy, followed by GPT-4, Copilot and Bard. GPT-4, despite ranking second in accuracy, provides concise and relevant explanations. Despite the promising performance of LLMs, medical learners should cautiously assess their accuracy and decision-making reliability.
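The two quantities reported in the abstract, accuracy and mean response length, can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the function name `score_model`, the toy data, and the choice to measure length in characters are all assumptions made for illustration, since the abstract does not state the unit of length or how answers were graded.

```python
from statistics import mean


def score_model(responses, answer_key):
    """Return (accuracy in percent, mean explanation length in characters).

    `responses` is a list of (chosen_option, explanation) pairs, one per
    question, in the same order as `answer_key`. Hypothetical helper; the
    character-based length is an assumption, not the study's stated unit.
    """
    if not answer_key or len(responses) != len(answer_key):
        raise ValueError("need exactly one response per question")
    # Count answers that match the key, ignoring case and stray whitespace.
    correct = sum(
        1
        for (choice, _), key in zip(responses, answer_key)
        if choice.strip().upper() == key.strip().upper()
    )
    accuracy = 100.0 * correct / len(answer_key)
    mean_length = mean(len(explanation) for _, explanation in responses)
    return accuracy, mean_length


# Toy data for illustration only; not taken from the study.
answer_key = ["A", "C", "B", "D"]
responses = [
    ("A", "Explanation one."),
    ("c", "Explanation two is longer."),
    ("D", "Short."),
    ("D", "Another explanation."),
]

accuracy, mean_length = score_model(responses, answer_key)
print(f"accuracy = {accuracy:.1f}%, mean length = {mean_length:.2f} characters")
```

Running the same function once per model over the full question set would yield one accuracy figure and one mean length per model, which is the shape of the comparison the Results report.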
About the journal:
Clinical Otolaryngology is a bimonthly journal devoted to clinically oriented research papers of the highest scientific standards dealing with:
current otorhinolaryngological practice
audiology, otology, balance, rhinology, larynx, voice and paediatric ORL
head and neck oncology
head and neck plastic and reconstructive surgery
continuing medical education and ORL training
The emphasis is on high quality new work in the clinical field and on fresh, original research.
Each issue begins with an editorial expressing the personal opinions of an individual with a particular knowledge of a chosen subject. The main body of each issue is then devoted to original papers carrying important results for those working in the field. In addition, topical review articles are published discussing a particular subject in depth, including not only the opinions of the author but also any controversies surrounding the subject.
• Negative/null results
In order for research to advance, negative results, which often make a valuable contribution to the field, should be published. However, articles containing negative or null results are frequently not considered for publication or rejected by journals. We welcome papers of this kind, where appropriate and valid power calculations are included that give confidence that a negative result can be relied upon.