Accuracy of large language models in answering ophthalmology board-style questions: A meta-analysis

Asia-Pacific Journal of Ophthalmology | IF 3.7 | CAS Tier 3 (Medicine) | Q1 (Ophthalmology) | Pub Date: 2024-09-01 | DOI: 10.1016/j.apjo.2024.100106 | Full text: https://www.sciencedirect.com/science/article/pii/S2162098924001178
Jo-Hsuan Wu, Takashi Nishida, T. Y. Alvin Liu
{"title":"Accuracy of large language models in answering ophthalmology board-style questions: A meta-analysis","authors":"Jo-Hsuan Wu ,&nbsp;Takashi Nishida ,&nbsp;T. Y. Alvin Liu","doi":"10.1016/j.apjo.2024.100106","DOIUrl":null,"url":null,"abstract":"<div><h3>Purpose</h3><div>To evaluate the accuracy of large language models (LLMs) in answering ophthalmology board-style questions.</div></div><div><h3>Design</h3><div>Meta-analysis.</div></div><div><h3>Methods</h3><div>Literature search was conducted using PubMed and Embase in March 2024. We included full-length articles and research letters published in English that reported the accuracy of LLMs in answering ophthalmology board-style questions. Data on LLM performance, including the number of questions submitted and correct responses generated, were extracted for each question set from individual studies. Pooled accuracy was calculated using a random-effects model. Subgroup analyses were performed based on the LLMs used and specific ophthalmology topics assessed.</div></div><div><h3>Results</h3><div>Among the 14 studies retrieved, 13 (93 %) tested LLMs on multiple ophthalmology topics. ChatGPT-3.5, ChatGPT-4, Bard, and Bing Chat were assessed in 12 (86 %), 11 (79 %), 4 (29 %), and 4 (29 %) studies, respectively. The overall pooled accuracy of LLMs was 0.65 (95 % CI: 0.61–0.69). Among the different LLMs, ChatGPT-4 achieved the highest pooled accuracy at 0.74 (95 % CI: 0.73–0.79), while ChatGPT-3.5 recorded the lowest at 0.52 (95 % CI: 0.51–0.54). LLMs performed best in “pathology” (0.78 [95 % CI: 0.70–0.86]) and worst in “fundamentals and principles of ophthalmology” (0.52 [95 % CI: 0.48–0.56]).</div></div><div><h3>Conclusions</h3><div>The overall accuracy of LLMs in answering ophthalmology board-style questions was acceptable but not exceptional, with ChatGPT-4 and Bing Chat being top-performing models. Performance varied significantly based on specific ophthalmology topics tested. Inconsistent performances are of concern, highlighting the need for future studies to include ophthalmology board-style questions with images to more comprehensively examine the competency of LLMs.</div></div>","PeriodicalId":8594,"journal":{"name":"Asia-Pacific Journal of Ophthalmology","volume":"13 5","pages":"Article 100106"},"PeriodicalIF":3.7000,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Asia-Pacific Journal of Ophthalmology","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2162098924001178","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}

Abstract

Purpose

To evaluate the accuracy of large language models (LLMs) in answering ophthalmology board-style questions.

Design

Meta-analysis.

Methods

A literature search was conducted using PubMed and Embase in March 2024. We included full-length articles and research letters published in English that reported the accuracy of LLMs in answering ophthalmology board-style questions. For each question set in each study, we extracted data on LLM performance, including the number of questions submitted and the number of correct responses generated. Pooled accuracy was calculated using a random-effects model (sketched below). Subgroup analyses were performed based on the LLMs used and the specific ophthalmology topics assessed.
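The abstract states only that pooled accuracy was computed with a random-effects model. The sketch below assumes a standard DerSimonian-Laird estimator applied to logit-transformed proportions, which is one common way to pool accuracies; the `pool_accuracy` helper and the example question-set counts are hypothetical illustrations, not data from the study.

```python
# Minimal sketch of random-effects pooling of accuracies
# (DerSimonian-Laird on logit-transformed proportions). Assumed
# method and made-up counts; not the authors' actual analysis.
import math

def pool_accuracy(correct, total, z=1.96):
    """Pool per-question-set accuracies via a DL random-effects model."""
    # Logit-transform each study's accuracy; approximate its variance.
    y = [math.log(c / (n - c)) for c, n in zip(correct, total)]
    v = [1.0 / c + 1.0 / (n - c) for c, n in zip(correct, total)]

    # Fixed-effect weights and Cochran's Q heterogeneity statistic.
    w = [1.0 / vi for vi in v]
    mu_fe = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - mu_fe) ** 2 for wi, yi in zip(w, y))

    # DerSimonian-Laird between-study variance (tau^2), floored at 0.
    k = len(y)
    tau2 = max(0.0, (q - (k - 1)) / (sum(w) - sum(wi ** 2 for wi in w) / sum(w)))

    # Random-effects weights, pooled logit, and back-transformed 95% CI.
    w_re = [1.0 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    inv = lambda x: 1.0 / (1.0 + math.exp(-x))
    return inv(mu), inv(mu - z * se), inv(mu + z * se)

# Hypothetical (correct, total) counts for three question sets.
acc, lo, hi = pool_accuracy(correct=[52, 170, 88], total=[80, 260, 125])
print(f"pooled accuracy {acc:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

Subgroup estimates (per model, per topic) would come from running the same pooling separately on each subgroup's question sets.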

Results

Among the 14 studies retrieved, 13 (93 %) tested LLMs on multiple ophthalmology topics. ChatGPT-3.5, ChatGPT-4, Bard, and Bing Chat were assessed in 12 (86 %), 11 (79 %), 4 (29 %), and 4 (29 %) studies, respectively. The overall pooled accuracy of LLMs was 0.65 (95 % CI: 0.61–0.69). Among the different LLMs, ChatGPT-4 achieved the highest pooled accuracy at 0.74 (95 % CI: 0.73–0.79), while ChatGPT-3.5 recorded the lowest at 0.52 (95 % CI: 0.51–0.54). LLMs performed best in “pathology” (0.78 [95 % CI: 0.70–0.86]) and worst in “fundamentals and principles of ophthalmology” (0.52 [95 % CI: 0.48–0.56]).

Conclusions

The overall accuracy of LLMs in answering ophthalmology board-style questions was acceptable but not exceptional, with ChatGPT-4 and Bing Chat being the top-performing models. Performance varied significantly across the specific ophthalmology topics tested. This inconsistency is a concern and highlights the need for future studies to include ophthalmology board-style questions with images, to examine the competency of LLMs more comprehensively.
Journal metrics: CiteScore 8.10; self-citation rate 18.20%; annual publication volume 197 articles; review turnaround 6 weeks.
About the journal: The Asia-Pacific Journal of Ophthalmology, a bimonthly, peer-reviewed online scientific publication, is an official publication of the Asia-Pacific Academy of Ophthalmology (APAO), a supranational organization committed to research, training, learning, publication, and the transfer of knowledge and skills in ophthalmology and the visual sciences. The Asia-Pacific Journal of Ophthalmology welcomes review articles on current hot topics; original, previously unpublished manuscripts describing clinical investigations, clinical observations, and clinically relevant laboratory investigations; and perspectives containing personal viewpoints on topics of broad interest. Editorials are published by invitation only. Case reports are generally not considered. The Asia-Pacific Journal of Ophthalmology covers 16 subspecialties and is freely circulated among individual members of the APAO's member societies, which amounts to a potential readership of over 50,000.