Accuracy of large language models in answering ophthalmology board-style questions: A meta-analysis

Asia-Pacific Journal of Ophthalmology | IF 3.7 | CAS Tier 3 (Medicine) | Q1 (Ophthalmology) | Pub Date: 2024-09-01 | DOI: 10.1016/j.apjo.2024.100106 | Full text: https://www.sciencedirect.com/science/article/pii/S2162098924001178
Jo-Hsuan Wu, Takashi Nishida, T. Y. Alvin Liu
{"title":"Accuracy of large language models in answering ophthalmology board-style questions: A meta-analysis","authors":"Jo-Hsuan Wu ,&nbsp;Takashi Nishida ,&nbsp;T. Y. Alvin Liu","doi":"10.1016/j.apjo.2024.100106","DOIUrl":null,"url":null,"abstract":"<div><h3>Purpose</h3><div>To evaluate the accuracy of large language models (LLMs) in answering ophthalmology board-style questions.</div></div><div><h3>Design</h3><div>Meta-analysis.</div></div><div><h3>Methods</h3><div>Literature search was conducted using PubMed and Embase in March 2024. We included full-length articles and research letters published in English that reported the accuracy of LLMs in answering ophthalmology board-style questions. Data on LLM performance, including the number of questions submitted and correct responses generated, were extracted for each question set from individual studies. Pooled accuracy was calculated using a random-effects model. Subgroup analyses were performed based on the LLMs used and specific ophthalmology topics assessed.</div></div><div><h3>Results</h3><div>Among the 14 studies retrieved, 13 (93 %) tested LLMs on multiple ophthalmology topics. ChatGPT-3.5, ChatGPT-4, Bard, and Bing Chat were assessed in 12 (86 %), 11 (79 %), 4 (29 %), and 4 (29 %) studies, respectively. The overall pooled accuracy of LLMs was 0.65 (95 % CI: 0.61–0.69). Among the different LLMs, ChatGPT-4 achieved the highest pooled accuracy at 0.74 (95 % CI: 0.73–0.79), while ChatGPT-3.5 recorded the lowest at 0.52 (95 % CI: 0.51–0.54). LLMs performed best in “pathology” (0.78 [95 % CI: 0.70–0.86]) and worst in “fundamentals and principles of ophthalmology” (0.52 [95 % CI: 0.48–0.56]).</div></div><div><h3>Conclusions</h3><div>The overall accuracy of LLMs in answering ophthalmology board-style questions was acceptable but not exceptional, with ChatGPT-4 and Bing Chat being top-performing models. Performance varied significantly based on specific ophthalmology topics tested. Inconsistent performances are of concern, highlighting the need for future studies to include ophthalmology board-style questions with images to more comprehensively examine the competency of LLMs.</div></div>","PeriodicalId":8594,"journal":{"name":"Asia-Pacific Journal of Ophthalmology","volume":"13 5","pages":"Article 100106"},"PeriodicalIF":3.7000,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Asia-Pacific Journal of Ophthalmology","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2162098924001178","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}

Abstract

Purpose

To evaluate the accuracy of large language models (LLMs) in answering ophthalmology board-style questions.

Design

Meta-analysis.

Methods

A literature search was conducted using PubMed and Embase in March 2024. We included full-length articles and research letters published in English that reported the accuracy of LLMs in answering ophthalmology board-style questions. For each question set in each study, we extracted data on LLM performance, including the number of questions submitted and the number of correct responses generated. Pooled accuracy was calculated using a random-effects model (sketched below). Subgroup analyses were performed based on the LLMs used and the specific ophthalmology topics assessed.
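The abstract states only that pooled accuracy was computed with a random-effects model. The sketch below assumes a standard DerSimonian-Laird estimator applied to logit-transformed proportions, which is one common way to pool accuracies; the `pool_accuracy` helper and the example question-set counts are hypothetical illustrations, not data from the study.

```python
# Minimal sketch of random-effects pooling of accuracies
# (DerSimonian-Laird on logit-transformed proportions). Assumed
# method and made-up counts; not the authors' actual analysis.
import math

def pool_accuracy(correct, total, z=1.96):
    """Pool per-question-set accuracies via a DL random-effects model."""
    # Logit-transform each study's accuracy; approximate its variance.
    y = [math.log(c / (n - c)) for c, n in zip(correct, total)]
    v = [1.0 / c + 1.0 / (n - c) for c, n in zip(correct, total)]

    # Fixed-effect weights and Cochran's Q heterogeneity statistic.
    w = [1.0 / vi for vi in v]
    mu_fe = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - mu_fe) ** 2 for wi, yi in zip(w, y))

    # DerSimonian-Laird between-study variance (tau^2), floored at 0.
    k = len(y)
    tau2 = max(0.0, (q - (k - 1)) / (sum(w) - sum(wi ** 2 for wi in w) / sum(w)))

    # Random-effects weights, pooled logit, and back-transformed 95% CI.
    w_re = [1.0 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    inv = lambda x: 1.0 / (1.0 + math.exp(-x))
    return inv(mu), inv(mu - z * se), inv(mu + z * se)

# Hypothetical (correct, total) counts for three question sets.
acc, lo, hi = pool_accuracy(correct=[52, 170, 88], total=[80, 260, 125])
print(f"pooled accuracy {acc:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

Subgroup estimates (per model, per topic) would come from running the same pooling separately on each subgroup's question sets.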

Results

Among the 14 studies retrieved, 13 (93 %) tested LLMs on multiple ophthalmology topics. ChatGPT-3.5, ChatGPT-4, Bard, and Bing Chat were assessed in 12 (86 %), 11 (79 %), 4 (29 %), and 4 (29 %) studies, respectively. The overall pooled accuracy of LLMs was 0.65 (95 % CI: 0.61–0.69). Among the different LLMs, ChatGPT-4 achieved the highest pooled accuracy at 0.74 (95 % CI: 0.73–0.79), while ChatGPT-3.5 recorded the lowest at 0.52 (95 % CI: 0.51–0.54). LLMs performed best in “pathology” (0.78 [95 % CI: 0.70–0.86]) and worst in “fundamentals and principles of ophthalmology” (0.52 [95 % CI: 0.48–0.56]).

Conclusions

The overall accuracy of LLMs in answering ophthalmology board-style questions was acceptable but not exceptional, with ChatGPT-4 and Bing Chat being the top-performing models. Performance varied significantly across the specific ophthalmology topics tested. This inconsistency is a concern and highlights the need for future studies to include ophthalmology board-style questions with images, to examine the competency of LLMs more comprehensively.
Journal metrics: CiteScore 8.10; self-citation rate 18.20%; annual publication volume 197 articles; review turnaround 6 weeks.
About the journal: The Asia-Pacific Journal of Ophthalmology, a bimonthly, peer-reviewed online scientific publication, is an official publication of the Asia-Pacific Academy of Ophthalmology (APAO), a supranational organization committed to research, training, learning, publication, and the transfer of knowledge and skills in ophthalmology and the visual sciences. The Asia-Pacific Journal of Ophthalmology welcomes review articles on current hot topics; original, previously unpublished manuscripts describing clinical investigations, clinical observations, and clinically relevant laboratory investigations; and perspectives containing personal viewpoints on topics of broad interest. Editorials are published by invitation only. Case reports are generally not considered. The Asia-Pacific Journal of Ophthalmology covers 16 subspecialties and is freely circulated among individual members of the APAO's member societies, which amounts to a potential readership of over 50,000.