{"title":"大语言模型在回答眼科委员会类型问题时的准确性:元分析。","authors":"Jo-Hsuan Wu , Takashi Nishida , T. Y. Alvin Liu","doi":"10.1016/j.apjo.2024.100106","DOIUrl":null,"url":null,"abstract":"<div><h3>Purpose</h3><div>To evaluate the accuracy of large language models (LLMs) in answering ophthalmology board-style questions.</div></div><div><h3>Design</h3><div>Meta-analysis.</div></div><div><h3>Methods</h3><div>Literature search was conducted using PubMed and Embase in March 2024. We included full-length articles and research letters published in English that reported the accuracy of LLMs in answering ophthalmology board-style questions. Data on LLM performance, including the number of questions submitted and correct responses generated, were extracted for each question set from individual studies. Pooled accuracy was calculated using a random-effects model. Subgroup analyses were performed based on the LLMs used and specific ophthalmology topics assessed.</div></div><div><h3>Results</h3><div>Among the 14 studies retrieved, 13 (93 %) tested LLMs on multiple ophthalmology topics. ChatGPT-3.5, ChatGPT-4, Bard, and Bing Chat were assessed in 12 (86 %), 11 (79 %), 4 (29 %), and 4 (29 %) studies, respectively. The overall pooled accuracy of LLMs was 0.65 (95 % CI: 0.61–0.69). Among the different LLMs, ChatGPT-4 achieved the highest pooled accuracy at 0.74 (95 % CI: 0.73–0.79), while ChatGPT-3.5 recorded the lowest at 0.52 (95 % CI: 0.51–0.54). LLMs performed best in “pathology” (0.78 [95 % CI: 0.70–0.86]) and worst in “fundamentals and principles of ophthalmology” (0.52 [95 % CI: 0.48–0.56]).</div></div><div><h3>Conclusions</h3><div>The overall accuracy of LLMs in answering ophthalmology board-style questions was acceptable but not exceptional, with ChatGPT-4 and Bing Chat being top-performing models. Performance varied significantly based on specific ophthalmology topics tested. Inconsistent performances are of concern, highlighting the need for future studies to include ophthalmology board-style questions with images to more comprehensively examine the competency of LLMs.</div></div>","PeriodicalId":8594,"journal":{"name":"Asia-Pacific Journal of Ophthalmology","volume":"13 5","pages":"Article 100106"},"PeriodicalIF":3.7000,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Accuracy of large language models in answering ophthalmology board-style questions: A meta-analysis\",\"authors\":\"Jo-Hsuan Wu , Takashi Nishida , T. Y. Alvin Liu\",\"doi\":\"10.1016/j.apjo.2024.100106\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Purpose</h3><div>To evaluate the accuracy of large language models (LLMs) in answering ophthalmology board-style questions.</div></div><div><h3>Design</h3><div>Meta-analysis.</div></div><div><h3>Methods</h3><div>Literature search was conducted using PubMed and Embase in March 2024. We included full-length articles and research letters published in English that reported the accuracy of LLMs in answering ophthalmology board-style questions. Data on LLM performance, including the number of questions submitted and correct responses generated, were extracted for each question set from individual studies. Pooled accuracy was calculated using a random-effects model. 
Subgroup analyses were performed based on the LLMs used and specific ophthalmology topics assessed.</div></div><div><h3>Results</h3><div>Among the 14 studies retrieved, 13 (93 %) tested LLMs on multiple ophthalmology topics. ChatGPT-3.5, ChatGPT-4, Bard, and Bing Chat were assessed in 12 (86 %), 11 (79 %), 4 (29 %), and 4 (29 %) studies, respectively. The overall pooled accuracy of LLMs was 0.65 (95 % CI: 0.61–0.69). Among the different LLMs, ChatGPT-4 achieved the highest pooled accuracy at 0.74 (95 % CI: 0.73–0.79), while ChatGPT-3.5 recorded the lowest at 0.52 (95 % CI: 0.51–0.54). LLMs performed best in “pathology” (0.78 [95 % CI: 0.70–0.86]) and worst in “fundamentals and principles of ophthalmology” (0.52 [95 % CI: 0.48–0.56]).</div></div><div><h3>Conclusions</h3><div>The overall accuracy of LLMs in answering ophthalmology board-style questions was acceptable but not exceptional, with ChatGPT-4 and Bing Chat being top-performing models. Performance varied significantly based on specific ophthalmology topics tested. Inconsistent performances are of concern, highlighting the need for future studies to include ophthalmology board-style questions with images to more comprehensively examine the competency of LLMs.</div></div>\",\"PeriodicalId\":8594,\"journal\":{\"name\":\"Asia-Pacific Journal of Ophthalmology\",\"volume\":\"13 5\",\"pages\":\"Article 100106\"},\"PeriodicalIF\":3.7000,\"publicationDate\":\"2024-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Asia-Pacific Journal of Ophthalmology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2162098924001178\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"OPHTHALMOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Asia-Pacific Journal of Ophthalmology","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2162098924001178","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}
Accuracy of large language models in answering ophthalmology board-style questions: A meta-analysis
Purpose
To evaluate the accuracy of large language models (LLMs) in answering ophthalmology board-style questions.
Design
Meta-analysis.
Methods
A literature search was conducted using PubMed and Embase in March 2024. We included full-length articles and research letters published in English that reported the accuracy of LLMs in answering ophthalmology board-style questions. Data on LLM performance, including the number of questions submitted and the number of correct responses generated, were extracted for each question set from individual studies. Pooled accuracy was calculated using a random-effects model. Subgroup analyses were performed based on the LLMs used and the specific ophthalmology topics assessed.
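The abstract does not specify the pooling transformation or software used. Below is a minimal, hypothetical sketch of how per-study accuracies (correct answers out of questions asked) could be pooled with a DerSimonian-Laird random-effects model on logit-transformed proportions; the function name pool_accuracy_dl and the study counts in the usage example are illustrative assumptions, not data from the included studies.

# Minimal sketch of DerSimonian-Laird random-effects pooling of accuracies.
# All study counts below are hypothetical placeholders, not values from the paper.
import numpy as np

def pool_accuracy_dl(correct, total):
    """Pool per-study accuracies with a DerSimonian-Laird random-effects model.

    correct, total: numbers of correct answers and of questions per question set.
    Returns the pooled accuracy and its 95% confidence interval.
    Proportions are logit-transformed for pooling, then back-transformed.
    """
    correct = np.asarray(correct, dtype=float)
    total = np.asarray(total, dtype=float)
    p = correct / total
    # Logit transform; approximate variance of a logit proportion is 1/(n*p*(1-p)).
    y = np.log(p / (1 - p))
    v = 1.0 / (total * p * (1 - p))

    # Fixed-effect weights and Cochran's Q to estimate between-study heterogeneity.
    w = 1.0 / v
    y_fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - y_fixed) ** 2)
    df = len(y) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)  # between-study variance (tau^2), truncated at 0

    # Random-effects weights incorporate tau^2.
    w_re = 1.0 / (v + tau2)
    y_re = np.sum(w_re * y) / np.sum(w_re)
    se_re = np.sqrt(1.0 / np.sum(w_re))
    lo, hi = y_re - 1.96 * se_re, y_re + 1.96 * se_re

    inv_logit = lambda x: 1.0 / (1.0 + np.exp(-x))
    return inv_logit(y_re), (inv_logit(lo), inv_logit(hi))

# Hypothetical usage: three question sets with (correct, total) counts.
acc, ci = pool_accuracy_dl(correct=[130, 88, 152], total=[200, 150, 230])
print(f"Pooled accuracy: {acc:.2f} (95% CI: {ci[0]:.2f}-{ci[1]:.2f})")

Subgroup estimates (per LLM or per ophthalmology topic) would follow the same calculation, applied to the subset of question sets in each subgroup.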
Results
Among the 14 studies retrieved, 13 (93 %) tested LLMs on multiple ophthalmology topics. ChatGPT-3.5, ChatGPT-4, Bard, and Bing Chat were assessed in 12 (86 %), 11 (79 %), 4 (29 %), and 4 (29 %) studies, respectively. The overall pooled accuracy of LLMs was 0.65 (95 % CI: 0.61–0.69). Among the different LLMs, ChatGPT-4 achieved the highest pooled accuracy at 0.74 (95 % CI: 0.73–0.79), while ChatGPT-3.5 recorded the lowest at 0.52 (95 % CI: 0.51–0.54). LLMs performed best in “pathology” (0.78 [95 % CI: 0.70–0.86]) and worst in “fundamentals and principles of ophthalmology” (0.52 [95 % CI: 0.48–0.56]).
Conclusions
The overall accuracy of LLMs in answering ophthalmology board-style questions was acceptable but not exceptional, with ChatGPT-4 and Bing Chat being the top-performing models. Performance varied significantly across the specific ophthalmology topics tested. This inconsistent performance is of concern and highlights the need for future studies to include ophthalmology board-style questions with images to examine the competency of LLMs more comprehensively.
Journal Introduction
The Asia-Pacific Journal of Ophthalmology, a bimonthly, peer-reviewed online scientific publication, is an official publication of the Asia-Pacific Academy of Ophthalmology (APAO), a supranational organization committed to research, training, learning, publication, and knowledge and skill transfer in ophthalmology and visual sciences. The Asia-Pacific Journal of Ophthalmology welcomes review articles on currently hot topics; original, previously unpublished manuscripts describing clinical investigations, clinical observations, and clinically relevant laboratory investigations; as well as perspectives containing personal viewpoints on topics of broad interest. Editorials are published by invitation only. Case reports are generally not considered. The Asia-Pacific Journal of Ophthalmology covers 16 subspecialties and is freely circulated among individual members of the APAO's member societies, which amounts to a potential readership of over 50,000.