Comparing generative and retrieval-based chatbots in answering patient questions regarding age-related macular degeneration and diabetic retinopathy.

British Journal of Ophthalmology · IF 3.5 · Tier 2 (Medicine) · Q1 Ophthalmology · Pub Date: 2024-09-20 · Pages: 1443-1449 · DOI: 10.1136/bjo-2023-324533
Kai Xiong Cheong, Chenxi Zhang, Tien-En Tan, Beau J Fenner, Wendy Meihua Wong, Kelvin Yc Teo, Ya Xing Wang, Sobha Sivaprasad, Pearse A Keane, Cecilia Sungmin Lee, Aaron Y Lee, Chui Ming Gemmy Cheung, Tien Yin Wong, Yun-Gyung Cheong, Su Jeong Song, Yih Chung Tham
{"title":"Comparing generative and retrieval-based chatbots in answering patient questions regarding age-related macular degeneration and diabetic retinopathy.","authors":"Kai Xiong Cheong, Chenxi Zhang, Tien-En Tan, Beau J Fenner, Wendy Meihua Wong, Kelvin Yc Teo, Ya Xing Wang, Sobha Sivaprasad, Pearse A Keane, Cecilia Sungmin Lee, Aaron Y Lee, Chui Ming Gemmy Cheung, Tien Yin Wong, Yun-Gyung Cheong, Su Jeong Song, Yih Chung Tham","doi":"10.1136/bjo-2023-324533","DOIUrl":null,"url":null,"abstract":"<p><strong>Background/aims: </strong>To compare the performance of generative versus retrieval-based chatbots in answering patient inquiries regarding age-related macular degeneration (AMD) and diabetic retinopathy (DR).</p><p><strong>Methods: </strong>We evaluated four chatbots: generative models (ChatGPT-4, ChatGPT-3.5 and Google Bard) and a retrieval-based model (OcularBERT) in a cross-sectional study. Their response accuracy to 45 questions (15 AMD, 15 DR and 15 others) was evaluated and compared. Three masked retinal specialists graded the responses using a three-point Likert scale: either 2 (good, error-free), 1 (borderline) or 0 (poor with significant inaccuracies). The scores were aggregated, ranging from 0 to 6. Based on majority consensus among the graders, the responses were also classified as 'Good', 'Borderline' or 'Poor' quality.</p><p><strong>Results: </strong>Overall, ChatGPT-4 and ChatGPT-3.5 outperformed the other chatbots, both achieving median scores (IQR) of 6 (1), compared with 4.5 (2) in Google Bard, and 2 (1) in OcularBERT (all p ≤8.4×10<sup>-3</sup>). Based on the consensus approach, 83.3% of ChatGPT-4's responses and 86.7% of ChatGPT-3.5's were rated as 'Good', surpassing Google Bard (50%) and OcularBERT (10%) (all p ≤1.4×10<sup>-2</sup>). ChatGPT-4 and ChatGPT-3.5 had no 'Poor' rated responses. Google Bard produced 6.7% Poor responses, and OcularBERT produced 20%. Across question types, ChatGPT-4 outperformed Google Bard only for AMD, and ChatGPT-3.5 outperformed Google Bard for DR and others.</p><p><strong>Conclusion: </strong>ChatGPT-4 and ChatGPT-3.5 demonstrated superior performance, followed by Google Bard and OcularBERT. Generative chatbots are potentially capable of answering domain-specific questions outside their original training. Further validation studies are still required prior to real-world implementation.</p>","PeriodicalId":9313,"journal":{"name":"British Journal of Ophthalmology","volume":" ","pages":"1443-1449"},"PeriodicalIF":3.5000,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11716104/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"British Journal of Ophthalmology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1136/bjo-2023-324533","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}
引用次数: 0

Abstract

Background/aims: To compare the performance of generative versus retrieval-based chatbots in answering patient inquiries regarding age-related macular degeneration (AMD) and diabetic retinopathy (DR).

Methods: We evaluated four chatbots in a cross-sectional study: three generative models (ChatGPT-4, ChatGPT-3.5 and Google Bard) and a retrieval-based model (OcularBERT). Their response accuracy to 45 questions (15 on AMD, 15 on DR and 15 on other topics) was evaluated and compared. Three masked retinal specialists graded each response on a three-point Likert scale: 2 (good, error-free), 1 (borderline) or 0 (poor, with significant inaccuracies). The three grades were summed into an aggregate score ranging from 0 to 6. Based on majority consensus among the graders, responses were also classified as 'Good', 'Borderline' or 'Poor' quality.
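To make the grading scheme concrete, the sketch below (Python; not from the paper) shows one way the score aggregation and majority-consensus classification could be implemented. The function names are invented for illustration, and the handling of a three-way split among graders is an assumption the abstract does not resolve.

```python
from collections import Counter

# Map each 0-2 Likert grade to its quality label (per the Methods above).
LABELS = {2: "Good", 1: "Borderline", 0: "Poor"}

def aggregate_score(grades):
    """Sum three graders' 0-2 Likert grades into a 0-6 aggregate score."""
    assert len(grades) == 3 and all(g in LABELS for g in grades)
    return sum(grades)

def consensus_label(grades):
    """Majority vote over the three graders' quality labels.

    The paper does not say how a three-way split (e.g. grades 2, 1, 0)
    is resolved; returning None here is a placeholder assumption.
    """
    label, count = Counter(LABELS[g] for g in grades).most_common(1)[0]
    return label if count >= 2 else None

# Example: one chatbot response graded by the three masked specialists
print(aggregate_score([2, 2, 1]))   # 5
print(consensus_label([2, 2, 1]))   # Good
```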

Results: Overall, ChatGPT-4 and ChatGPT-3.5 outperformed the other chatbots, both achieving median (IQR) scores of 6 (1), compared with 4.5 (2) for Google Bard and 2 (1) for OcularBERT (all p≤8.4×10⁻³). Under the consensus approach, 83.3% of ChatGPT-4's responses and 86.7% of ChatGPT-3.5's were rated 'Good', surpassing Google Bard (50%) and OcularBERT (10%) (all p≤1.4×10⁻²). ChatGPT-4 and ChatGPT-3.5 had no responses rated 'Poor', compared with 6.7% for Google Bard and 20% for OcularBERT. Across question types, ChatGPT-4 outperformed Google Bard only for AMD questions, whereas ChatGPT-3.5 outperformed Google Bard for DR and other questions.
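The abstract does not name the statistical tests behind the p-values, so the sketch below stops at the descriptive "median (IQR)" summary it reports; the example scores are hypothetical, and the quartile convention (here NumPy's default linear interpolation) is an assumption.

```python
import numpy as np

def median_iqr(scores):
    """Return the median and IQR (Q3 - Q1) of a set of 0-6 aggregate scores."""
    q1, med, q3 = np.percentile(np.asarray(scores, dtype=float), [25, 50, 75])
    return med, q3 - q1

# Hypothetical aggregate scores for one chatbot across a set of questions
example = [6, 6, 5, 6, 4, 6, 5, 6, 6, 3]
med, iqr = median_iqr(example)
print(f"median (IQR): {med:g} ({iqr:g})")
```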

Conclusion: ChatGPT-4 and ChatGPT-3.5 demonstrated superior performance, followed by Google Bard and OcularBERT. Generative chatbots are potentially capable of answering domain-specific questions outside their original training. Further validation studies are still required prior to real-world implementation.

Journal metrics: CiteScore 10.30 · Self-citation rate 2.40% · Articles per year 213 · Time to review 3-6 weeks
About the journal: The British Journal of Ophthalmology (BJO) is an international peer-reviewed journal for ophthalmologists and visual science specialists. BJO publishes clinical investigations, clinical observations, and clinically relevant laboratory investigations related to ophthalmology, as well as major reviews and manuscripts covering regional issues in a global context.