Comparing generative and retrieval-based chatbots in answering patient questions regarding age-related macular degeneration and diabetic retinopathy.

British Journal of Ophthalmology · IF 3.5 · Tier 2 (Medicine) · Q1 Ophthalmology · Pub Date: 2024-09-20 · Pages: 1443-1449 · DOI: 10.1136/bjo-2023-324533
Kai Xiong Cheong, Chenxi Zhang, Tien-En Tan, Beau J Fenner, Wendy Meihua Wong, Kelvin Yc Teo, Ya Xing Wang, Sobha Sivaprasad, Pearse A Keane, Cecilia Sungmin Lee, Aaron Y Lee, Chui Ming Gemmy Cheung, Tien Yin Wong, Yun-Gyung Cheong, Su Jeong Song, Yih Chung Tham
{"title":"Comparing generative and retrieval-based chatbots in answering patient questions regarding age-related macular degeneration and diabetic retinopathy.","authors":"Kai Xiong Cheong, Chenxi Zhang, Tien-En Tan, Beau J Fenner, Wendy Meihua Wong, Kelvin Yc Teo, Ya Xing Wang, Sobha Sivaprasad, Pearse A Keane, Cecilia Sungmin Lee, Aaron Y Lee, Chui Ming Gemmy Cheung, Tien Yin Wong, Yun-Gyung Cheong, Su Jeong Song, Yih Chung Tham","doi":"10.1136/bjo-2023-324533","DOIUrl":null,"url":null,"abstract":"<p><strong>Background/aims: </strong>To compare the performance of generative versus retrieval-based chatbots in answering patient inquiries regarding age-related macular degeneration (AMD) and diabetic retinopathy (DR).</p><p><strong>Methods: </strong>We evaluated four chatbots: generative models (ChatGPT-4, ChatGPT-3.5 and Google Bard) and a retrieval-based model (OcularBERT) in a cross-sectional study. Their response accuracy to 45 questions (15 AMD, 15 DR and 15 others) was evaluated and compared. Three masked retinal specialists graded the responses using a three-point Likert scale: either 2 (good, error-free), 1 (borderline) or 0 (poor with significant inaccuracies). The scores were aggregated, ranging from 0 to 6. Based on majority consensus among the graders, the responses were also classified as 'Good', 'Borderline' or 'Poor' quality.</p><p><strong>Results: </strong>Overall, ChatGPT-4 and ChatGPT-3.5 outperformed the other chatbots, both achieving median scores (IQR) of 6 (1), compared with 4.5 (2) in Google Bard, and 2 (1) in OcularBERT (all p ≤8.4×10<sup>-3</sup>). Based on the consensus approach, 83.3% of ChatGPT-4's responses and 86.7% of ChatGPT-3.5's were rated as 'Good', surpassing Google Bard (50%) and OcularBERT (10%) (all p ≤1.4×10<sup>-2</sup>). ChatGPT-4 and ChatGPT-3.5 had no 'Poor' rated responses. Google Bard produced 6.7% Poor responses, and OcularBERT produced 20%. Across question types, ChatGPT-4 outperformed Google Bard only for AMD, and ChatGPT-3.5 outperformed Google Bard for DR and others.</p><p><strong>Conclusion: </strong>ChatGPT-4 and ChatGPT-3.5 demonstrated superior performance, followed by Google Bard and OcularBERT. Generative chatbots are potentially capable of answering domain-specific questions outside their original training. Further validation studies are still required prior to real-world implementation.</p>","PeriodicalId":9313,"journal":{"name":"British Journal of Ophthalmology","volume":" ","pages":"1443-1449"},"PeriodicalIF":3.5000,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11716104/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"British Journal of Ophthalmology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1136/bjo-2023-324533","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}
引用次数: 0

Abstract

Background/aims: To compare the performance of generative versus retrieval-based chatbots in answering patient inquiries regarding age-related macular degeneration (AMD) and diabetic retinopathy (DR).

Methods: We evaluated four chatbots in a cross-sectional study: three generative models (ChatGPT-4, ChatGPT-3.5 and Google Bard) and a retrieval-based model (OcularBERT). Their response accuracy to 45 questions (15 on AMD, 15 on DR and 15 on other topics) was evaluated and compared. Three masked retinal specialists graded each response on a three-point Likert scale: 2 (good, error-free), 1 (borderline) or 0 (poor, with significant inaccuracies). The three grades were summed into an aggregate score ranging from 0 to 6. Based on majority consensus among the graders, responses were also classified as 'Good', 'Borderline' or 'Poor' quality.
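To make the grading scheme concrete, the sketch below (Python; not from the paper) shows one way the score aggregation and majority-consensus classification could be implemented. The function names are invented for illustration, and the handling of a three-way split among graders is an assumption the abstract does not resolve.

```python
from collections import Counter

# Map each 0-2 Likert grade to its quality label (per the Methods above).
LABELS = {2: "Good", 1: "Borderline", 0: "Poor"}

def aggregate_score(grades):
    """Sum three graders' 0-2 Likert grades into a 0-6 aggregate score."""
    assert len(grades) == 3 and all(g in LABELS for g in grades)
    return sum(grades)

def consensus_label(grades):
    """Majority vote over the three graders' quality labels.

    The paper does not say how a three-way split (e.g. grades 2, 1, 0)
    is resolved; returning None here is a placeholder assumption.
    """
    label, count = Counter(LABELS[g] for g in grades).most_common(1)[0]
    return label if count >= 2 else None

# Example: one chatbot response graded by the three masked specialists
print(aggregate_score([2, 2, 1]))   # 5
print(consensus_label([2, 2, 1]))   # Good
```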

Results: Overall, ChatGPT-4 and ChatGPT-3.5 outperformed the other chatbots, both achieving median (IQR) scores of 6 (1), compared with 4.5 (2) for Google Bard and 2 (1) for OcularBERT (all p≤8.4×10⁻³). Under the consensus approach, 83.3% of ChatGPT-4's responses and 86.7% of ChatGPT-3.5's were rated 'Good', surpassing Google Bard (50%) and OcularBERT (10%) (all p≤1.4×10⁻²). ChatGPT-4 and ChatGPT-3.5 had no responses rated 'Poor', compared with 6.7% for Google Bard and 20% for OcularBERT. Across question types, ChatGPT-4 outperformed Google Bard only for AMD questions, whereas ChatGPT-3.5 outperformed Google Bard for DR and other questions.
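The abstract does not name the statistical tests behind the p-values, so the sketch below stops at the descriptive "median (IQR)" summary it reports; the example scores are hypothetical, and the quartile convention (here NumPy's default linear interpolation) is an assumption.

```python
import numpy as np

def median_iqr(scores):
    """Return the median and IQR (Q3 - Q1) of a set of 0-6 aggregate scores."""
    q1, med, q3 = np.percentile(np.asarray(scores, dtype=float), [25, 50, 75])
    return med, q3 - q1

# Hypothetical aggregate scores for one chatbot across a set of questions
example = [6, 6, 5, 6, 4, 6, 5, 6, 6, 3]
med, iqr = median_iqr(example)
print(f"median (IQR): {med:g} ({iqr:g})")
```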

Conclusion: ChatGPT-4 and ChatGPT-3.5 demonstrated superior performance, followed by Google Bard and OcularBERT. Generative chatbots are potentially capable of answering domain-specific questions outside their original training. Further validation studies are still required prior to real-world implementation.

Journal metrics: CiteScore 10.30 · Self-citation rate 2.40% · Articles per year 213 · Time to review 3-6 weeks
About the journal: The British Journal of Ophthalmology (BJO) is an international peer-reviewed journal for ophthalmologists and visual science specialists. BJO publishes clinical investigations, clinical observations, and clinically relevant laboratory investigations related to ophthalmology, as well as major reviews and manuscripts covering regional issues in a global context.