Unveiling the clinical incapabilities: a benchmarking study of GPT-4V(ision) for ophthalmic multimodal image analysis.

IF 3.7; Zone 2 (Medicine); JCR Q1 (Ophthalmology); British Journal of Ophthalmology; Pub Date: 2024-09-20; DOI: 10.1136/bjo-2023-325054
Pusheng Xu, Xiaolan Chen, Ziwei Zhao, Danli Shi

Abstract

Purpose: To evaluate the capabilities and incapabilities of a GPT-4V(ision)-based chatbot in interpreting ocular multimodal images.

Methods: We developed a digital ophthalmologist app using GPT-4V and evaluated its performance with a dataset (60 images, 60 ophthalmic conditions, 6 modalities) that included slit-lamp, scanning laser ophthalmoscopy, fundus photography of the posterior pole (FPP), optical coherence tomography, fundus fluorescein angiography and ocular ultrasound images. The chatbot was tested with ten open-ended questions per image, covering examination identification, lesion detection, diagnosis and decision support. The responses were manually assessed for accuracy, usability, safety and diagnosis repeatability. Auto-evaluation was performed using sentence similarity and GPT-4-based auto-evaluation.
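The abstract does not say which sentence-similarity measure was used. As a rough illustration only, the sketch below computes cosine similarity between bag-of-words count vectors of a model answer and a human reference answer. The function name `cosine_similarity` and both example sentences are hypothetical, and a real pipeline would more likely compare sentence embeddings than raw word counts.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts.

    This is an assumed stand-in for the study's sentence-similarity metric,
    which the abstract does not specify.
    """
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words that appear in both texts.
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared_words)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text has no direction to compare.
    return dot / (norm_a * norm_b)


# Hypothetical answers, invented for illustration; not taken from the study.
model_answer = "this is an optical coherence tomography image of the macula"
human_answer = "this is an optical coherence tomography scan"
score = cosine_similarity(model_answer, human_answer)
print(f"similarity: {score:.3f}")
```

A score near 1 means the two answers share most of their wording; a score of 0 means they share no words at all. Word overlap does not capture clinical correctness, which is one reason such a score is paired with manual grading.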

Results: Of 600 responses, 30.6% were accurate, 21.5% were highly usable and 55.6% were deemed to cause no harm. GPT-4V performed best with slit-lamp images, with 42.0%, 38.5% and 68.5% of the responses being accurate, highly usable and harmless, respectively. Its performance was weaker on FPP images, with only 13.7%, 3.7% and 38.5% of responses in the same respective categories. GPT-4V correctly identified 95.6% of the imaging modalities and showed varying accuracies in lesion identification (25.6%), diagnosis (16.1%) and decision support (24.0%). The overall repeatability of GPT-4V in diagnosing ocular images was 63.3% (38/60). The overall sentence similarity between responses generated by GPT-4V and human answers was 55.5%, with Spearman correlations of 0.569 for accuracy and 0.576 for usability.
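The figures above can be related with a little arithmetic, and the Spearman correlation can be sketched in a few lines. The block below first checks the response count and the repeatability percentage against the numbers in the abstract, then computes Spearman's rank correlation as the Pearson correlation of average ranks. The helper names and the two score lists are hypothetical and invented for illustration; they are not the study's data, and the abstract does not describe how the authors computed the correlation.

```python
import math

# Arithmetic taken directly from the abstract.
total_responses = 60 * 10                    # 60 images x 10 questions each
repeatability_pct = round(38 / 60 * 100, 1)  # 38 of 60 diagnoses repeated
print(total_responses, repeatability_pct)


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for six responses (NOT data from the study).
human_accuracy = [1, 2, 2, 3, 3, 3]            # manual grades
auto_similarity = [0.2, 0.5, 0.4, 0.6, 0.9, 0.7]  # automatic scores
rho = spearman(human_accuracy, auto_similarity)
print(f"Spearman rho: {rho:.3f}")
```

Because Spearman's rho depends only on rank order, it suits this kind of comparison, where manual grades are ordinal and automatic scores are continuous.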

Conclusion: GPT-4V is not yet suitable for clinical decision-making in ophthalmology. Our study serves as a benchmark for enhancing ophthalmic multimodal models.
