Unveiling the clinical incapabilities: a benchmarking study of GPT-4V(ision) for ophthalmic multimodal image analysis.

IF 3.7. Zone 2 (Medicine). JCR Q1 Ophthalmology. British Journal of Ophthalmology. Pub Date: 2024-09-20. DOI: 10.1136/bjo-2023-325054

Pusheng Xu, Xiaolan Chen, Ziwei Zhao, Danli Shi


Citations: 0

Abstract

Purpose: To evaluate the capabilities and incapabilities of a GPT-4V(ision)-based chatbot in interpreting ocular multimodal images.

Methods: We developed a digital ophthalmologist app using GPT-4V and evaluated its performance with a dataset (60 images, 60 ophthalmic conditions, 6 modalities) that included slit-lamp, scanning laser ophthalmoscopy, fundus photography of the posterior pole (FPP), optical coherence tomography, fundus fluorescein angiography and ocular ultrasound images. The chatbot was tested with ten open-ended questions per image, covering examination identification, lesion detection, diagnosis and decision support. The responses were manually assessed for accuracy, usability, safety and diagnosis repeatability. Auto-evaluation was performed using sentence similarity and GPT-4-based auto-evaluation.
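The abstract does not say which sentence-similarity method was used. As a minimal sketch of the general idea only, the following compares a model answer with a human reference answer using cosine similarity over bag-of-words counts. The function names and both example sentences are hypothetical, and the study may well have used an embedding-based measure instead.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the bag-of-words count vectors of two sentences."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    # Only tokens shared by both sentences contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    # An empty sentence has zero norm; treat its similarity as 0.0.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical strings for illustration, not taken from the study's dataset.
model_answer = "This is an optical coherence tomography image of the macula."
human_answer = "Optical coherence tomography scan of the macula."
score = cosine_similarity(model_answer, human_answer)
```

A score near 1 means the two answers share most of their wording; a count-based measure like this one ignores synonyms, which is one reason embedding-based similarity is often preferred in practice.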

Results: Of the 600 responses, 30.6% were accurate, 21.5% were highly usable and 55.6% were rated as causing no harm. GPT-4V performed best on slit-lamp images, with 42.0%, 38.5% and 68.5% of responses rated accurate, highly usable and no harm, respectively. However, its performance was weaker on FPP images, with only 13.7%, 3.7% and 38.5% in the same categories. GPT-4V correctly identified 95.6% of the imaging modalities and showed varying accuracy in lesion identification (25.6%), diagnosis (16.1%) and decision support (24.0%). The overall repeatability of GPT-4V in diagnosing ocular images was 63.3% (38/60). The overall sentence similarity between GPT-4V responses and human answers was 55.5%, with Spearman correlations of 0.569 for accuracy and 0.576 for usability.
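The headline counts can be checked with simple arithmetic, and the Spearman figures are rank correlations. The sketch below reproduces the two counts stated in the abstract (60 images with ten questions each, and 38 of 60 repeatable diagnoses) and shows one way to compute a Spearman correlation with tie handling. The rating lists are hypothetical placeholders, not study data, and the function names are assumptions made for illustration.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    # The correlation is undefined when either input is constant.
    return cov / (sx * sy) if sx and sy else float("nan")


# Counts stated in the abstract.
total_responses = 60 * 10              # 60 images x 10 questions each
repeatability_pct = 100 * 38 / 60      # 38 of 60 diagnoses were repeatable

# Hypothetical ordinal ratings for six responses (NOT study data).
human_scores = [1, 2, 2, 3, 1, 3]
auto_scores = [1, 3, 2, 3, 2, 2]
rho = spearman(human_scores, auto_scores)
```

Ranking with averaged ties matters here because ordinal ratings on a short scale contain many repeated values.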

Conclusion: GPT-4V is not yet suitable for clinical decision-making in ophthalmology. Our study serves as a benchmark for enhancing ophthalmic multimodal models.


Source journal

British Journal of Ophthalmology (Medicine: Ophthalmology)

CiteScore: 10.30

Self-citation rate: 2.40%

Articles published: 213

Review time: 3-6 weeks

About the journal: The British Journal of Ophthalmology (BJO) is an international peer-reviewed journal for ophthalmologists and visual science specialists. BJO publishes clinical investigations, clinical observations, and clinically relevant laboratory investigations related to ophthalmology. It also provides major reviews and publishes manuscripts covering regional issues in a global context.