Assessing the Responses of Large Language Models (ChatGPT-4, Claude 3, Gemini, and Microsoft Copilot) to Frequently Asked Questions in Retinopathy of Prematurity: A Study on Readability and Appropriateness.

IF: 1.0 | Medicine, Quartile 4 | JCR Q4 (Ophthalmology) | Journal of Pediatric Ophthalmology & Strabismus | Pub Date: 2024-10-28 | DOI: 10.3928/01913913-20240911-05
Serhat Ermis, Ece Özal, Murat Karapapak, Ebrar Kumantaş, Sadık Altan Özal
{"title":"Assessing the Responses of Large Language Models (ChatGPT-4, Claude 3, Gemini, and Microsoft Copilot) to Frequently Asked Questions in Retinopathy of Prematurity: A Study on Readability and Appropriateness.","authors":"Serhat Ermis, Ece Özal, Murat Karapapak, Ebrar Kumantaş, Sadık Altan Özal","doi":"10.3928/01913913-20240911-05","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>To assess the appropriateness and readability of responses provided by four large language models (LLMs) (ChatGPT-4, Claude 3, Gemini, and Microsoft Co-pilot) to parents' queries pertaining to retinopathy of prematurity (ROP).</p><p><strong>Methods: </strong>A total of 60 frequently asked questions were collated and categorized into six distinct sections. The responses generated by the LLMs were evaluated by three experienced ROP specialists to determine their appropriateness and comprehensiveness. Additionally, the readability of the responses was assessed using a range of metrics, including the Flesch-Kincaid Grade Level (FKGL), Gunning Fog (GF) Index, Coleman-Liau (CL) Index, Simple Measure of Gobbledygook (SMOG) Index, and Flesch Reading Ease (FRE) score.</p><p><strong>Results: </strong>ChatGPT-4 demonstrated the highest level of appropriateness (100%) and performed exceptionally well in the Likert analysis, scoring 5 points on 96% of questions. The CL Index and FRE scores identified Gemini as the most readable LLM, whereas the GF Index and SMOG Index rated Microsoft Copilot as the most readable. Nevertheless, ChatGPT-4 exhibited the most intricate text structure, with scores of 18.56 on the GF Index, 18.56 on the CL Index, 17.2 on the SMOG Index, and 9.45 on the FRE score. This suggests that the responses demand a college-level comprehension.</p><p><strong>Conclusions: </strong>ChatGPT-4 demonstrated higher performance than other LLMs in responding to questions related to ROP; however, its texts were more complex. In terms of readability, Gemini and Microsoft Copilot were found to be more successful. <b>[<i>J Pediatr Ophthalmol Strabismus</i>. 20XX;XX(X):XXX-XXX.]</b>.</p>","PeriodicalId":50095,"journal":{"name":"Journal of Pediatric Ophthalmology & Strabismus","volume":" ","pages":"1-12"},"PeriodicalIF":1.0000,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Pediatric Ophthalmology & Strabismus","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.3928/01913913-20240911-05","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}
Citations: 0

Abstract

Purpose: To assess the appropriateness and readability of responses provided by four large language models (LLMs) (ChatGPT-4, Claude 3, Gemini, and Microsoft Copilot) to parents' queries pertaining to retinopathy of prematurity (ROP).

Methods: A total of 60 frequently asked questions were collated and categorized into six distinct sections. The responses generated by the LLMs were evaluated by three experienced ROP specialists to determine their appropriateness and comprehensiveness. Additionally, the readability of the responses was assessed using a range of metrics, including the Flesch-Kincaid Grade Level (FKGL), Gunning Fog (GF) Index, Coleman-Liau (CL) Index, Simple Measure of Gobbledygook (SMOG) Index, and Flesch Reading Ease (FRE) score.
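
For context, the readability indices named above are computed from surface features of the text (sentence length, syllable counts, letter counts). The sketch below, assuming the third-party Python package textstat (not part of the study's published methodology), shows how the five metrics could be obtained for a single model response.

```python
# Minimal sketch: computing the five readability metrics used in the study
# for one LLM response, assuming the third-party `textstat` package
# (pip install textstat). The sample text is illustrative only.
import textstat

response = (
    "Retinopathy of prematurity is an eye condition that can affect babies "
    "born early. Your ophthalmologist will examine the retina during a short "
    "screening exam. Most babies do not need treatment, but regular follow-up "
    "visits are important."
)

metrics = {
    "Flesch-Kincaid Grade Level (FKGL)": textstat.flesch_kincaid_grade(response),
    "Gunning Fog (GF) Index": textstat.gunning_fog(response),
    "Coleman-Liau (CL) Index": textstat.coleman_liau_index(response),
    "SMOG Index": textstat.smog_index(response),
    "Flesch Reading Ease (FRE)": textstat.flesch_reading_ease(response),
}

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Note that FKGL, GF, CL, and SMOG estimate the school grade level needed to understand the text (higher means harder), whereas FRE runs in the opposite direction (higher means easier).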

Results: ChatGPT-4 demonstrated the highest level of appropriateness (100%) and performed exceptionally well in the Likert analysis, scoring 5 points on 96% of questions. The CL Index and FRE scores identified Gemini as the most readable LLM, whereas the GF Index and SMOG Index rated Microsoft Copilot as the most readable. Nevertheless, ChatGPT-4 exhibited the most intricate text structure, with a GF Index of 18.56, a CL Index of 18.56, a SMOG Index of 17.2, and an FRE score of 9.45, suggesting that its responses demand college-level comprehension.
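
As a point of reference, the standard Flesch Reading Ease interpretation bands (general convention, not data from this study) place a score of 9.45 in the 0 to 30 range, i.e. very difficult text typically requiring a college-graduate reading level. A hypothetical helper illustrating the mapping:

```python
# Standard Flesch Reading Ease interpretation bands (hypothetical helper,
# not part of the study); higher scores mean easier text.
def fre_band(score: float) -> str:
    if score >= 90: return "Very easy (about 5th grade)"
    if score >= 70: return "Easy to fairly easy (6th-7th grade)"
    if score >= 60: return "Plain English (8th-9th grade)"
    if score >= 50: return "Fairly difficult (10th-12th grade)"
    if score >= 30: return "Difficult (college)"
    return "Very difficult (college graduate)"

print(fre_band(9.45))  # -> Very difficult (college graduate)
```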

Conclusions: ChatGPT-4 demonstrated higher performance than the other LLMs in responding to questions related to ROP; however, its texts were more complex. In terms of readability, Gemini and Microsoft Copilot were more successful. [J Pediatr Ophthalmol Strabismus. 20XX;XX(X):XXX-XXX.].

Source journal: Journal of Pediatric Ophthalmology & Strabismus
CiteScore: 1.80
Self-citation rate: 8.30%
Articles per year: 115
Review time: >12 weeks
Journal description: The Journal of Pediatric Ophthalmology & Strabismus is a bimonthly peer-reviewed publication for pediatric ophthalmologists. For over 50 years, the Journal has published original articles on the diagnosis, treatment, and prevention of eye disorders in the pediatric age group and on the treatment of strabismus in all age groups.