Assessing the Responses of Large Language Models (ChatGPT-4, Claude 3, Gemini, and Microsoft Copilot) to Frequently Asked Questions in Retinopathy of Prematurity: A Study on Readability and Appropriateness.
Authors: Serhat Ermis, Ece Özal, Murat Karapapak, Ebrar Kumantaş, Sadık Altan Özal
DOI: 10.3928/01913913-20240911-05
Journal of Pediatric Ophthalmology & Strabismus, pp. 1-12. Published 2024-10-28. Journal Article; JCR Q4 (Ophthalmology), Impact Factor 1.0.
Citations: 0
Abstract
Purpose: To assess the appropriateness and readability of responses provided by four large language models (LLMs) (ChatGPT-4, Claude 3, Gemini, and Microsoft Copilot) to parents' queries pertaining to retinopathy of prematurity (ROP).
Methods: A total of 60 frequently asked questions were collated and categorized into six distinct sections. The responses generated by the LLMs were evaluated by three experienced ROP specialists to determine their appropriateness and comprehensiveness. Additionally, the readability of the responses was assessed using a range of metrics, including the Flesch-Kincaid Grade Level (FKGL), Gunning Fog (GF) Index, Coleman-Liau (CL) Index, Simple Measure of Gobbledygook (SMOG) Index, and Flesch Reading Ease (FRE) score.
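The readability formulas named above (FRE, FKGL, GF, SMOG, CL) are all simple functions of sentence, word, syllable, and letter counts. As a hedged illustration only (the study almost certainly used established readability software rather than this code), the standard published formulas can be sketched in Python with a naive vowel-group syllable heuristic; `readability_scores` and `count_syllables` are hypothetical helper names, not from the paper:

```python
import re

def count_syllables(word):
    # Naive heuristic: count vowel groups, subtract a trailing silent 'e'.
    # Dedicated tools use dictionaries and are more accurate.
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability_scores(text):
    # Basic counts the formulas depend on.
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    w = max(len(words), 1)
    syllables = sum(count_syllables(wd) for wd in words)
    complex_words = sum(1 for wd in words if count_syllables(wd) >= 3)
    letters = sum(len(wd) for wd in words)

    # Standard published formulas for each index.
    fre = 206.835 - 1.015 * (w / sentences) - 84.6 * (syllables / w)
    fkgl = 0.39 * (w / sentences) + 11.8 * (syllables / w) - 15.59
    gf = 0.4 * ((w / sentences) + 100 * (complex_words / w))
    smog = 1.043 * (complex_words * (30 / sentences)) ** 0.5 + 3.1291
    cl = 0.0588 * (letters / w * 100) - 0.296 * (sentences / w * 100) - 15.8
    return {"FRE": fre, "FKGL": fkgl, "GF": gf, "SMOG": smog, "CL": cl}
```

Higher FRE means easier text, while the other four indexes approximate the US school grade level needed to understand it; this asymmetry explains why the metrics can disagree about which LLM is "most readable."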
Results: ChatGPT-4 demonstrated the highest level of appropriateness (100%) and performed exceptionally well in the Likert analysis, scoring 5 points on 96% of questions. The CL Index and FRE scores identified Gemini as the most readable LLM, whereas the GF Index and SMOG Index rated Microsoft Copilot as the most readable. However, ChatGPT-4 exhibited the most intricate text structure, with scores of 18.56 on the GF Index, 18.56 on the CL Index, 17.2 on the SMOG Index, and 9.45 on the FRE score, suggesting that its responses require college-level reading comprehension.
Conclusions: ChatGPT-4 demonstrated higher performance than other LLMs in responding to questions related to ROP; however, its texts were more complex. In terms of readability, Gemini and Microsoft Copilot were found to be more successful. [J Pediatr Ophthalmol Strabismus. 20XX;XX(X):XXX-XXX.].
About the Journal
The Journal of Pediatric Ophthalmology & Strabismus is a bimonthly peer-reviewed publication for pediatric ophthalmologists. The Journal has published original articles on the diagnosis, treatment, and prevention of eye disorders in the pediatric age group and the treatment of strabismus in all age groups for over 50 years.