Utilization of ChatGPT for Rhinology Patient Education: Limitations in a Surgical Sub-Specialty.

OTO Open (IF 1.8, Q2 Otorhinolaryngology). Pub Date: 2025-01-07; eCollection Date: 2025-01-01. DOI: 10.1002/oto2.70065
Alice E Huang, Michael T Chang, Ashoke Khanwalkar, Carol H Yan, Katie M Phillips, Michael J Yong, Jayakar V Nayak, Peter H Hwang, Zara M Patel

Abstract

Objective: To analyze the accuracy of ChatGPT-generated responses to common rhinologic patient questions.

Methods: Ten common questions from rhinology patients were compiled by a panel of 4 rhinology fellowship-trained surgeons based on clinical patient experience. This panel (Panel 1) developed consensus "expert" responses to each question. Questions were individually posed to ChatGPT (version 3.5) and its responses recorded. ChatGPT-generated responses were individually graded by Panel 1 on a scale of 0 (incorrect) to 3 (correct and exceeding the quality of expert responses). A second panel was given the consensus and ChatGPT responses to each question and asked to guess which response corresponded to which source. They then graded ChatGPT responses using the same criteria as Panel 1. Question-specific and overall mean grades for ChatGPT responses, as well as the intraclass correlation coefficient (ICC) as a measure of interrater reliability, were calculated.
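The abstract does not specify which ICC model the authors used, but a two-way random-effects, absolute-agreement, single-rater ICC(2,1) is a common choice for this study design (each question graded by every rater). As a minimal sketch of how such a coefficient is computed from a questions-by-raters grade matrix (all function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def icc2_1(X):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    X is an (n_subjects, k_raters) array of grades, e.g. 10 questions
    graded 0-3 by 4 raters. Mean squares come from the standard two-way
    ANOVA decomposition (subjects x raters, one observation per cell).
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    grand = X.mean()
    row_means = X.mean(axis=1)   # per-question means
    col_means = X.mean(axis=0)   # per-rater means
    # Between-subjects, between-raters, and residual mean squares
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)
    sse = np.sum((X - row_means[:, None] - col_means[None, :] + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical grades: 10 questions x 4 raters (NOT the study's data)
grades = np.array([
    [2, 2, 1, 2], [1, 0, 1, 1], [3, 2, 3, 3], [1, 1, 2, 1], [2, 3, 2, 2],
    [0, 1, 0, 1], [2, 2, 2, 1], [1, 2, 1, 1], [3, 3, 2, 3], [1, 1, 1, 2],
])
print(f"ICC(2,1) = {icc2_1(grades):.3f}")
```

With the study's actual grade matrix, this calculation would yield the reported overall ICC of 0.526; conventional cut-offs place values between 0.5 and 0.75 in the "moderate reliability" range, matching the authors' interpretation.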

Results: The overall mean grade for ChatGPT responses was 1.65/3. For 2 out of 10 questions, ChatGPT responses were equal to or better than expert responses. However, for the remaining 8 questions, mean rater grades indicated that ChatGPT's responses were incorrect, false, or incomplete. The overall ICC was 0.526, indicating moderate interrater reliability for grades of ChatGPT responses. Reviewers were able to distinguish ChatGPT from human responses with 97.5% accuracy.

Conclusion: This preliminary study shows that ChatGPT provided largely complete but variably accurate responses to common rhinologic questions, highlighting important limitations in nuanced subspecialty fields.

Source journal: OTO Open (Medicine-Surgery). CiteScore: 2.70. Self-citation rate: 0.00%. Articles published: 115. Review time: 15 weeks.