To trust or not to trust: evaluating the reliability and safety of AI responses to laryngeal cancer queries.

European Archives of Oto-Rhino-Laryngology (IF 1.9; Zone 3, Medicine; JCR Q2, Otorhinolaryngology). Pub Date: 2024-11-01. Epub Date: 2024-04-23. DOI: 10.1007/s00405-024-08643-8
Magdalena Ostrowska, Paulina Kacała, Deborah Onolememen, Katie Vaughan-Lane, Anitta Sisily Joseph, Adam Ostrowski, Wioletta Pietruszewska, Jacek Banaszewski, Maciej J Wróbel

Abstract

Purpose: As online health information-seeking surges, concerns mount over the quality and safety of accessible content, which may harm patients through misinformation. On the one hand, the emergence of Artificial Intelligence (AI) in healthcare could prevent such harm; on the other hand, questions arise regarding the quality and safety of the medical information it provides. As laryngeal cancer is a prevalent head and neck malignancy, this study aims to evaluate the utility and safety of three large language models (LLMs) as sources of patient information about laryngeal cancer.

Methods: A cross-sectional study was conducted using three LLMs (ChatGPT 3.5, ChatGPT 4.0, and Bard). A questionnaire comprising 36 inquiries about laryngeal cancer was categorised into diagnosis (11 questions), treatment (9 questions), novelties and upcoming treatments (4 questions), controversies (8 questions), and sources of information (4 questions). The responses were graded by three groups of reviewers: ENT specialists, junior physicians, and non-medical reviewers. Each physician evaluated each question twice for each model, while non-medical reviewers evaluated each only once. All reviewers were blinded to the model type, and the question order was shuffled. Outcome evaluations were based on a safety score (1-3) and a Global Quality Score (GQS, 1-5). Results were compared between LLMs. The study included iterative assessments and statistical validations.
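
The grading workload implied by this design can be tallied with a short sketch. The category sizes come from the Methods; the per-reviewer totals are only what follows arithmetically if each model produced one graded response per question, which is an assumption the abstract does not state outright. Variable names are illustrative, not the authors'.

```python
# Question counts per category, as stated in the Methods.
categories = {
    "diagnosis": 11,
    "treatment": 9,
    "novelties and upcoming treatments": 4,
    "controversies": 8,
    "sources of information": 4,
}
models = ["ChatGPT 3.5", "ChatGPT 4.0", "Bard"]

total_questions = sum(categories.values())

# Assumption: one graded response per question per model.
responses_to_grade = total_questions * len(models)

# Physicians graded every response twice; non-medical reviewers once.
ratings_per_physician = responses_to_grade * 2
ratings_per_non_medical = responses_to_grade * 1

print(total_questions, responses_to_grade,
      ratings_per_physician, ratings_per_non_medical)
```

Under that assumption the five categories sum to the stated 36 questions, giving 108 responses to grade, 216 ratings per physician, and 108 per non-medical reviewer.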

Results: Analysis revealed that ChatGPT 3.5 scored highest in both safety (mean: 2.70) and GQS (mean: 3.95). ChatGPT 4.0 and Bard had lower safety scores of 2.56 and 2.42, respectively, with corresponding quality scores of 3.65 and 3.38. Inter-rater reliability was consistent, with less than 3% discrepancy. About 4.2% of responses fell into the lowest safety category (1), particularly in the novelty category. Non-medical reviewers' quality assessments correlated moderately (r = 0.67) with response length.
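
As a reading aid for the reported r = 0.67, the sketch below computes a correlation coefficient between response length and quality score. It assumes r denotes Pearson's coefficient, which the abstract does not specify, and the numbers are invented placeholders rather than study data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Placeholder values for illustration only, NOT data from the study:
# hypothetical response lengths (in words) and GQS ratings (1-5).
response_lengths = [120, 180, 240, 300, 360]
gqs_ratings = [2, 3, 3, 4, 5]

r = pearson_r(response_lengths, gqs_ratings)
print(f"r = {r:.2f}")
```

A value near 1 would mean longer answers are rated almost uniformly higher; the study's 0.67 indicates a positive but looser association.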

Conclusions: LLMs can be valuable resources for patients seeking information on laryngeal cancer. ChatGPT 3.5 provided the most reliable and safe responses among the models evaluated.
