A comparative analysis of generative artificial intelligence responses from leading chatbots to questions about endometriosis

Natalie D. Cohen MD, Milan Ho BS, Donald McIntire PhD, Katherine Smith MD, Kimberly A. Kho MD
{"title":"A comparative analysis of generative artificial intelligence responses from leading chatbots to questions about endometriosis","authors":"Natalie D. Cohen MD,&nbsp;Milan Ho BS,&nbsp;Donald McIntire PhD,&nbsp;Katherine Smith MD,&nbsp;Kimberly A. Kho MD","doi":"10.1016/j.xagr.2024.100405","DOIUrl":null,"url":null,"abstract":"<div><h3>Introduction</h3><div>The use of generative artificial intelligence (AI) has begun to permeate most industries, including medicine, and patients will inevitably start using these large language model (LLM) chatbots as a modality for education. As healthcare information technology evolves, it is imperative to evaluate chatbots and the accuracy of the information they provide to patients and to determine if there is variability between them.</div></div><div><h3>Objective</h3><div>This study aimed to evaluate the accuracy and comprehensiveness of three chatbots in addressing questions related to endometriosis and determine the level of variability between them.</div></div><div><h3>Study Design</h3><div>Three LLMs, including Chat GPT-4 (Open AI), Claude (Anthropic), and Bard (Google) were asked to generate answers to 10 commonly asked questions about endometriosis. The responses were qualitatively compared to current guidelines and expert opinion on endometriosis and rated on a scale by nine gynecologists. The grading scale included the following: (1) Completely incorrect, (2) mostly incorrect and some correct, (3) mostly correct and some incorrect, (4) correct but inadequate, (5) correct and comprehensive. Final scores were averaged between the nine reviewers. Kendall's <em>W</em> and the related chi-square test were used to evaluate the reviewers’ strength of agreement in ranking the LLMs’ responses for each item.</div></div><div><h3>Results</h3><div>Average scores for the 10 answers amongst Bard, Chat GPT, and Claude were 3.69, 4.24, and 3.7, respectively. Two questions showed significant disagreement between the nine reviewers. There were no questions the models could answer comprehensively or correctly across the reviewers. The model most associated with comprehensive and correct responses was ChatGPT. Chatbots showed an improved ability to accurately answer questions about symptoms and pathophysiology over treatment and risk of recurrence.</div></div><div><h3>Conclusion</h3><div>The analysis of LLMs revealed that, on average, they mainly provided correct but inadequate responses to commonly asked patient questions about endometriosis. While chatbot responses can serve as valuable supplements to information provided by licensed medical professionals, it is crucial to maintain a thorough ongoing evaluation process for outputs to provide the most comprehensive and accurate information to patients. 
Further research into this technology and its role in patient education and treatment is crucial as generative AI becomes more embedded in the medical field.</div></div>","PeriodicalId":72141,"journal":{"name":"AJOG global reports","volume":"5 1","pages":"Article 100405"},"PeriodicalIF":0.0000,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11730533/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"AJOG global reports","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666577824000996","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Introduction

The use of generative artificial intelligence (AI) has begun to permeate most industries, including medicine, and patients will inevitably turn to large language model (LLM) chatbots as a source of health education. As healthcare information technology evolves, it is imperative to evaluate these chatbots, assess the accuracy of the information they provide to patients, and determine whether that information varies between models.

Objective

This study aimed to evaluate the accuracy and comprehensiveness of three chatbots in addressing questions related to endometriosis and to determine the degree of variability among them.

Study Design

Three LLMs, ChatGPT-4 (OpenAI), Claude (Anthropic), and Bard (Google), were asked to generate answers to 10 commonly asked questions about endometriosis. The responses were qualitatively compared with current guidelines and expert opinion on endometriosis and rated on a scale by nine gynecologists. The grading scale was as follows: (1) completely incorrect; (2) mostly incorrect and some correct; (3) mostly correct and some incorrect; (4) correct but inadequate; (5) correct and comprehensive. Final scores were averaged across the nine reviewers. Kendall's W and the related chi-square test were used to evaluate the reviewers' strength of agreement in ranking the LLMs' responses for each item.
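
For context, Kendall's W measures concordance among m raters ranking n items: W = 12S / (m²(n³ − n)), where S is the sum of squared deviations of the items' rank sums from their mean, and m(n − 1)W is approximately chi-square distributed with n − 1 degrees of freedom. The sketch below is not part of the study; it shows how this statistic could be computed for a single question, using hypothetical rankings of the three chatbot responses by nine reviewers.

```python
import numpy as np
from scipy.stats import chi2

def kendalls_w(ranks):
    """Kendall's coefficient of concordance W (no tie correction) for an
    (m raters x n items) matrix of ranks, plus the related chi-square test:
    W = 12*S / (m^2 * (n^3 - n)), chi2 = m*(n-1)*W with n-1 degrees of freedom.
    """
    ranks = np.asarray(ranks, dtype=float)
    m, n = ranks.shape                        # m raters, n items ranked
    rank_sums = ranks.sum(axis=0)             # rank sum R_j for each item
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    w = 12.0 * s / (m ** 2 * (n ** 3 - n))
    chi_sq = m * (n - 1) * w
    p = chi2.sf(chi_sq, df=n - 1)
    return w, chi_sq, p

# Hypothetical example (not the study's data): nine reviewers each rank the
# three chatbot responses to one question, 1 = best, 3 = worst.
ranks = [
    [1, 2, 3], [1, 2, 3], [2, 1, 3], [1, 2, 3], [1, 3, 2],
    [1, 2, 3], [2, 1, 3], [1, 2, 3], [1, 2, 3],
]
w, chi_sq, p = kendalls_w(ranks)
print(f"W = {w:.2f}, chi-square = {chi_sq:.2f}, p = {p:.4f}")
```

A W near 1 indicates strong agreement among reviewers on the relative ranking of the three models; a significant chi-square result rejects the null hypothesis of no agreement.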

Results

Average scores across the 10 answers for Bard, ChatGPT, and Claude were 3.69, 4.24, and 3.70, respectively. Two questions showed significant disagreement among the nine reviewers. No question was answered both correctly and comprehensively by all models across the reviewers. The model most associated with comprehensive and correct responses was ChatGPT. The chatbots answered questions about symptoms and pathophysiology more accurately than questions about treatment and risk of recurrence.
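
As a worked illustration of the aggregation behind these per-model averages, ratings on the 1-to-5 scale can be averaged over reviewers for each question and then over the 10 questions. The matrix below is hypothetical, not the study's data.

```python
import numpy as np

# Hypothetical 9 reviewers x 10 questions rating matrix for one model,
# on the 1-5 scale described above (not the study's actual ratings).
ratings = np.array([
    [4, 5, 4, 3, 4, 5, 4, 4, 3, 4],
    [5, 4, 4, 4, 5, 4, 3, 4, 4, 4],
    [4, 4, 5, 4, 4, 4, 4, 3, 4, 5],
    [4, 5, 4, 4, 4, 5, 4, 4, 4, 4],
    [5, 4, 4, 3, 4, 4, 4, 4, 3, 4],
    [4, 4, 4, 4, 5, 4, 4, 4, 4, 4],
    [4, 5, 5, 4, 4, 4, 3, 4, 4, 4],
    [5, 4, 4, 4, 4, 5, 4, 4, 4, 4],
    [4, 4, 4, 4, 4, 4, 4, 4, 4, 5],
])

per_question = ratings.mean(axis=0)   # average over the 9 reviewers
model_score = per_question.mean()     # then average over the 10 questions
print(per_question.round(2))
print(f"overall model score: {model_score:.2f}")
```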

Conclusion

The analysis of LLMs revealed that, on average, they mainly provided correct but inadequate responses to commonly asked patient questions about endometriosis. While chatbot responses can serve as valuable supplements to information provided by licensed medical professionals, it is crucial to maintain thorough, ongoing evaluation of their outputs so that patients receive the most comprehensive and accurate information. Further research into this technology and its role in patient education and treatment is essential as generative AI becomes more embedded in the medical field.