Performance of large language models on benign prostatic hyperplasia frequently asked questions.

IF 2.6 | Zone 3 (Medicine) | Q3 ENDOCRINOLOGY & METABOLISM | Prostate | Pub Date: 2024-06-01 | Epub Date: 2024-04-01 | DOI: 10.1002/pros.24699
YuNing Zhang, Yijie Dong, Zihan Mei, Yiqing Hou, Minyan Wei, Yat Hin Yeung, Jiale Xu, Qing Hua, LiMei Lai, Ning Li, ShuJun Xia, Chun Zhou, JianQiao Zhou
Prostate, pages 807-813. Publication type: Journal Article.
Citation count: 0

Abstract

Background: Benign prostatic hyperplasia (BPH) is a common condition, yet the average patient struggles to find credible, accurate information about it. We aimed to evaluate and compare the accuracy and reproducibility of three large language models (LLMs), ChatGPT-3.5, ChatGPT-4, and New Bing Chat, in answering a questionnaire of BPH frequently asked questions (FAQs).

Methods: A total of 45 questions related to BPH were categorized into basic and professional knowledge. Three LLMs (ChatGPT-3.5, ChatGPT-4, and New Bing Chat) were used to generate responses to these questions. Responses were graded as comprehensive, correct but inadequate, mixed with incorrect/outdated data, or completely incorrect. Reproducibility was assessed by generating two responses for each question. All responses were reviewed and judged by experienced urologists.
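The grading procedure described in the Methods can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say which grades count as "accurate" or how a pair of responses is judged reproducible, so the sketch assumes that the two "correct" grades count as accurate and that a pair is reproducible when both responses fall on the same side of that cut. The numeric codes, function names, and sample grades are all hypothetical.

```python
from enum import IntEnum


class Grade(IntEnum):
    """Four-level scale named in the Methods; the numeric codes are an assumption."""
    COMPREHENSIVE = 1
    CORRECT_BUT_INADEQUATE = 2
    MIXED_INCORRECT_OR_OUTDATED = 3
    COMPLETELY_INCORRECT = 4


def is_accurate(grade):
    # Assumption: "comprehensive" and "correct but inadequate" both count as accurate.
    return grade <= Grade.CORRECT_BUT_INADEQUATE


def accuracy_rate(grades):
    """Fraction of graded responses that count as accurate."""
    return sum(is_accurate(g) for g in grades) / len(grades)


def reproducibility_rate(pairs):
    """Fraction of questions whose two responses agree on accurate vs. inaccurate."""
    # Assumption: agreement is judged on the accurate/inaccurate split,
    # not on the exact grade.
    return sum(is_accurate(a) == is_accurate(b) for a, b in pairs) / len(pairs)


# Hypothetical grades for four questions, two responses each (not study data).
pairs = [
    (Grade.COMPREHENSIVE, Grade.COMPREHENSIVE),
    (Grade.COMPREHENSIVE, Grade.CORRECT_BUT_INADEQUATE),
    (Grade.CORRECT_BUT_INADEQUATE, Grade.MIXED_INCORRECT_OR_OUTDATED),
    (Grade.COMPLETELY_INCORRECT, Grade.COMPLETELY_INCORRECT),
]
first = [a for a, _ in pairs]

print(accuracy_rate(first))         # 0.75
print(reproducibility_rate(pairs))  # 0.75
```

A stricter definition of reproducibility, such as requiring identical grades, would only change the comparison inside `reproducibility_rate`.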

Results: All three LLMs answered with high accuracy, with rates ranging from 86.7% to 100%. Response accuracy did not differ significantly among the three (p > 0.017 for all comparisons). Accuracy on the basic knowledge questions was roughly equivalent to accuracy on the professional knowledge questions, differing by less than 3.5 percentage points (GPT-3.5: 90% vs. 86.7%; GPT-4: 96.7% vs. 95.6%; New Bing: 96.7% vs. 93.3%). All three LLMs also demonstrated high reproducibility, with rates ranging from 93.3% to 97.8%.
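The arithmetic behind these figures can be checked with a short sketch. The percentages are the ones reported above. The abstract does not explain where the 0.017 threshold comes from. It matches a Bonferroni-style correction of a 0.05 significance level across the three pairwise model comparisons, and the code treats that reading as an assumption rather than the authors' stated method.

```python
from math import comb

# Reported accuracy (%), basic vs. professional knowledge questions.
reported = {
    "GPT-3.5": (90.0, 86.7),
    "GPT-4": (96.7, 95.6),
    "New Bing": (96.7, 93.3),
}

# Gap in percentage points for each model.
gaps = {model: round(basic - prof, 1) for model, (basic, prof) in reported.items()}
print(gaps)  # {'GPT-3.5': 3.3, 'GPT-4': 1.1, 'New Bing': 3.4}

# Assumption: 0.017 is a 0.05 family-wise level split across pairwise comparisons.
n_models = 3
n_comparisons = comb(n_models, 2)      # 3 pairwise comparisons
per_test_alpha = 0.05 / n_comparisons
print(round(per_test_alpha, 3))        # 0.017
```

Every gap is below 3.5 percentage points, which agrees with the statement in the Results.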

Conclusions: ChatGPT-3.5, ChatGPT-4, and New Bing Chat gave accurate and reproducible responses to BPH-related questions, which makes them valuable resources for improving health literacy and for supporting BPH patients alongside healthcare professionals.

Source journal: Prostate
Category: Medicine, Urology & Nephrology
CiteScore: 5.10
Self-citation rate: 3.60%
Articles published: 180
Review time: 1.5 months
Journal description: The Prostate is a peer-reviewed journal dedicated to original studies of this organ and the male accessory glands. It serves as an international medium for these studies, presenting comprehensive coverage of clinical, anatomic, embryologic, physiologic, endocrinologic, and biochemical studies.