Performance of large language models on benign prostatic hyperplasia frequently asked questions.

Prostate; IF 2.6; CAS Zone 3 (Medicine); JCR Q3 (Endocrinology & Metabolism); Pub Date: 2024-06-01; Epub Date: 2024-04-01; DOI: 10.1002/pros.24699

YuNing Zhang, Yijie Dong, Zihan Mei, Yiqing Hou, Minyan Wei, Yat Hin Yeung, Jiale Xu, Qing Hua, LiMei Lai, Ning Li, ShuJun Xia, Chun Zhou, JianQiao Zhou

Prostate, pp. 807-813. Publication type: Journal Article.

Citations: 0

Abstract

Background: Benign prostatic hyperplasia (BPH) is a common condition, yet the average patient struggles to find credible, accurate information about it. We aimed to evaluate and compare the accuracy and reproducibility of three large language models (LLMs), ChatGPT-3.5, ChatGPT-4, and New Bing Chat, in responding to a questionnaire of BPH frequently asked questions (FAQs).

Methods: A total of 45 BPH-related questions were categorized as basic or professional knowledge. Three LLMs (ChatGPT-3.5, ChatGPT-4, and New Bing Chat) were used to generate responses to these questions. Each response was graded as comprehensive, correct but inadequate, mixed with incorrect/outdated data, or completely incorrect. Reproducibility was assessed by generating two responses for each question. All responses were reviewed and judged by experienced urologists.
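The grading scheme above can be illustrated with a minimal Python sketch. The abstract does not say which grades count as "accurate" or how two responses are judged reproducible, so both rules below are assumptions made purely for illustration, and the toy data is invented rather than taken from the study.

```python
from enum import Enum


class Grade(Enum):
    """The four grading categories named in the Methods."""
    COMPREHENSIVE = 1
    CORRECT_BUT_INADEQUATE = 2
    MIXED_WITH_INCORRECT_OR_OUTDATED = 3
    COMPLETELY_INCORRECT = 4


# ASSUMPTION (not stated in the abstract): the two "correct" grades count as accurate.
ACCURATE = {Grade.COMPREHENSIVE, Grade.CORRECT_BUT_INADEQUATE}


def accuracy_rate(grades):
    """Percentage of responses whose grade falls in the ACCURATE set."""
    return 100.0 * sum(g in ACCURATE for g in grades) / len(grades)


def reproducibility_rate(pairs):
    """Percentage of questions whose two responses agree.

    ASSUMPTION: two responses "agree" when both are accurate or both are not.
    """
    agree = sum((a in ACCURATE) == (b in ACCURATE) for a, b in pairs)
    return 100.0 * agree / len(pairs)


# Invented toy data: (first response grade, second response grade) per question.
toy_pairs = [
    (Grade.COMPREHENSIVE, Grade.COMPREHENSIVE),
    (Grade.COMPREHENSIVE, Grade.CORRECT_BUT_INADEQUATE),
    (Grade.MIXED_WITH_INCORRECT_OR_OUTDATED, Grade.COMPREHENSIVE),
    (Grade.COMPLETELY_INCORRECT, Grade.MIXED_WITH_INCORRECT_OR_OUTDATED),
]
first_responses = [a for a, _ in toy_pairs]

print(accuracy_rate(first_responses))   # 50.0
print(reproducibility_rate(toy_pairs))  # 75.0
```

A stricter reading of reproducibility would require both responses to receive the identical grade; under that rule the second toy pair would no longer count as agreeing.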

Results: All three LLMs generated highly accurate responses, with accuracy rates ranging from 86.7% to 100%. There was no statistically significant difference in response accuracy among the three (p > 0.017 for all comparisons). Accuracy on the basic knowledge questions was roughly equivalent to accuracy on the professional knowledge questions, differing by less than 3.5 percentage points for every model (GPT-3.5: 90% vs. 86.7%; GPT-4: 96.7% vs. 95.6%; New Bing: 96.7% vs. 93.3%). All three LLMs also demonstrated high reproducibility, with rates ranging from 93.3% to 97.8%.
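The arithmetic behind these figures can be checked with a short Python sketch. The accuracy pairs are copied from the Results; the Bonferroni interpretation of the 0.017 threshold is an assumption, since the abstract does not name the correction used, although 0.05 divided across three pairwise comparisons does round to that value.

```python
# Reported accuracy (%) as (basic knowledge, professional knowledge), from the Results.
reported = {
    "GPT-3.5": (90.0, 86.7),
    "GPT-4": (96.7, 95.6),
    "New Bing": (96.7, 93.3),
}

# Gap between the two question categories, in percentage points.
gaps = {model: round(basic - prof, 1) for model, (basic, prof) in reported.items()}
print(gaps)  # {'GPT-3.5': 3.3, 'GPT-4': 1.1, 'New Bing': 3.4}

# ASSUMPTION: the 0.017 threshold is a Bonferroni-adjusted alpha for the
# three pairwise model comparisons (0.05 / 3).
alpha = 0.05
n_comparisons = 3
threshold = alpha / n_comparisons
print(round(threshold, 3))  # 0.017
```

Every gap stays below 3.5 percentage points, which matches the claim that the two question categories were answered about equally well.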

Conclusions: ChatGPT-3.5, ChatGPT-4, and New Bing Chat offer accurate and reproducible responses to BPH-related questions, establishing them as valuable resources for enhancing health literacy and supporting BPH patients in conjunction with healthcare professionals.

Source journal: Prostate (Medicine: Urology & Nephrology)

CiteScore: 5.10

Self-citation rate: 3.60%

Articles published: 180

Review time: 1.5 months

About the journal: The Prostate is a peer-reviewed journal dedicated to original studies of this organ and the male accessory glands. It serves as an international medium for these studies, presenting comprehensive coverage of clinical, anatomic, embryologic, physiologic, endocrinologic, and biochemical studies.