Performance of large language models on benign prostatic hyperplasia frequently asked questions
YuNing Zhang, Yijie Dong, Zihan Mei, Yiqing Hou, Minyan Wei, Yat Hin Yeung, Jiale Xu, Qing Hua, LiMei Lai, Ning Li, ShuJun Xia, Chun Zhou, JianQiao Zhou
The Prostate, 807-813 (2024). DOI: 10.1002/pros.24699
Abstract
Background: Benign prostatic hyperplasia (BPH) is a common condition, yet it is challenging for the average BPH patient to find credible and accurate information about it. Our goal was to evaluate and compare the accuracy and reproducibility of three large language models (LLMs), ChatGPT-3.5, ChatGPT-4, and New Bing Chat, in responding to a questionnaire of frequently asked questions (FAQs) about BPH.
Methods: A total of 45 BPH-related questions were categorized as basic or professional knowledge. Three LLMs (ChatGPT-3.5, ChatGPT-4, and New Bing Chat) were used to generate responses to these questions. Responses were graded as comprehensive, correct but inadequate, mixed with incorrect/outdated data, or completely incorrect. Reproducibility was assessed by generating two responses for each question. All responses were reviewed and judged by experienced urologists.
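As a rough illustration of that protocol, the sketch below collects two responses per question per model so that reproducibility can later be judged alongside accuracy. The `ask_model` helper and the record layout are assumptions for illustration only; the paper does not publish code, and none of the three chatbots is queried through this exact interface.

```python
# A minimal sketch of the two-responses-per-question protocol described
# above. `ask_model` is a hypothetical placeholder: the study's chatbots
# (ChatGPT-3.5, ChatGPT-4, New Bing Chat) are not queried through this
# exact interface.
from dataclasses import dataclass, field

# Grading scale reported in the Methods.
GRADES = (
    "comprehensive",
    "correct but inadequate",
    "mixed with incorrect/outdated data",
    "completely incorrect",
)

@dataclass
class QuestionRecord:
    question: str
    category: str                       # "basic" or "professional"
    responses: list[str] = field(default_factory=list)
    grades: list[str] = field(default_factory=list)  # later assigned by urologists

def ask_model(model_name: str, question: str) -> str:
    """Hypothetical API call; swap in the real client for each chatbot."""
    raise NotImplementedError

def collect_responses(model_name: str, questions: list[tuple[str, str]]) -> list[QuestionRecord]:
    records = []
    for question, category in questions:
        record = QuestionRecord(question=question, category=category)
        for _ in range(2):              # two generations per question for reproducibility
            record.responses.append(ask_model(model_name, question))
        records.append(record)
    return records
```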
Results: All three LLMs exhibited high accuracy in responding to the questions, with accuracy rates ranging from 86.7% to 100%. However, there was no statistically significant difference in response accuracy among the three models (p > 0.017 for all pairwise comparisons). Additionally, accuracy on the basic knowledge questions was roughly equivalent to accuracy on the professional knowledge questions, with a difference of less than 3.5% (GPT-3.5: 90% vs. 86.7%; GPT-4: 96.7% vs. 95.6%; New Bing: 96.7% vs. 93.3%). Furthermore, all three LLMs demonstrated high reproducibility, with rates ranging from 93.3% to 97.8%.
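The p > 0.017 threshold is consistent with a Bonferroni correction of alpha = 0.05 over three pairwise comparisons (0.05/3 is about 0.0167). The sketch below works through that comparison, assuming Fisher's exact test on correct/incorrect counts; the abstract does not name the test used, and the counts are illustrative, not the study's data.

```python
# A sketch of a Bonferroni-corrected pairwise comparison consistent with
# the p > 0.017 threshold (0.05 / 3 pairwise comparisons = 0.0167).
# Fisher's exact test is an assumption (the abstract does not name the
# test), and the correct-answer counts are illustrative, not study data.
from itertools import combinations
from scipy.stats import fisher_exact

N_QUESTIONS = 45
correct = {"ChatGPT-3.5": 40, "ChatGPT-4": 43, "New Bing Chat": 43}  # assumed counts

pairs = list(combinations(correct, 2))
alpha = 0.05 / len(pairs)  # Bonferroni-adjusted significance threshold, about 0.017
for a, b in pairs:
    table = [
        [correct[a], N_QUESTIONS - correct[a]],
        [correct[b], N_QUESTIONS - correct[b]],
    ]
    _, p_value = fisher_exact(table)
    verdict = "significant" if p_value < alpha else "not significant"
    print(f"{a} vs. {b}: p = {p_value:.3f} ({verdict} at alpha = {alpha:.3f})")
```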
Conclusions: ChatGPT-3.5, ChatGPT-4, and New Bing Chat offer accurate and reproducible responses to BPH-related questions, making them valuable resources for enhancing health literacy and for supporting BPH patients alongside healthcare professionals.
About the journal
The Prostate is a peer-reviewed journal dedicated to original studies of this organ and the male accessory glands. It serves as an international medium for these studies, presenting comprehensive coverage of clinical, anatomic, embryologic, physiologic, endocrinologic, and biochemical studies.