Quality of Chatbot Information Related to Benign Prostatic Hyperplasia

Christopher J Warren, Nicolette G Payne, Victoria S Edmonds, Sandeep S Voleti, Mouneeb M Choudry, Nahid Punjani, Haider M Abdul-Muhsin, Mitchell R Humphreys

The Prostate, published online November 8, 2024. DOI: 10.1002/pros.24814
Abstract
Background: Large language model (LLM) chatbots, a form of artificial intelligence (AI) that excels at prompt-based interactions and mimics human conversation, have emerged as a tool for providing patients with information about urologic conditions. We aimed to examine the quality of information related to benign prostatic hyperplasia surgery from four chatbots and how they would respond to sample patient messages.
Methods: We identified the top three queries in Google Trends related to "treatment for enlarged prostate." These were entered into ChatGPT (OpenAI), Bard (Google), Bing AI (Microsoft), and Doximity GPT (Doximity), both unprompted and with prompts specifying quality criteria (the optimized condition). Three urologists evaluated the overall quality of each chatbot's answers using the DISCERN instrument. Readability was measured with the Flesch-Kincaid reading-level tool built into Microsoft Word. To assess the chatbots' ability to answer patient questions, we prompted them with a clinical scenario related to holmium laser enucleation of the prostate, followed by the 10 questions the National Institutes of Health recommends patients ask before surgery. Accuracy and completeness of responses were graded on Likert scales.
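For context, the Flesch-Kincaid grade level used here is a fixed formula over sentence, word, and syllable counts: 0.39 * (words/sentence) + 11.8 * (syllables/word) - 15.59. A minimal Python sketch follows; the vowel-group syllable counter is a rough heuristic introduced for illustration (not part of the study), so its scores will differ slightly from Microsoft Word's built-in tool.

    import re

    def flesch_kincaid_grade(text: str) -> float:
        """Flesch-Kincaid grade level: 0.39*(words/sentence) + 11.8*(syllables/word) - 15.59."""
        sentences = max(1, len(re.findall(r"[.!?]+", text)))  # runs of terminal punctuation
        words = re.findall(r"[A-Za-z']+", text)
        n_words = max(1, len(words))

        def syllables(word: str) -> int:
            # Rough heuristic: each contiguous vowel group counts as one syllable, minimum one.
            return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

        total_syllables = sum(syllables(w) for w in words)
        return 0.39 * (n_words / sentences) + 11.8 * (total_syllables / n_words) - 15.59

    # Very simple sentences can score at or below grade 0; dense clinical prose scores far higher.
    print(round(flesch_kincaid_grade("The cat sat on the mat."), 1))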
Results: Without prompting, the quality of information was moderate across all chatbots but improved significantly with prompting (mean [SD], 3.3 [1.2] vs. 4.4 [0.7] out of 5; p < 0.001). When answering simulated patient messages, the chatbots were accurate (mean [SD], 5.6 [0.4] out of 6) and complete (mean [SD], 2.8 [0.3] out of 3), and 98% of responses (39/40) had a median accuracy score of 5 or higher, corresponding to "nearly all correct." Readability was poor, with a mean (SD) Flesch-Kincaid reading level of grade 12.1 (1.3) for unprompted responses.
Conclusions: LLM chatbots hold promise for patient education, but their effectiveness is limited by the need for careful prompting from the user and by responses written at a reading level above that of most Americans (grade 8). Educating patients and physicians on optimal LLM interaction is crucial to unlocking the full potential of chatbots.
About the Journal
The Prostate is a peer-reviewed journal dedicated to original studies of the prostate and the male accessory glands. It serves as an international medium for these studies, presenting comprehensive coverage of clinical, anatomic, embryologic, physiologic, endocrinologic, and biochemical research.