Prompt matters: evaluation of large language model chatbot responses related to Peyronie's disease.

IF 2, Zone 3 (Medicine), Q1 MEDICINE, GENERAL & INTERNAL. Sexual Medicine. Pub Date: 2024-09-09. eCollection Date: 2024-08-01. DOI: 10.1093/sexmed/qfae055

Christopher J Warren, Victoria S Edmonds, Nicolette G Payne, Sandeep Voletti, Sarah Y Wu, JennaKay Colquitt, Hossein Sadeghi-Nejad, Nahid Punjani

Sexual Medicine, volume 12, issue 4, article qfae055. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11384107/pdf/

Citations: 0

Abstract


Prompt matters: evaluation of large language model chatbot responses related to Peyronie's disease.

Introduction: Despite direct access to clinicians through the electronic health record, patients are increasingly turning to the internet for information related to their health, especially with sensitive urologic conditions such as Peyronie's disease (PD). Large language model (LLM) chatbots are a form of artificial intelligence that rely on user prompts to mimic conversation, and they have shown remarkable capabilities. The conversational nature of these chatbots has the potential to answer patient questions related to PD; however, the accuracy, comprehensiveness, and readability of these LLMs related to PD remain unknown.

Aims: To assess the quality and readability of information generated from 4 LLMs with searches related to PD; to see if users could improve responses; and to assess the accuracy, completeness, and readability of responses to artificial preoperative patient questions sent through the electronic health record prior to undergoing PD surgery.

Methods: The National Institutes of Health's frequently asked questions related to PD were entered into 4 LLMs, unprompted and prompted. The responses were evaluated for overall quality by the previously validated DISCERN questionnaire. Accuracy and completeness of LLM responses to 11 presurgical patient messages were evaluated with previously accepted Likert scales. All evaluations were performed by 3 independent reviewers in October 2023, and all reviews were repeated in April 2024. Descriptive statistics and analysis were performed.

Results: Without prompting, the quality of information was moderate across all LLMs but improved to high quality with prompting. LLMs were accurate and complete, with an average score of 5.5 of 6.0 (SD, 0.8) and 2.8 of 3.0 (SD, 0.4), respectively. The average Flesch-Kincaid reading level was grade 12.9 (SD, 2.1). Chatbots were unable to communicate at a grade 8 reading level when prompted, and their citations were appropriate only 42.5% of the time.
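The abstract does not say which tool was used to compute reading level, so the following is only a minimal sketch of the standard Flesch-Kincaid grade-level formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The function names `count_syllables` and `flesch_kincaid_grade` are hypothetical, and the syllable counter is a crude vowel-group heuristic assumed here for illustration; dedicated readability tools use more careful rules and may give somewhat different grades.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (an assumption, not the study's method):
    count runs of consecutive vowels, then discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a final "e" as silent (e.g. "disease"), except in "-le" endings.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level of an English text."""
    # Sentences are split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


simple = "The cat sat on the mat."
dense = "Peyronie's disease management necessitates individualized multidisciplinary evaluation."
print(round(flesch_kincaid_grade(simple), 2))
print(flesch_kincaid_grade(dense) > flesch_kincaid_grade(simple))
```

Under this formula, longer sentences and more syllables per word both raise the grade, which is why a response written in dense clinical vocabulary can land well above a grade 8 target even when its sentences are short.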

Conclusion: LLMs may become a valuable tool for patient education for PD, but they currently rely on clinical context and appropriate prompting by humans to be useful. Unfortunately, their prerequisite reading level remains higher than that of the average patient, and their citations cannot be trusted. However, given their increasing uptake and accessibility, patients and physicians should be educated on how to interact with these LLMs to elicit the most appropriate responses. In the future, LLMs may reduce burnout by helping physicians respond to patient messages.


Source journal: Sexual Medicine (MEDICINE, GENERAL & INTERNAL)
CiteScore: 5.40
Self-citation rate: 0.00%
Articles published: 103
Review time: 22 weeks

About the journal: Sexual Medicine is an official publication of the International Society for Sexual Medicine, and serves the field as the peer-reviewed, open access journal for rapid dissemination of multidisciplinary clinical and basic research in all areas of global sexual medicine, and particularly acts as a venue for topics of regional or sub-specialty interest. The journal is focused on issues in clinical medicine and epidemiology but also publishes basic science papers with particular relevance to specific populations. Sexual Medicine offers clinicians and researchers a rapid route to publication and the opportunity to publish in a broadly distributed and highly visible global forum. The journal publishes high-quality articles from all over the world and actively seeks submissions from countries with expanding sexual medicine communities. Sexual Medicine relies on the same expert panel of editors and reviewers as The Journal of Sexual Medicine and Sexual Medicine Reviews.