Emily S Johnson, Eva K Welch, Jacqueline Kikuchi, Heather Barbier, Christine M Vaccaro, Felicia Balzano, Katherine L Dengler
{"title":"Use of ChatGPT to Generate Informed Consent for Surgery in Urogynecology.","authors":"Emily S Johnson, Eva K Welch, Jacqueline Kikuchi, Heather Barbier, Christine M Vaccaro, Felicia Balzano, Katherine L Dengler","doi":"10.1097/SPV.0000000000001638","DOIUrl":null,"url":null,"abstract":"<p><strong>Importance: </strong>Use of the publicly available Large Language Model, Chat Generative Pre-trained Transformer (ChatGPT 3.5; OpenAI, 2022), is growing in health care despite varying accuracies.</p><p><strong>Objective: </strong>The aim of this study was to assess the accuracy and readability of ChatGPT's responses to questions encompassing surgical informed consent in urogynecology.</p><p><strong>Study design: </strong>Five fellowship-trained urogynecology attending physicians and 1 reconstructive female urologist evaluated ChatGPT's responses to questions about 4 surgical procedures: (1) retropubic midurethral sling, (2) total vaginal hysterectomy, (3) uterosacral ligament suspension, and (4) sacrocolpopexy. Questions involved procedure descriptions, risks/benefits/alternatives, and additional resources. Responses were rated using the DISCERN tool, a 4-point accuracy scale, and the Flesch-Kinkaid Grade Level score.</p><p><strong>Results: </strong>The median DISCERN tool overall rating was 3 (interquartile range [IQR], 3-4), indicating a moderate rating (\"potentially important but not serious shortcomings\"). Retropubic midurethral sling received the highest overall score (median, 4; IQR, 3-4), and uterosacral ligament suspension received the lowest (median, 3; IQR, 3-3). Using the 4-point accuracy scale, 44.0% of responses received a score of 4 (\"correct and adequate\"), 22.6% received a score of 3 (\"correct but insufficient\"), 29.8% received a score of 2 (\"accurate and misleading information together\"), and 3.6% received a score of 1 (\"wrong or irrelevant answer\"). ChatGPT performance was poor for discussion of benefits and alternatives for all surgical procedures, with some responses being inaccurate. The mean Flesch-Kinkaid Grade Level score for all responses was 17.5 (SD, 2.1), corresponding to a postgraduate reading level.</p><p><strong>Conclusions: </strong>Overall, ChatGPT generated accurate responses to questions about surgical informed consent. However, it produced clearly false portions of responses, highlighting the need for a careful review of responses by qualified health care professionals.</p>","PeriodicalId":75288,"journal":{"name":"Urogynecology (Hagerstown, Md.)","volume":" ","pages":""},"PeriodicalIF":0.8000,"publicationDate":"2025-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Urogynecology (Hagerstown, Md.)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1097/SPV.0000000000001638","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"OBSTETRICS & GYNECOLOGY","Score":null,"Total":0}
Citations: 0
Abstract
Importance: Use of the publicly available large language model Chat Generative Pre-trained Transformer (ChatGPT 3.5; OpenAI, 2022) is growing in health care despite variable accuracy.
Objective: The aim of this study was to assess the accuracy and readability of ChatGPT's responses to questions encompassing surgical informed consent in urogynecology.
Study design: Five fellowship-trained urogynecology attending physicians and 1 reconstructive female urologist evaluated ChatGPT's responses to questions about 4 surgical procedures: (1) retropubic midurethral sling, (2) total vaginal hysterectomy, (3) uterosacral ligament suspension, and (4) sacrocolpopexy. Questions involved procedure descriptions, risks/benefits/alternatives, and additional resources. Responses were rated using the DISCERN tool, a 4-point accuracy scale, and the Flesch-Kincaid Grade Level score.
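For context, the Flesch-Kincaid Grade Level cited above follows the standard formula 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below is a minimal illustration of that calculation using a crude vowel-group syllable estimate; it is not the scoring method used in the study, where a validated readability tool would be applied.

```python
import re

def flesch_kincaid_grade(text: str) -> float:
    """Approximate Flesch-Kincaid Grade Level (illustrative only)."""
    # Count sentence-ending punctuation runs as sentences.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    # Crude syllable estimate: number of vowel groups per word, at least 1.
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59

# Example: a score around 17-18 corresponds to a postgraduate reading level.
print(round(flesch_kincaid_grade("The uterosacral ligament suspension restores apical support."), 1))
```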
Results: The median DISCERN tool overall rating was 3 (interquartile range [IQR], 3-4), indicating a moderate rating ("potentially important but not serious shortcomings"). Retropubic midurethral sling received the highest overall score (median, 4; IQR, 3-4), and uterosacral ligament suspension received the lowest (median, 3; IQR, 3-3). Using the 4-point accuracy scale, 44.0% of responses received a score of 4 ("correct and adequate"), 22.6% received a score of 3 ("correct but insufficient"), 29.8% received a score of 2 ("accurate and misleading information together"), and 3.6% received a score of 1 ("wrong or irrelevant answer"). ChatGPT performance was poor for discussion of benefits and alternatives for all surgical procedures, with some responses being inaccurate. The mean Flesch-Kincaid Grade Level score for all responses was 17.5 (SD, 2.1), corresponding to a postgraduate reading level.
Conclusions: Overall, ChatGPT generated accurate responses to questions about surgical informed consent. However, portions of some responses were clearly false, highlighting the need for careful review of responses by qualified health care professionals.