Rajam Raghunathan, Anna R Jacobs, Vivek R Sant, Lizabeth J King, Gary Rothberger, Jason Prescott, John Allendorf, Carolyn D Seib, Kepal N Patel, Insoo Suh
{"title":"Can large language models address unmet patient information needs and reduce provider burnout in the management of thyroid disease?","authors":"Rajam Raghunathan, Anna R Jacobs, Vivek R Sant, Lizabeth J King, Gary Rothberger, Jason Prescott, John Allendorf, Carolyn D Seib, Kepal N Patel, Insoo Suh","doi":"10.1016/j.surg.2024.06.075","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Patient electronic messaging has increased clinician workload contributing to burnout. Large language models can respond to these patient queries, but no studies exist on large language model responses in thyroid disease.</p><p><strong>Methods: </strong>This cross-sectional study randomly selected 33 of 52 patient questions found on Reddit/askdocs. Questions were found through a \"thyroid + cancer\" or \"thyroid + disease\" search and had verified-physician responses. Additional responses were generated using ChatGPT-3.5 and GPT-4. Questions and responses were anonymized and graded for accuracy, quality, and empathy using a 4-point Likert scale by blinded providers, including 4 surgeons, 1 endocrinologist, and 2 physician assistants (n = 7). Results were analyzed using a single-factor analysis of variance.</p><p><strong>Results: </strong>For accuracy, the results averaged 2.71/4 (standard deviation 1.04), 3.49/4 (0.391), and 3.66/4 (0.286) for physicians, GPT-3.5, and GPT-4, respectively (P < .01), where 4 = completely true information, 3 = greater than 50% true information, and 2 = less than 50% true information. For quality, the results were 2.37/4 (standard deviation 0.661), 2.98/4 (0.352), and 3.81/4 (0.36) for physicians, GPT-3.5, and GPT-4, respectively (P < .01), where 4 = provided information beyond what was asked, 3 = completely answers the question, and 2 = partially answers the question. 
For empathy, the mean scores were 2.37/4 (standard deviation 0.661), 2.80/4 (0.582), and 3.14/4 (0.578) for physicians, GPT-3.5, and GPT-4, respectively (P < .01), where 4 = anticipates and infers patient feelings from the expressed question, 3 = mirrors the patient's feelings, and 2 = contains no dismissive comments. Responses by GPT were ranked first 95% of the time.</p><p><strong>Conclusions: </strong>Large language model responses to patient queries about thyroid disease have the potential to be more accurate, complete, empathetic, and consistent than physician responses.</p>","PeriodicalId":22152,"journal":{"name":"Surgery","volume":" ","pages":""},"PeriodicalIF":3.2000,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Surgery","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.surg.2024.06.075","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"SURGERY","Score":null,"Total":0}
引用次数: 0
Abstract
Background: Patient electronic messaging has increased clinician workload, contributing to burnout. Large language models can respond to these patient queries, but no studies have examined large language model responses to questions about thyroid disease.
Methods: This cross-sectional study randomly selected 33 of 52 patient questions found on Reddit/askdocs. Questions were identified through a "thyroid + cancer" or "thyroid + disease" search and had verified-physician responses. Additional responses were generated using ChatGPT-3.5 and GPT-4. Questions and responses were anonymized and graded for accuracy, quality, and empathy on a 4-point Likert scale by blinded providers, including 4 surgeons, 1 endocrinologist, and 2 physician assistants (n = 7). Results were analyzed using a single-factor analysis of variance.
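The single-factor (one-way) analysis of variance used to compare the three responder groups can be sketched in plain Python; the function name and example data below are illustrative, not the study's actual grading data:

```python
from statistics import mean

def one_way_anova_f(*groups):
    """F statistic for a single-factor ANOVA across several groups of scores.

    F = (between-group mean square) / (within-group mean square),
    with k - 1 and N - k degrees of freedom respectively.
    """
    all_vals = [x for g in groups for x in g]
    grand_mean = mean(all_vals)
    k = len(groups)            # number of groups (e.g., physician, GPT-3.5, GPT-4)
    n_total = len(all_vals)    # total number of graded responses

    # Between-group sum of squares: variation of group means around the grand mean
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: variation of scores around their own group mean
    ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)

    return (ssb / (k - 1)) / (ssw / (n_total - k))

# Hypothetical Likert-style scores for three groups of responses
f_stat = one_way_anova_f([1, 2, 3], [2, 3, 4], [5, 6, 7])
```

The resulting F statistic is then compared against the F distribution with (k - 1, N - k) degrees of freedom to obtain the P value reported in the Results.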
Results: For accuracy, the results averaged 2.71/4 (standard deviation 1.04), 3.49/4 (0.391), and 3.66/4 (0.286) for physicians, GPT-3.5, and GPT-4, respectively (P < .01), where 4 = completely true information, 3 = greater than 50% true information, and 2 = less than 50% true information. For quality, the results were 2.37/4 (standard deviation 0.661), 2.98/4 (0.352), and 3.81/4 (0.36) for physicians, GPT-3.5, and GPT-4, respectively (P < .01), where 4 = provided information beyond what was asked, 3 = completely answers the question, and 2 = partially answers the question. For empathy, the mean scores were 2.37/4 (standard deviation 0.661), 2.80/4 (0.582), and 3.14/4 (0.578) for physicians, GPT-3.5, and GPT-4, respectively (P < .01), where 4 = anticipates and infers patient feelings from the expressed question, 3 = mirrors the patient's feelings, and 2 = contains no dismissive comments. Responses by GPT were ranked first 95% of the time.
Conclusions: Large language model responses to patient queries about thyroid disease have the potential to be more accurate, complete, empathetic, and consistent than physician responses.
About the journal:
For 66 years, Surgery has published practical, authoritative information about procedures, clinical advances, and major trends shaping general surgery. Each issue features original scientific contributions and clinical reports. Peer-reviewed articles cover topics in oncology, trauma, gastrointestinal, vascular, and transplantation surgery. The journal also publishes papers from the meetings of its sponsoring societies, the Society of University Surgeons, the Central Surgical Association, and the American Association of Endocrine Surgeons.