Do large language model chatbots perform better than established patient information resources in answering patient questions? A comparative study on melanoma.
Nadia C W Kamminga, June E C Kievits, Peter W Plaisier, Jako S Burgers, Astrid M van der Veldt, Jan A G J van den Brand, Mark Mulder, Marlies Wakkee, Marjolein Lugtenberg, Tamar Nijsten
{"title":"Do large language model chatbots perform better than established patient information resources in answering patient questions? A comparative study on melanoma.","authors":"Nadia C W Kamminga, June E C Kievits, Peter W Plaisier, Jako S Burgers, Astrid M van der Veldt, Jan A G J van den Brand, Mark Mulder, Marlies Wakkee, Marjolein Lugtenberg, Tamar Nijsten","doi":"10.1093/bjd/ljae377","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) have a potential role in providing adequate patient information.</p><p><strong>Objectives: </strong>To compare the quality of LLM responses with established Dutch patient information resources (PIRs) in answering patient questions regarding melanoma.</p><p><strong>Methods: </strong>Responses from ChatGPT versions 3.5 and 4.0, Gemini, and three leading Dutch melanoma PIRs to 50 melanoma-specific questions were examined at baseline and for LLMs again after 8 months. Outcomes included (medical) accuracy, completeness, personalization, readability and, additionally, reproducibility for LLMs. Comparative analyses were performed within LLMs and PIRs using Friedman's Anova, and between best-performing LLMs and gold-standard (GS) PIRs using the Wilcoxon signed-rank test.</p><p><strong>Results: </strong>Within LLMs, ChatGPT-3.5 demonstrated the highest accuracy (P = 0.009). Gemini performed best in completeness (P < 0.001), personalization (P = 0.007) and readability (P < 0.001). PIRs were consistent in accuracy and completeness, with the general practitioner's website excelling in personalization (P = 0.013) and readability (P < 0.001). The best-performing LLMs outperformed the GS-PIR on completeness and personalization, yet it was less accurate and less readable. Over time, response reproducibility decreased for all LLMs, showing variability across outcomes.</p><p><strong>Conclusions: </strong>Although LLMs show potential in providing highly personalized and complete responses to patient questions regarding melanoma, improving and safeguarding accuracy, reproducibility and accessibility is crucial before they can replace or complement conventional PIRs.</p>","PeriodicalId":9238,"journal":{"name":"British Journal of Dermatology","volume":" ","pages":"306-315"},"PeriodicalIF":11.0000,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"British Journal of Dermatology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1093/bjd/ljae377","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"DERMATOLOGY","Score":null,"Total":0}
Abstract
Background: Large language models (LLMs) have a potential role in providing adequate patient information.
Objectives: To compare the quality of responses from LLMs with those from established Dutch patient information resources (PIRs) in answering patient questions regarding melanoma.
Methods: Responses from ChatGPT versions 3.5 and 4.0, Gemini, and three leading Dutch melanoma PIRs to 50 melanoma-specific questions were examined at baseline and, for the LLMs, again after 8 months. Outcomes included (medical) accuracy, completeness, personalization, readability and, additionally, reproducibility for the LLMs. Comparative analyses were performed within LLMs and within PIRs using Friedman's ANOVA, and between the best-performing LLMs and gold-standard (GS) PIRs using the Wilcoxon signed-rank test (an illustrative sketch of these tests follows the abstract).
Results: Within LLMs, ChatGPT-3.5 demonstrated the highest accuracy (P = 0.009). Gemini performed best in completeness (P < 0.001), personalization (P = 0.007) and readability (P < 0.001). PIRs were consistent in accuracy and completeness, with the general practitioner's website excelling in personalization (P = 0.013) and readability (P < 0.001). The best-performing LLMs outperformed the GS-PIR on completeness and personalization, yet were less accurate and less readable. Over time, response reproducibility decreased for all LLMs, showing variability across outcomes.
Conclusions: Although LLMs show potential in providing highly personalized and complete responses to patient questions regarding melanoma, improving and safeguarding accuracy, reproducibility and accessibility is crucial before they can replace or complement conventional PIRs.
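The two statistical comparisons named in the Methods can be illustrated with a minimal Python sketch using SciPy's implementations of Friedman's ANOVA and the Wilcoxon signed-rank test. The scores, scale and resource names below are hypothetical placeholders, not the study's data or the authors' analysis code.

```python
# Illustrative sketch only: per-question quality scores are simulated, not taken
# from the study. The structure mirrors the abstract's description: repeated
# measures over the same 50 melanoma questions.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
n_questions = 50  # one score per melanoma-specific question

# Hypothetical accuracy scores (e.g. on a 1-5 scale) per LLM, per question.
scores = {
    "chatgpt_3_5": rng.integers(3, 6, n_questions),
    "chatgpt_4_0": rng.integers(3, 6, n_questions),
    "gemini": rng.integers(2, 6, n_questions),
}

# Within-group comparison: Friedman's ANOVA across the three LLMs,
# treating the 50 questions as repeated measures.
stat, p_within = friedmanchisquare(*scores.values())
print(f"Friedman chi-square = {stat:.2f}, P = {p_within:.3f}")

# Best-performing LLM vs gold-standard PIR: Wilcoxon signed-rank test on
# paired per-question scores (the GS-PIR scores here are also placeholders).
gs_pir = rng.integers(3, 6, n_questions)
stat, p_pair = wilcoxon(scores["chatgpt_3_5"], gs_pir)
print(f"Wilcoxon statistic = {stat:.1f}, P = {p_pair:.3f}")
```

The same pattern would be repeated for each outcome (accuracy, completeness, personalization, readability), which is why different LLMs can be "best-performing" on different measures.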
About the journal:
The British Journal of Dermatology (BJD) is committed to publishing the highest quality dermatological research. Through its publications, the journal seeks to advance the understanding, management, and treatment of skin diseases, ultimately aiming to improve patient outcomes.