Hannah Labinsky, Lea-Kristin Nagler, Martin Krusche, Sebastian Griewing, Peer Aries, Anja Kroiß, Patrick-Pascal Strunz, Sebastian Kuhn, Marc Schmalzing, Michael Gernert, Johannes Knitza
{"title":"Vignette-based comparative analysis of ChatGPT and specialist treatment decisions for rheumatic patients: results of the Rheum2Guide study.","authors":"Hannah Labinsky, Lea-Kristin Nagler, Martin Krusche, Sebastian Griewing, Peer Aries, Anja Kroiß, Patrick-Pascal Strunz, Sebastian Kuhn, Marc Schmalzing, Michael Gernert, Johannes Knitza","doi":"10.1007/s00296-024-05675-5","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The complex nature of rheumatic diseases poses considerable challenges for clinicians when developing individualized treatment plans. Large language models (LLMs) such as ChatGPT could enable treatment decision support.</p><p><strong>Objective: </strong>To compare treatment plans generated by ChatGPT-3.5 and GPT-4 to those of a clinical rheumatology board (RB).</p><p><strong>Design/methods: </strong>Fictional patient vignettes were created and GPT-3.5, GPT-4, and the RB were queried to provide respective first- and second-line treatment plans with underlying justifications. Four rheumatologists from different centers, blinded to the origin of treatment plans, selected the overall preferred treatment concept and assessed treatment plans' safety, EULAR guideline adherence, medical adequacy, overall quality, justification of the treatment plans and their completeness as well as patient vignette difficulty using a 5-point Likert scale.</p><p><strong>Results: </strong>20 fictional vignettes covering various rheumatic diseases and varying difficulty levels were assembled and a total of 160 ratings were assessed. In 68.8% (110/160) of cases, raters preferred the RB's treatment plans over those generated by GPT-4 (16.3%; 26/160) and GPT-3.5 (15.0%; 24/160). GPT-4's plans were chosen more frequently for first-line treatments compared to GPT-3.5. No significant safety differences were observed between RB and GPT-4's first-line treatment plans. Rheumatologists' plans received significantly higher ratings in guideline adherence, medical appropriateness, completeness and overall quality. Ratings did not correlate with the vignette difficulty. LLM-generated plans were notably longer and more detailed.</p><p><strong>Conclusion: </strong>GPT-4 and GPT-3.5 generated safe, high-quality treatment plans for rheumatic diseases, demonstrating promise in clinical decision support. Future research should investigate detailed standardized prompts and the impact of LLM usage on clinical decisions.</p>","PeriodicalId":21322,"journal":{"name":"Rheumatology International","volume":" ","pages":"2043-2053"},"PeriodicalIF":3.2000,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11392980/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Rheumatology International","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s00296-024-05675-5","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/8/10 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"RHEUMATOLOGY","Score":null,"Total":0}
引用次数: 0
Abstract
Background: The complex nature of rheumatic diseases poses considerable challenges for clinicians when developing individualized treatment plans. Large language models (LLMs) such as ChatGPT could enable treatment decision support.
Objective: To compare treatment plans generated by ChatGPT-3.5 and GPT-4 to those of a clinical rheumatology board (RB).
Design/methods: Fictional patient vignettes were created and GPT-3.5, GPT-4, and the RB were queried to provide respective first- and second-line treatment plans with underlying justifications. Four rheumatologists from different centers, blinded to the origin of treatment plans, selected the overall preferred treatment concept and assessed treatment plans' safety, EULAR guideline adherence, medical adequacy, overall quality, justification of the treatment plans and their completeness as well as patient vignette difficulty using a 5-point Likert scale.
Results: Twenty fictional vignettes covering various rheumatic diseases and varying difficulty levels were assembled, and a total of 160 ratings were obtained. In 68.8% (110/160) of cases, raters preferred the RB's treatment plans over those generated by GPT-4 (16.3%; 26/160) and GPT-3.5 (15.0%; 24/160). GPT-4's plans were chosen more frequently for first-line treatments than GPT-3.5's. No significant safety differences were observed between the RB's and GPT-4's first-line treatment plans. Rheumatologists' plans received significantly higher ratings for guideline adherence, medical appropriateness, completeness, and overall quality. Ratings did not correlate with vignette difficulty. LLM-generated plans were notably longer and more detailed.
Conclusion: GPT-4 and GPT-3.5 generated safe, high-quality treatment plans for rheumatic diseases, demonstrating promise in clinical decision support. Future research should investigate detailed standardized prompts and the impact of LLM usage on clinical decisions.
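To illustrate how such vignette-based queries might be issued programmatically, the minimal sketch below sends a fictional patient vignette to GPT-3.5 and GPT-4 and requests first- and second-line treatment plans with justifications. The prompt wording, the vignette text, and the helper function are illustrative assumptions only; they are not the standardized prompts used in the Rheum2Guide study.

```python
# Minimal sketch (assumption): querying GPT models for first- and second-line
# treatment plans from a fictional vignette. Prompt and vignette are illustrative
# and do not reproduce the prompts used in the Rheum2Guide study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

VIGNETTE = (
    "Fictional case: 52-year-old woman with seropositive rheumatoid arthritis, "
    "active disease despite methotrexate 25 mg/week, no relevant comorbidities."
)

PROMPT = (
    "For the following fictional patient vignette, propose a first-line and a "
    "second-line treatment plan, each with a brief justification referencing "
    "current EULAR recommendations.\n\n" + VIGNETTE
)

def get_treatment_plan(model: str) -> str:
    """Return the model's free-text treatment plan for the vignette."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,  # deterministic output eases side-by-side comparison
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    for model in ("gpt-3.5-turbo", "gpt-4"):
        print(f"--- {model} ---")
        print(get_treatment_plan(model))
```

The free-text plans returned this way could then be blinded and rated alongside the rheumatology board's plans, for example on 5-point Likert scales as described in the methods.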
Journal description
RHEUMATOLOGY INTERNATIONAL is an independent journal reflecting world-wide progress in the research, diagnosis and treatment of the various rheumatic diseases. It is designed to serve researchers and clinicians in the field of rheumatology.
RHEUMATOLOGY INTERNATIONAL will cover all modern trends in clinical research as well as in the management of rheumatic diseases. Special emphasis will be given to public health issues related to rheumatic diseases, applying rheumatology research to clinical practice, epidemiology of rheumatic diseases, diagnostic tests for rheumatic diseases, patient-reported outcomes (PROs) in rheumatology, and evidence on rheumatology education. Contributions to these topics will appear in the form of original publications, short communications, editorials, and reviews. "Letters to the editor" are welcome as an enhancement to discussion. Submission of basic science research, including in vitro or animal studies, is discouraged, as only studies on humans with an epidemiological or clinical perspective will be reviewed. Case reports without a proper review of the literature (Case-based Reviews) will not be published. Every effort will be made to ensure speed of publication while maintaining a high standard of contents and production.
Manuscripts submitted for publication must contain a statement to the effect that all human studies have been reviewed by the appropriate ethics committee and have therefore been performed in accordance with the ethical standards laid down in an appropriate version of the 1964 Declaration of Helsinki. It should also be stated clearly in the text that all persons gave their informed consent prior to their inclusion in the study. Details that might disclose the identity of the subjects under study should be omitted.