Hannah Labinsky, Lea-Kristin Nagler, Martin Krusche, Sebastian Griewing, Peer Aries, Anja Kroiß, Patrick-Pascal Strunz, Sebastian Kuhn, Marc Schmalzing, Michael Gernert, Johannes Knitza
{"title":"Vignette-based comparative analysis of ChatGPT and specialist treatment decisions for rheumatic patients: results of the Rheum2Guide study.","authors":"Hannah Labinsky, Lea-Kristin Nagler, Martin Krusche, Sebastian Griewing, Peer Aries, Anja Kroiß, Patrick-Pascal Strunz, Sebastian Kuhn, Marc Schmalzing, Michael Gernert, Johannes Knitza","doi":"10.1007/s00296-024-05675-5","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The complex nature of rheumatic diseases poses considerable challenges for clinicians when developing individualized treatment plans. Large language models (LLMs) such as ChatGPT could enable treatment decision support.</p><p><strong>Objective: </strong>To compare treatment plans generated by ChatGPT-3.5 and GPT-4 to those of a clinical rheumatology board (RB).</p><p><strong>Design/methods: </strong>Fictional patient vignettes were created and GPT-3.5, GPT-4, and the RB were queried to provide respective first- and second-line treatment plans with underlying justifications. Four rheumatologists from different centers, blinded to the origin of treatment plans, selected the overall preferred treatment concept and assessed treatment plans' safety, EULAR guideline adherence, medical adequacy, overall quality, justification of the treatment plans and their completeness as well as patient vignette difficulty using a 5-point Likert scale.</p><p><strong>Results: </strong>20 fictional vignettes covering various rheumatic diseases and varying difficulty levels were assembled and a total of 160 ratings were assessed. In 68.8% (110/160) of cases, raters preferred the RB's treatment plans over those generated by GPT-4 (16.3%; 26/160) and GPT-3.5 (15.0%; 24/160). GPT-4's plans were chosen more frequently for first-line treatments compared to GPT-3.5. No significant safety differences were observed between RB and GPT-4's first-line treatment plans. Rheumatologists' plans received significantly higher ratings in guideline adherence, medical appropriateness, completeness and overall quality. Ratings did not correlate with the vignette difficulty. LLM-generated plans were notably longer and more detailed.</p><p><strong>Conclusion: </strong>GPT-4 and GPT-3.5 generated safe, high-quality treatment plans for rheumatic diseases, demonstrating promise in clinical decision support. Future research should investigate detailed standardized prompts and the impact of LLM usage on clinical decisions.</p>","PeriodicalId":21322,"journal":{"name":"Rheumatology International","volume":" ","pages":"2043-2053"},"PeriodicalIF":3.2000,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11392980/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Rheumatology International","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s00296-024-05675-5","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/8/10 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"RHEUMATOLOGY","Score":null,"Total":0}
引用次数: 0
Abstract
Background: The complex nature of rheumatic diseases poses considerable challenges for clinicians when developing individualized treatment plans. Large language models (LLMs) such as ChatGPT could enable treatment decision support.
Objective: To compare treatment plans generated by ChatGPT-3.5 and GPT-4 to those of a clinical rheumatology board (RB).
Design/methods: Fictional patient vignettes were created and GPT-3.5, GPT-4, and the RB were queried to provide respective first- and second-line treatment plans with underlying justifications. Four rheumatologists from different centers, blinded to the origin of treatment plans, selected the overall preferred treatment concept and assessed treatment plans' safety, EULAR guideline adherence, medical adequacy, overall quality, justification of the treatment plans and their completeness as well as patient vignette difficulty using a 5-point Likert scale.
Results: Twenty fictional vignettes covering various rheumatic diseases and varying difficulty levels were assembled, and a total of 160 ratings were obtained. In 68.8% (110/160) of cases, raters preferred the RB's treatment plans over those generated by GPT-4 (16.3%; 26/160) and GPT-3.5 (15.0%; 24/160). GPT-4's plans were chosen more frequently for first-line treatments than GPT-3.5's. No significant safety differences were observed between the RB's and GPT-4's first-line treatment plans. Rheumatologists' plans received significantly higher ratings for guideline adherence, medical appropriateness, completeness, and overall quality. Ratings did not correlate with vignette difficulty. LLM-generated plans were notably longer and more detailed.
Conclusion: GPT-4 and GPT-3.5 generated safe, high-quality treatment plans for rheumatic diseases, demonstrating promise in clinical decision support. Future research should investigate detailed standardized prompts and the impact of LLM usage on clinical decisions.
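To illustrate how such vignette-based queries might be issued programmatically, the minimal sketch below sends a fictional patient vignette to GPT-3.5 and GPT-4 and requests first- and second-line treatment plans with justifications. The prompt wording, the vignette text, and the helper function are illustrative assumptions only; they are not the standardized prompts used in the Rheum2Guide study.

```python
# Minimal sketch (assumption): querying GPT models for first- and second-line
# treatment plans from a fictional vignette. Prompt and vignette are illustrative
# and do not reproduce the prompts used in the Rheum2Guide study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

VIGNETTE = (
    "Fictional case: 52-year-old woman with seropositive rheumatoid arthritis, "
    "active disease despite methotrexate 25 mg/week, no relevant comorbidities."
)

PROMPT = (
    "For the following fictional patient vignette, propose a first-line and a "
    "second-line treatment plan, each with a brief justification referencing "
    "current EULAR recommendations.\n\n" + VIGNETTE
)

def get_treatment_plan(model: str) -> str:
    """Return the model's free-text treatment plan for the vignette."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,  # deterministic output eases side-by-side comparison
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    for model in ("gpt-3.5-turbo", "gpt-4"):
        print(f"--- {model} ---")
        print(get_treatment_plan(model))
```

The free-text plans returned this way could then be blinded and rated alongside the rheumatology board's plans, for example on 5-point Likert scales as described in the methods.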
Journal description
RHEUMATOLOGY INTERNATIONAL is an independent journal reflecting world-wide progress in the research, diagnosis and treatment of the various rheumatic diseases. It is designed to serve researchers and clinicians in the field of rheumatology.
RHEUMATOLOGY INTERNATIONAL will cover all modern trends in clinical research as well as in the management of rheumatic diseases. Special emphasis will be given to public health issues related to rheumatic diseases, applying rheumatology research to clinical practice, epidemiology of rheumatic diseases, diagnostic tests for rheumatic diseases, patient-reported outcomes (PROs) in rheumatology, and evidence on rheumatology education. Contributions to these topics will appear in the form of original publications, short communications, editorials, and reviews. "Letters to the editor" are welcome as an enhancement to discussion. Submission of basic science research, including in vitro or animal studies, is discouraged, as only studies on humans with an epidemiological or clinical perspective will be reviewed. Case reports without a proper review of the literature (Case-based Reviews) will not be published. Every effort will be made to ensure speed of publication while maintaining a high standard of contents and production.
Manuscripts submitted for publication must contain a statement to the effect that all human studies have been reviewed by the appropriate ethics committee and have therefore been performed in accordance with the ethical standards laid down in an appropriate version of the 1964 Declaration of Helsinki. It should also be stated clearly in the text that all persons gave their informed consent prior to their inclusion in the study. Details that might disclose the identity of the subjects under study should be omitted.