Large Language Model Influence on Management Reasoning: A Randomized Controlled Trial

Ethan Goh, Robert Gallo, Eric Strong, Yingjie Weng, Hannah Kerman, Jason Freed, Josephine A Cool, Zahir Kanjee, Kathleen Lane, Andrew S Parsons, Neera Ahuja, Eric Horvitz, Daniel Yang, Arnold Milstein, Andrew PJ Olson, Jason Hom, Jonathan H. Chen, Adam Rodman

medRxiv - Health Informatics | Published 2024-08-07 | DOI: 10.1101/2024.08.05.24311485
Abstract
Importance: Large language model (LLM) artificial intelligence (AI) systems have shown promise in diagnostic reasoning, but their utility in management reasoning, where there is often no single right answer, is unknown.
Objective: To determine whether LLM assistance improves physician performance on open-ended management reasoning tasks compared to conventional resources.
Design: Prospective, randomized controlled trial conducted from 30 November 2023 to 21 April 2024.
Setting: Multi-institutional study from Stanford University, Beth Israel Deaconess Medical Center, and the University of Virginia involving physicians from across the United States.
Participants: 92 practicing attending physicians and residents with training in internal medicine, family medicine, or emergency medicine.
Intervention: Five expert-developed clinical case vignettes were presented with multiple open-ended management questions and scoring rubrics created through a Delphi process. Physicians were randomized to use either GPT-4 via ChatGPT Plus in addition to conventional resources (e.g., UpToDate, Google), or conventional resources alone.
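To make the allocation step concrete, below is a minimal sketch of simple 1:1 randomization between the two arms. The abstract does not describe the actual procedure (blocking, stratification, or allocation concealment), so the arm labels follow the abstract while the seed and participant IDs are hypothetical.

```python
import random

# A minimal sketch of simple 1:1 allocation to the two arms named in the
# abstract. The trial's actual randomization procedure is not described,
# so everything below is illustrative.
ARMS = (
    "GPT-4 via ChatGPT Plus + conventional resources",
    "conventional resources alone",
)

def randomize(participant_ids, seed=0):
    """Shuffle the IDs reproducibly and split them evenly across the arms."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return {pid: ARMS[0] if i < half else ARMS[1] for i, pid in enumerate(ids)}

assignments = randomize(range(92))  # 92 physicians, as reported in the abstract
```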
Main Outcomes and Measures: The primary outcome was the difference in total score between groups on expert-developed scoring rubrics. Secondary outcomes included domain-specific scores and time spent per case.
Results: Physicians using the LLM scored higher than those using conventional resources alone (mean difference 6.5%, 95% CI 2.7-10.2, p<0.001). Significant improvements were seen in the management decisions (6.1%, 95% CI 2.5-9.7, p=0.001), diagnostic decisions (12.1%, 95% CI 3.1-21.0, p=0.009), and case-specific (6.2%, 95% CI 2.4-9.9, p=0.002) domains. GPT-4 users spent more time per case (mean difference 119.3 seconds, 95% CI 17.4-221.2, p=0.02). There was no significant difference between GPT-4-augmented physicians and GPT-4 alone (-0.9%, 95% CI -9.0 to 7.2, p=0.8).
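As a rough illustration of how a between-group mean difference with a 95% CI and p-value can be computed, here is a sketch using Welch's two-sample t-test on synthetic scores. The abstract does not specify the trial's actual statistical model, and every number in this sketch is fabricated for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic per-physician total scores (percent of rubric points); the means
# and spread below are made up for illustration and are not the trial's data.
rng = np.random.default_rng(42)
llm = rng.normal(loc=70.0, scale=9.0, size=46)  # GPT-4 + conventional resources
ctl = rng.normal(loc=63.5, scale=9.0, size=46)  # conventional resources alone

diff = llm.mean() - ctl.mean()

# Welch's unequal-variance standard error and degrees of freedom
v1 = llm.var(ddof=1) / llm.size
v2 = ctl.var(ddof=1) / ctl.size
se = np.sqrt(v1 + v2)
df = (v1 + v2) ** 2 / (v1**2 / (llm.size - 1) + v2**2 / (ctl.size - 1))

t_crit = stats.t.ppf(0.975, df)         # two-sided 95% interval
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se
p = 2 * stats.t.sf(abs(diff / se), df)  # Welch's t-test p-value

print(f"mean difference {diff:.1f}% (95% CI {ci_low:.1f} to {ci_high:.1f}), p={p:.3g}")
```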
Conclusions and Relevance: LLM assistance improved physician management reasoning compared to conventional resources, with particular gains in contextual and patient-specific decision-making. These findings indicate that LLMs can augment management decision-making in complex cases.
Trial Registration: ClinicalTrials.gov Identifier: NCT06208423; https://classic.clinicaltrials.gov/ct2/show/NCT06208423