Large Language Model Augmented Clinical Trial Screening

Jacob Beattie, Dylan Owens, Ann Marie Navar, Luiza Giuliani Schmitt, Kimberly Taing, Sarah Neufeld, Daniel Yang, Christian Chukwuma, Ahmed Gul, Dong Soo Lee, Neil Desai, Dominic Moon, Jing Wang, Steve Jiang, Michael Dohopolski
{"title":"大语言模型辅助临床试验筛选","authors":"Jacob Beattie, Dylan Owens, Ann Marie Navar, Luiza Giuliani Schmitt, Kimberly Taing, Sarah Neufeld, Daniel Yang, Christian Chukwuma, Ahmed Gul, Dong Soo Lee, Neil Desai, Dominic Moon, Jing Wang, Steve Jiang, Michael Dohopolski","doi":"10.1101/2024.08.27.24312646","DOIUrl":null,"url":null,"abstract":"Purpose: Identifying potential participants for clinical trials using traditional manual screening methods is time-consuming and expensive. Structured data in electronic health records (EHR) are often insufficient to capture trial inclusion and exclusion criteria adequately. Large language models (LLMs) offer the potential for improved participant screening by searching text notes in the EHR, but optimal deployment strategies remain unclear.\nMethods: We evaluated the performance of GPT-3.5 and GPT-4 in screening a cohort of 74 patients (35 eligible, 39 ineligible) using EHR data, including progress notes, pathology reports, and imaging reports, for a phase 2 clinical trial in patients with head and neck cancer. Fourteen trial criteria were evaluated, including stage, histology, prior treatments, underlying conditions, functional status, etc. Manually annotated data served as the ground truth. We tested three prompting approaches (Structured Output (SO), Chain of Thought (CoT), and Self-Discover (SD)). SO and CoT were further tested using expert and LLM guidance (EG and LLM-G, respectively). Prompts were developed and refined using 10 patients from each cohort and then assessed on the remaining 54 patients. Each approach was assessed for accuracy, sensitivity, specificity, and micro F1 score. We explored two eligibility predictions: strict eligibility required meeting all criteria, while proportional eligibility used the proportion of criteria met. Screening time and cost were measured, and a failure analysis identified common misclassification issues. Results: Fifty-four patients were evaluated (25 enrolled, 29 not enrolled). At the criterion level, GPT-3.5 showed a median accuracy of 0.761 (range: 0.554-0.910), with the Structured Out- put + EG approach performing best. GPT-4 demonstrated a median accuracy of 0.838 (range: 0.758-0.886), with the Self-Discover approach achieving the highest Youden Index of 0.729. For strict patient-level eligibility, GPT-3.5's Structured Output + EG approach reached an accuracy of 0.611, while GPT-4's CoT + EG achieved 0.65. Proportional eligibility performed better over- all, with GPT-4's CoT + LLM-G approach having the highest AUC (0.82) and Youden Index (0.60). Screening times ranged from 1.4 to 3 minutes per patient for GPT-3.5 and 7.9 to 12.4 minutes for GPT-4, with costs of $0.02-$0.03 for GPT-3.5 and $0.15-$0.27 for GPT-4.\nConclusion: LLMs can be used to identify specific clinical trial criteria but had difficulties identifying patients who met all criteria. Instead, using the proportion of criteria met to flag candidates for manual review maybe a more practical approach. LLM performance varies by prompt, with GPT-4 generally outperforming GPT-3.5, but at higher costs and longer processing times. 
LLMs should complement, not replace, manual chart reviews for matching patients to clinical trials.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":"22 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Large Language Model Augmented Clinical Trial Screening\",\"authors\":\"Jacob Beattie, Dylan Owens, Ann Marie Navar, Luiza Giuliani Schmitt, Kimberly Taing, Sarah Neufeld, Daniel Yang, Christian Chukwuma, Ahmed Gul, Dong Soo Lee, Neil Desai, Dominic Moon, Jing Wang, Steve Jiang, Michael Dohopolski\",\"doi\":\"10.1101/2024.08.27.24312646\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Purpose: Identifying potential participants for clinical trials using traditional manual screening methods is time-consuming and expensive. Structured data in electronic health records (EHR) are often insufficient to capture trial inclusion and exclusion criteria adequately. Large language models (LLMs) offer the potential for improved participant screening by searching text notes in the EHR, but optimal deployment strategies remain unclear.\\nMethods: We evaluated the performance of GPT-3.5 and GPT-4 in screening a cohort of 74 patients (35 eligible, 39 ineligible) using EHR data, including progress notes, pathology reports, and imaging reports, for a phase 2 clinical trial in patients with head and neck cancer. Fourteen trial criteria were evaluated, including stage, histology, prior treatments, underlying conditions, functional status, etc. Manually annotated data served as the ground truth. We tested three prompting approaches (Structured Output (SO), Chain of Thought (CoT), and Self-Discover (SD)). SO and CoT were further tested using expert and LLM guidance (EG and LLM-G, respectively). Prompts were developed and refined using 10 patients from each cohort and then assessed on the remaining 54 patients. Each approach was assessed for accuracy, sensitivity, specificity, and micro F1 score. We explored two eligibility predictions: strict eligibility required meeting all criteria, while proportional eligibility used the proportion of criteria met. Screening time and cost were measured, and a failure analysis identified common misclassification issues. Results: Fifty-four patients were evaluated (25 enrolled, 29 not enrolled). At the criterion level, GPT-3.5 showed a median accuracy of 0.761 (range: 0.554-0.910), with the Structured Out- put + EG approach performing best. GPT-4 demonstrated a median accuracy of 0.838 (range: 0.758-0.886), with the Self-Discover approach achieving the highest Youden Index of 0.729. For strict patient-level eligibility, GPT-3.5's Structured Output + EG approach reached an accuracy of 0.611, while GPT-4's CoT + EG achieved 0.65. Proportional eligibility performed better over- all, with GPT-4's CoT + LLM-G approach having the highest AUC (0.82) and Youden Index (0.60). Screening times ranged from 1.4 to 3 minutes per patient for GPT-3.5 and 7.9 to 12.4 minutes for GPT-4, with costs of $0.02-$0.03 for GPT-3.5 and $0.15-$0.27 for GPT-4.\\nConclusion: LLMs can be used to identify specific clinical trial criteria but had difficulties identifying patients who met all criteria. Instead, using the proportion of criteria met to flag candidates for manual review maybe a more practical approach. 
LLM performance varies by prompt, with GPT-4 generally outperforming GPT-3.5, but at higher costs and longer processing times. LLMs should complement, not replace, manual chart reviews for matching patients to clinical trials.\",\"PeriodicalId\":501454,\"journal\":{\"name\":\"medRxiv - Health Informatics\",\"volume\":\"22 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"medRxiv - Health Informatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1101/2024.08.27.24312646\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"medRxiv - Health Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2024.08.27.24312646","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Purpose: Identifying potential participants for clinical trials using traditional manual screening methods is time-consuming and expensive. Structured data in electronic health records (EHRs) are often insufficient to capture trial inclusion and exclusion criteria adequately. Large language models (LLMs) offer the potential to improve participant screening by searching free-text notes in the EHR, but optimal deployment strategies remain unclear.

Methods: We evaluated the performance of GPT-3.5 and GPT-4 in screening a cohort of 74 patients (35 eligible, 39 ineligible) for a phase 2 clinical trial in patients with head and neck cancer, using EHR data including progress notes, pathology reports, and imaging reports. Fourteen trial criteria were evaluated, including stage, histology, prior treatments, underlying conditions, and functional status. Manually annotated data served as the ground truth. We tested three prompting approaches: Structured Output (SO), Chain of Thought (CoT), and Self-Discover (SD). SO and CoT were further tested with expert guidance and LLM guidance (EG and LLM-G, respectively). Prompts were developed and refined using 10 patients from each cohort and then assessed on the remaining 54 patients. Each approach was assessed for accuracy, sensitivity, specificity, and micro F1 score. We explored two eligibility predictions: strict eligibility required meeting all criteria, while proportional eligibility used the proportion of criteria met. Screening time and cost were measured, and a failure analysis identified common misclassification issues.

Results: Fifty-four patients were evaluated (25 enrolled, 29 not enrolled). At the criterion level, GPT-3.5 showed a median accuracy of 0.761 (range: 0.554-0.910), with the Structured Output + EG approach performing best. GPT-4 demonstrated a median accuracy of 0.838 (range: 0.758-0.886), with the Self-Discover approach achieving the highest Youden index of 0.729. For strict patient-level eligibility, GPT-3.5's Structured Output + EG approach reached an accuracy of 0.611, while GPT-4's CoT + EG reached 0.65. Proportional eligibility performed better overall, with GPT-4's CoT + LLM-G approach achieving the highest AUC (0.82) and Youden index (0.60). Screening times ranged from 1.4 to 3 minutes per patient for GPT-3.5 and 7.9 to 12.4 minutes for GPT-4, with costs of $0.02-$0.03 per patient for GPT-3.5 and $0.15-$0.27 for GPT-4.

Conclusion: LLMs can identify specific clinical trial criteria but have difficulty identifying patients who meet all criteria. Using the proportion of criteria met to flag candidates for manual review may be a more practical approach. LLM performance varies by prompt, with GPT-4 generally outperforming GPT-3.5, but at higher cost and longer processing time. LLMs should complement, not replace, manual chart reviews for matching patients to clinical trials.
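The abstract does not reproduce the study's prompts or model configuration, but a per-criterion, Structured Output style of screening might look roughly like the sketch below. The criterion wording, JSON schema, the `screen_criterion` helper, and the model settings are illustrative assumptions, not the authors' implementation; only the OpenAI chat-completions call itself is a standard SDK usage.

```python
import json
from openai import OpenAI  # assumes the openai>=1.0 Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def screen_criterion(criterion: str, note_text: str, model: str = "gpt-4") -> dict:
    """Ask the model whether one trial criterion is met, returning a small JSON verdict.

    Illustrative only: the study's actual prompts, guidance text, and output schema
    are not described in the abstract.
    """
    system = (
        "You are screening a patient for a clinical trial. "
        "Answer using only the provided chart text. "
        'Respond with JSON: {"met": true|false, "evidence": "<short quote or summary>"}'
    )
    user = f"Criterion: {criterion}\n\nChart excerpt:\n{note_text}"
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    raw = resp.choices[0].message.content
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # If the model does not return valid JSON, leave the criterion undecided
        # so it can be flagged for manual review.
        return {"met": None, "evidence": raw}


if __name__ == "__main__":
    verdict = screen_criterion(
        "Histologically confirmed squamous cell carcinoma of the head and neck",
        "Pathology: biopsy of right tonsil shows invasive squamous cell carcinoma...",
    )
    print(verdict)
```

The expert and LLM guidance variants (EG, LLM-G) tested in the study would presumably amount to additional instruction text appended to such a prompt, though the abstract does not specify the form this guidance took.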
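The two patient-level decision rules compared in the abstract, strict and proportional eligibility, reduce to simple aggregation over the fourteen per-criterion verdicts. A minimal sketch follows, assuming each criterion has already been mapped to a boolean "met" flag; the criterion names and the 0.5 review threshold are placeholders, not values from the study.

```python
from typing import Mapping


def strict_eligibility(criteria_met: Mapping[str, bool]) -> bool:
    """Strict rule: the patient is eligible only if every criterion is met."""
    return all(criteria_met.values())


def proportional_eligibility(criteria_met: Mapping[str, bool]) -> float:
    """Proportional rule: score the patient by the fraction of criteria met."""
    return sum(criteria_met.values()) / len(criteria_met)


def flag_for_review(criteria_met: Mapping[str, bool], threshold: float = 0.5) -> bool:
    """Flag candidates whose proportional score reaches a chosen threshold for manual chart review."""
    return proportional_eligibility(criteria_met) >= threshold


# Example with three of the fourteen criteria (names are illustrative):
verdicts = {"stage_III_IVB": True, "ecog_0_1": True, "no_prior_radiation": False}
print(strict_eligibility(verdicts))        # False -- one criterion failed
print(proportional_eligibility(verdicts))  # 0.666...
print(flag_for_review(verdicts))           # True -- worth a manual look
```

The Youden index reported in the Results is sensitivity + specificity − 1, and ranking patients by the proportional score is what allows a patient-level AUC to be computed.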