Development of Prompt Templates for Large Language Model-Driven Screening in Systematic Reviews.

IF 15.2 · CAS Tier 1 (Medicine) · Q1 MEDICINE, GENERAL & INTERNAL · Annals of Internal Medicine · Pub Date: 2025-03-01 · Epub Date: 2025-02-25 · DOI: 10.7326/ANNALS-24-02189
Christian Cao, Jason Sang, Rohit Arora, David Chen, Robert Kloosterman, Matthew Cecere, Jaswanth Gorla, Richard Saleh, Ian Drennan, Bijan Teja, Michael Fehlings, Paul Ronksley, Alexander A Leung, Dany E Weisz, Harriet Ware, Mairead Whelan, David B Emerson, Rahul K Arora, Niklas Bobrovitz
{"title":"Development of Prompt Templates for Large Language Model-Driven Screening in Systematic Reviews.","authors":"Christian Cao, Jason Sang, Rohit Arora, David Chen, Robert Kloosterman, Matthew Cecere, Jaswanth Gorla, Richard Saleh, Ian Drennan, Bijan Teja, Michael Fehlings, Paul Ronksley, Alexander A Leung, Dany E Weisz, Harriet Ware, Mairead Whelan, David B Emerson, Rahul K Arora, Niklas Bobrovitz","doi":"10.7326/ANNALS-24-02189","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Systematic reviews (SRs) are hindered by the initial rigorous article screen, which delays access to reliable information synthesis.</p><p><strong>Objective: </strong>To develop generic prompt templates for large language model (LLM)-driven abstract and full-text screening that can be adapted to different reviews.</p><p><strong>Design: </strong>Diagnostic test accuracy.</p><p><strong>Setting: </strong>48 425 citations were tested for abstract screening across 10 SRs. Full-text screening evaluated all 12 690 freely available articles from the original search. Prompt development used the GPT4-0125-preview model (OpenAI).</p><p><strong>Participants: </strong>None.</p><p><strong>Measurements: </strong>Large language models were prompted to include or exclude articles based on SR eligibility criteria. Model outputs were compared with original SR author decisions after full-text screening to evaluate performance (accuracy, sensitivity, and specificity).</p><p><strong>Results: </strong>Optimized prompts using GPT4-0125-preview achieved a weighted sensitivity of 97.7% (range, 86.7% to 100%) and specificity of 85.2% (range, 68.3% to 95.9%) in abstract screening and weighted sensitivity of 96.5% (range, 89.7% to 100.0%) and specificity of 91.2% (range, 80.7% to 100%) in full-text screening across 10 SRs. In contrast, zero-shot prompts had poor sensitivity (49.0% abstract, 49.1% full-text). Across LLMs, Claude-3.5 (Anthropic) and GPT4 variants had similar performance, whereas Gemini Pro (Google) and GPT3.5 (OpenAI) models underperformed. Direct screening costs for 10 000 citations differed substantially: Where single human abstract screening was estimated to require more than 83 hours and $1666.67 USD, our LLM-based approach completed screening in under 1 day for $157.02 USD.</p><p><strong>Limitations: </strong>Further prompt optimizations may exist. Retrospective study. Convenience sample of SRs. Full-text screening evaluations were limited to free PubMed Central full-text articles.</p><p><strong>Conclusion: </strong>A generic prompt for abstract and full-text screening achieving high sensitivity and specificity that can be adapted to other SRs and LLMs was developed. 
Our prompting innovations may have value to SR investigators and researchers conducting similar criteria-based tasks across the medical sciences.</p><p><strong>Primary funding source: </strong>None.</p>","PeriodicalId":7932,"journal":{"name":"Annals of Internal Medicine","volume":" ","pages":"389-401"},"PeriodicalIF":15.2000,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Annals of Internal Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.7326/ANNALS-24-02189","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/2/25 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"MEDICINE, GENERAL & INTERNAL","Score":null,"Total":0}
Citations: 0

Abstract

Background: Systematic reviews (SRs) are hindered by the initial rigorous article screen, which delays access to reliable information synthesis.

Objective: To develop generic prompt templates for large language model (LLM)-driven abstract and full-text screening that can be adapted to different reviews.

Design: Diagnostic test accuracy.

Setting: 48 425 citations were tested for abstract screening across 10 SRs. Full-text screening evaluated all 12 690 freely available articles from the original search. Prompt development used the GPT4-0125-preview model (OpenAI).

Participants: None.

Measurements: Large language models were prompted to include or exclude articles based on SR eligibility criteria. Model outputs were compared with original SR author decisions after full-text screening to evaluate performance (accuracy, sensitivity, and specificity).
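
As an illustration of this measurement setup, a minimal sketch of criteria-based screening against an LLM API follows. The prompt wording, the screen_record helper, and the one-word INCLUDE/EXCLUDE output format are hypothetical simplifications for illustration, not the study's published prompt templates; only the model identifier matches the GPT4-0125-preview model named above.

    # Illustrative sketch only -- not the study's published prompt template.
    # Requires the `openai` package and OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Hypothetical generic template: per-review eligibility criteria and the
    # candidate title/abstract are substituted in before each call.
    TEMPLATE = (
        "You are screening citations for a systematic review.\n\n"
        "Eligibility criteria:\n{criteria}\n\n"
        "Title and abstract:\n{record}\n\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE."
    )

    def screen_record(criteria: str, record: str) -> bool:
        """Return True when the model votes to include the citation."""
        reply = client.chat.completions.create(
            model="gpt-4-0125-preview",  # model used for prompt development
            temperature=0,               # stable decisions across reruns
            messages=[{
                "role": "user",
                "content": TEMPLATE.format(criteria=criteria, record=record),
            }],
        )
        return "INCLUDE" in reply.choices[0].message.content.upper()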

Results: Optimized prompts using GPT4-0125-preview achieved a weighted sensitivity of 97.7% (range, 86.7% to 100%) and specificity of 85.2% (range, 68.3% to 95.9%) in abstract screening and weighted sensitivity of 96.5% (range, 89.7% to 100.0%) and specificity of 91.2% (range, 80.7% to 100%) in full-text screening across 10 SRs. In contrast, zero-shot prompts had poor sensitivity (49.0% abstract, 49.1% full-text). Across LLMs, Claude-3.5 (Anthropic) and GPT4 variants had similar performance, whereas Gemini Pro (Google) and GPT3.5 (OpenAI) models underperformed. Direct screening costs for 10 000 citations differed substantially: where single human abstract screening was estimated to require more than 83 hours and cost $1666.67 (USD), our LLM-based approach completed screening in under 1 day for $157.02.
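
The cost comparison follows directly from the abstract's own figures; the short calculation below reproduces it. The ~30-seconds-per-abstract human pace and the ~$20/hour rate are back-calculated from the reported 83-hour and $1666.67 figures, not stated independently.

    # Figures taken from the abstract; the per-abstract pace and hourly rate
    # are back-calculated from them, not reported independently.
    N = 10_000                    # citations screened
    HUMAN_COST = 1666.67          # single human screener, total (USD)
    LLM_COST = 157.02             # LLM-based approach, total (USD)

    human_hours = N * 30 / 3600   # implied ~30 s per abstract -> ~83.3 hours
    hourly_rate = HUMAN_COST / human_hours   # ~= $20/hour
    print(f"human: ${HUMAN_COST / N:.3f}/citation over {human_hours:.1f} h")
    print(f"LLM:   ${LLM_COST / N:.4f}/citation; "
          f"{HUMAN_COST / LLM_COST:.1f}x cheaper")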

Limitations: Further prompt optimizations may exist. Retrospective study. Convenience sample of SRs. Full-text screening evaluations were limited to free PubMed Central full-text articles.

Conclusion: A generic prompt template for abstract and full-text screening was developed that achieves high sensitivity and specificity and can be adapted to other SRs and LLMs. Our prompting innovations may be of value to SR investigators and to researchers conducting similar criteria-based tasks across the medical sciences.

Primary funding source: None.

Source journal: Annals of Internal Medicine (Medicine: Internal Medicine)
CiteScore: 23.90
Self-citation rate: 1.80%
Articles per year: 1136
Review turnaround: 3-8 weeks
Journal description: Established in 1927 by the American College of Physicians (ACP), Annals of Internal Medicine is the premier internal medicine journal. Annals of Internal Medicine's mission is to promote excellence in medicine, enable physicians and other health care professionals to be well-informed members of the medical community and society, advance standards in the conduct and reporting of medical research, and contribute to improving the health of people worldwide. To achieve this mission, the journal publishes a wide variety of original research, review articles, practice guidelines, and commentary relevant to clinical practice, health care delivery, public health, health care policy, medical education, ethics, and research methodology. In addition, the journal publishes personal narratives that convey the feeling and the art of medicine.
Latest articles in this journal:
In hypercholesterolemia, adding inclisiran to individually optimized lipid-lowering therapy improved LDL-C levels at 90 d.
In older adults, RSV prefusion F vaccine reduced hospitalization for RSV-related respiratory tract disease vs. no vaccine.
In adults at high risk for ventricular arrhythmia, treatment to increase potassium to high-normal levels improved a composite outcome.
Q&A: Past challenges prop up today's politics around gender-affirming care.
In symptomatic AF, nurse-led pre-ablation lifestyle treatment reduced repeat ablations or cardioversions at 1 y.