Christian Cao, Jason Sang, Rohit Arora, David Chen, Robert Kloosterman, Matthew Cecere, Jaswanth Gorla, Richard Saleh, Ian Drennan, Bijan Teja, Michael Fehlings, Paul Ronksley, Alexander A Leung, Dany E Weisz, Harriet Ware, Mairead Whelan, David B Emerson, Rahul K Arora, Niklas Bobrovitz
{"title":"为系统性综述中的大语言模型驱动筛选开发提示模板。","authors":"Christian Cao, Jason Sang, Rohit Arora, David Chen, Robert Kloosterman, Matthew Cecere, Jaswanth Gorla, Richard Saleh, Ian Drennan, Bijan Teja, Michael Fehlings, Paul Ronksley, Alexander A Leung, Dany E Weisz, Harriet Ware, Mairead Whelan, David B Emerson, Rahul K Arora, Niklas Bobrovitz","doi":"10.7326/ANNALS-24-02189","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Systematic reviews (SRs) are hindered by the initial rigorous article screen, which delays access to reliable information synthesis.</p><p><strong>Objective: </strong>To develop generic prompt templates for large language model (LLM)-driven abstract and full-text screening that can be adapted to different reviews.</p><p><strong>Design: </strong>Diagnostic test accuracy.</p><p><strong>Setting: </strong>48 425 citations were tested for abstract screening across 10 SRs. Full-text screening evaluated all 12 690 freely available articles from the original search. Prompt development used the GPT4-0125-preview model (OpenAI).</p><p><strong>Participants: </strong>None.</p><p><strong>Measurements: </strong>Large language models were prompted to include or exclude articles based on SR eligibility criteria. Model outputs were compared with original SR author decisions after full-text screening to evaluate performance (accuracy, sensitivity, and specificity).</p><p><strong>Results: </strong>Optimized prompts using GPT4-0125-preview achieved a weighted sensitivity of 97.7% (range, 86.7% to 100%) and specificity of 85.2% (range, 68.3% to 95.9%) in abstract screening and weighted sensitivity of 96.5% (range, 89.7% to 100.0%) and specificity of 91.2% (range, 80.7% to 100%) in full-text screening across 10 SRs. In contrast, zero-shot prompts had poor sensitivity (49.0% abstract, 49.1% full-text). Across LLMs, Claude-3.5 (Anthropic) and GPT4 variants had similar performance, whereas Gemini Pro (Google) and GPT3.5 (OpenAI) models underperformed. Direct screening costs for 10 000 citations differed substantially: Where single human abstract screening was estimated to require more than 83 hours and $1666.67 USD, our LLM-based approach completed screening in under 1 day for $157.02 USD.</p><p><strong>Limitations: </strong>Further prompt optimizations may exist. Retrospective study. Convenience sample of SRs. Full-text screening evaluations were limited to free PubMed Central full-text articles.</p><p><strong>Conclusion: </strong>A generic prompt for abstract and full-text screening achieving high sensitivity and specificity that can be adapted to other SRs and LLMs was developed. 
Our prompting innovations may have value to SR investigators and researchers conducting similar criteria-based tasks across the medical sciences.</p><p><strong>Primary funding source: </strong>None.</p>","PeriodicalId":7932,"journal":{"name":"Annals of Internal Medicine","volume":" ","pages":""},"PeriodicalIF":19.6000,"publicationDate":"2025-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Development of Prompt Templates for Large Language Model-Driven Screening in Systematic Reviews.\",\"authors\":\"Christian Cao, Jason Sang, Rohit Arora, David Chen, Robert Kloosterman, Matthew Cecere, Jaswanth Gorla, Richard Saleh, Ian Drennan, Bijan Teja, Michael Fehlings, Paul Ronksley, Alexander A Leung, Dany E Weisz, Harriet Ware, Mairead Whelan, David B Emerson, Rahul K Arora, Niklas Bobrovitz\",\"doi\":\"10.7326/ANNALS-24-02189\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Systematic reviews (SRs) are hindered by the initial rigorous article screen, which delays access to reliable information synthesis.</p><p><strong>Objective: </strong>To develop generic prompt templates for large language model (LLM)-driven abstract and full-text screening that can be adapted to different reviews.</p><p><strong>Design: </strong>Diagnostic test accuracy.</p><p><strong>Setting: </strong>48 425 citations were tested for abstract screening across 10 SRs. Full-text screening evaluated all 12 690 freely available articles from the original search. Prompt development used the GPT4-0125-preview model (OpenAI).</p><p><strong>Participants: </strong>None.</p><p><strong>Measurements: </strong>Large language models were prompted to include or exclude articles based on SR eligibility criteria. Model outputs were compared with original SR author decisions after full-text screening to evaluate performance (accuracy, sensitivity, and specificity).</p><p><strong>Results: </strong>Optimized prompts using GPT4-0125-preview achieved a weighted sensitivity of 97.7% (range, 86.7% to 100%) and specificity of 85.2% (range, 68.3% to 95.9%) in abstract screening and weighted sensitivity of 96.5% (range, 89.7% to 100.0%) and specificity of 91.2% (range, 80.7% to 100%) in full-text screening across 10 SRs. In contrast, zero-shot prompts had poor sensitivity (49.0% abstract, 49.1% full-text). Across LLMs, Claude-3.5 (Anthropic) and GPT4 variants had similar performance, whereas Gemini Pro (Google) and GPT3.5 (OpenAI) models underperformed. Direct screening costs for 10 000 citations differed substantially: Where single human abstract screening was estimated to require more than 83 hours and $1666.67 USD, our LLM-based approach completed screening in under 1 day for $157.02 USD.</p><p><strong>Limitations: </strong>Further prompt optimizations may exist. Retrospective study. Convenience sample of SRs. Full-text screening evaluations were limited to free PubMed Central full-text articles.</p><p><strong>Conclusion: </strong>A generic prompt for abstract and full-text screening achieving high sensitivity and specificity that can be adapted to other SRs and LLMs was developed. 
Our prompting innovations may have value to SR investigators and researchers conducting similar criteria-based tasks across the medical sciences.</p><p><strong>Primary funding source: </strong>None.</p>\",\"PeriodicalId\":7932,\"journal\":{\"name\":\"Annals of Internal Medicine\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":19.6000,\"publicationDate\":\"2025-02-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Annals of Internal Medicine\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.7326/ANNALS-24-02189\",\"RegionNum\":1,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"MEDICINE, GENERAL & INTERNAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Annals of Internal Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.7326/ANNALS-24-02189","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MEDICINE, GENERAL & INTERNAL","Score":null,"Total":0}
Development of Prompt Templates for Large Language Model-Driven Screening in Systematic Reviews.
Background: Systematic reviews (SRs) are hindered by the initial rigorous article screen, which delays access to reliable information synthesis.
Objective: To develop generic prompt templates for large language model (LLM)-driven abstract and full-text screening that can be adapted to different reviews.
Design: Diagnostic test accuracy.
Setting: 48 425 citations were tested for abstract screening across 10 SRs. Full-text screening evaluated all 12 690 freely available articles from the original search. Prompt development used the GPT4-0125-preview model (OpenAI).
Participants: None.
Measurements: Large language models were prompted to include or exclude articles based on SR eligibility criteria. Model outputs were compared with original SR author decisions after full-text screening to evaluate performance (accuracy, sensitivity, and specificity).
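To make the screening setup concrete, the sketch below shows one way an LLM can be prompted with a review's eligibility criteria to return an include/exclude decision for each citation, and how those decisions can be scored against the original reviewers' full-text decisions. This is a minimal illustration only: the prompt wording, the `screen_abstract` helper, and the use of the OpenAI chat completions API are assumptions for this example and do not reproduce the authors' optimized templates.

```python
# Minimal sketch of criteria-based abstract screening with an LLM.
# Assumptions (not from the paper): the prompt wording, the INCLUDE/EXCLUDE
# output format, and the OpenAI chat completions client are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = """You are screening citations for a systematic review.
Eligibility criteria:
{criteria}

Title and abstract:
{abstract}

Answer with a single word: INCLUDE or EXCLUDE."""


def screen_abstract(abstract: str, criteria: str,
                    model: str = "gpt-4-0125-preview") -> bool:
    """Return True if the model judges the citation eligible."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(criteria=criteria,
                                                     abstract=abstract)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("INCLUDE")


def sensitivity_specificity(predictions: list[bool],
                            reference: list[bool]) -> tuple[float, float]:
    """Score model decisions against the original reviewers' final decisions."""
    tp = sum(p and r for p, r in zip(predictions, reference))
    tn = sum((not p) and (not r) for p, r in zip(predictions, reference))
    fn = sum((not p) and r for p, r in zip(predictions, reference))
    fp = sum(p and (not r) for p, r in zip(predictions, reference))
    return tp / (tp + fn), tn / (tn + fp)
```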
Results: Optimized prompts using GPT4-0125-preview achieved a weighted sensitivity of 97.7% (range, 86.7% to 100%) and specificity of 85.2% (range, 68.3% to 95.9%) in abstract screening and weighted sensitivity of 96.5% (range, 89.7% to 100.0%) and specificity of 91.2% (range, 80.7% to 100%) in full-text screening across 10 SRs. In contrast, zero-shot prompts had poor sensitivity (49.0% abstract, 49.1% full-text). Across LLMs, Claude-3.5 (Anthropic) and GPT4 variants had similar performance, whereas Gemini Pro (Google) and GPT3.5 (OpenAI) models underperformed. Direct screening costs for 10 000 citations differed substantially: Where single human abstract screening was estimated to require more than 83 hours and $1666.67 USD, our LLM-based approach completed screening in under 1 day for $157.02 USD.
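The reported cost difference can be made concrete with a short back-of-the-envelope calculation. The per-citation figures below are derived from the totals given in the Results; they are our arithmetic for illustration, not values stated by the authors.

```python
# Per-citation costs derived from the Results totals (our arithmetic, not the paper's).
citations = 10_000

human_cost_usd = 1666.67
human_hours = 83
llm_cost_usd = 157.02

print(f"Human: ${human_cost_usd / citations:.4f} per citation "
      f"(~{human_hours * 3600 / citations:.0f} s each)")      # ≈ $0.1667, ~30 s
print(f"LLM:   ${llm_cost_usd / citations:.4f} per citation")  # ≈ $0.0157
print(f"Cost ratio: {human_cost_usd / llm_cost_usd:.1f}x")     # ≈ 10.6x
```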
Limitations: Further prompt optimizations may exist. Retrospective study. Convenience sample of SRs. Full-text screening evaluations were limited to free PubMed Central full-text articles.
Conclusion: A generic prompt for abstract and full-text screening achieving high sensitivity and specificity that can be adapted to other SRs and LLMs was developed. Our prompting innovations may have value to SR investigators and researchers conducting similar criteria-based tasks across the medical sciences.
Journal Introduction:
Established in 1927 by the American College of Physicians (ACP), Annals of Internal Medicine is the premier internal medicine journal. Annals of Internal Medicine's mission is to promote excellence in medicine, enable physicians and other health care professionals to be well-informed members of the medical community and society, advance standards in the conduct and reporting of medical research, and contribute to improving the health of people worldwide. To achieve this mission, the journal publishes a wide variety of original research, review articles, practice guidelines, and commentary relevant to clinical practice, health care delivery, public health, health care policy, medical education, ethics, and research methodology. In addition, the journal publishes personal narratives that convey the feeling and the art of medicine.