Can large language models replace humans in systematic reviews? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages

IF 5.0 · CAS Tier 2 (Biology) · JCR Q1 (Mathematical & Computational Biology) · Research Synthesis Methods · Pub Date: 2024-03-14 · DOI: 10.1002/jrsm.1715
Qusai Khraisha, Sophie Put, Johanna Kappenberg, Azza Warraitch, Kristin Hadfield
Journal: Research Synthesis Methods, Volume 15, Issue 4, pp. 616-626. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/jrsm.1715
Citations: 0

Abstract

Systematic reviews are vital for guiding practice, research and policy, although they are often slow and labour-intensive. Large language models (LLMs) could speed up and automate systematic reviews, but their performance in such tasks has yet to be comprehensively evaluated against humans, and no study has tested Generative Pre-Trained Transformer (GPT)-4, the biggest LLM so far. This pre-registered study uses a "human-out-of-the-loop" approach to evaluate GPT-4's capability in title/abstract screening, full-text review and data extraction across various literature types and languages. Although GPT-4 had accuracy on par with human performance in some tasks, results were skewed by chance agreement and dataset imbalance. Adjusting for these caused performance scores to drop across all stages: for data extraction, performance was moderate, and for screening, it ranged from none in highly balanced literature datasets (~1:1) to moderate in those datasets where the ratio of inclusion to exclusion in studies was imbalanced (~1:3). When screening full-text literature using highly reliable prompts, GPT-4's performance was more robust, reaching "human-like" levels. Although our findings indicate that, currently, substantial caution should be exercised if LLMs are being used to conduct systematic reviews, they also offer preliminary evidence that, for certain review tasks delivered under specific conditions, LLMs can rival human performance.
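The abstract notes that raw accuracy was inflated by chance agreement and dataset imbalance. Chance-adjusted metrics such as Cohen's kappa illustrate why: in a screening set where most records are excluded, a model that mostly says "exclude" agrees with the human screener by chance alone. A minimal sketch, with made-up numbers (not the study's data), assuming an imbalanced include:exclude ratio similar to the ~1:3 case described above:

```python
# Illustrative sketch (not from the paper): chance-adjusted agreement
# (Cohen's kappa) can deflate a seemingly high raw accuracy when the
# include/exclude labels in a screening dataset are imbalanced.

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two raters' include/exclude decisions."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's marginal label rates.
    expected = 0.0
    for label in set(labels_a) | set(labels_b):
        pa = labels_a.count(label) / n
        pb = labels_b.count(label) / n
        expected += pa * pb
    return (observed - expected) / (1 - expected)

# Imbalanced screening set (1:3 include:exclude): a model that usually
# answers "exclude" matches the human most of the time by chance alone.
human = ["exclude"] * 75 + ["include"] * 25
model = ["exclude"] * 75 + ["include"] * 5 + ["exclude"] * 20

raw_accuracy = sum(h == m for h, m in zip(human, model)) / len(human)
kappa = cohens_kappa(human, model)
print(f"raw accuracy = {raw_accuracy:.2f}, kappa = {kappa:.2f}")
# raw accuracy = 0.80, but kappa is only about 0.27
```

Here 80% raw agreement collapses to a kappa of roughly 0.27 once the 72.5% expected-by-chance agreement is removed, mirroring how the adjusted scores in the study dropped relative to raw accuracy.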

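The "human-out-of-the-loop" approach means the model screens each record on its own, with no human verifying individual decisions. A hypothetical sketch of such a title/abstract screening loop, assuming a generic `llm_decide` callable; the function names, prompt wording, and criteria below are invented for illustration and are not the study's actual prompts:

```python
# Hypothetical "human-out-of-the-loop" title/abstract screening loop.
# `llm_decide` stands in for any LLM call that maps a prompt string to a
# text reply; the criteria text is an invented example.

from dataclasses import dataclass

@dataclass
class Record:
    title: str
    abstract: str

CRITERIA = (
    "Include only randomized studies of school-based interventions "
    "reporting adolescent mental-health outcomes."  # example criteria
)

def build_prompt(record: Record) -> str:
    return (
        f"Screening criteria: {CRITERIA}\n\n"
        f"Title: {record.title}\nAbstract: {record.abstract}\n\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE."
    )

def screen(records, llm_decide):
    """Screen records without human review of individual decisions.

    llm_decide: callable mapping a prompt string to the model's reply.
    """
    decisions = {}
    for rec in records:
        reply = llm_decide(build_prompt(rec)).strip().upper()
        # Force the reply into a binary decision; anything else is flagged
        # rather than silently coerced.
        decisions[rec.title] = reply if reply in {"INCLUDE", "EXCLUDE"} else "UNCLEAR"
    return decisions
```

Constraining the model to a one-word answer and flagging anything else is one way to make such a pipeline auditable; the study's observation that "highly reliable prompts" improved full-text screening suggests prompt design matters as much as the loop itself.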

Source journal: Research Synthesis Methods
Categories: Mathematical & Computational Biology; Multidisciplinary Sciences
CiteScore: 16.90 · Self-citation rate: 3.10% · Articles per year: 75
Journal description: Research Synthesis Methods is a reputable, peer-reviewed journal that focuses on the development and dissemination of methods for conducting systematic research synthesis. Our aim is to advance the knowledge and application of research synthesis methods across various disciplines. Our journal provides a platform for the exchange of ideas and knowledge related to designing, conducting, analyzing, interpreting, reporting, and applying research synthesis. While research synthesis is commonly practiced in the health and social sciences, our journal also welcomes contributions from other fields to enrich the methodologies employed in research synthesis across scientific disciplines. By bridging different disciplines, we aim to foster collaboration and cross-fertilization of ideas, ultimately enhancing the quality and effectiveness of research synthesis methods. Whether you are a researcher, practitioner, or stakeholder involved in research synthesis, our journal strives to offer valuable insights and practical guidance for your work.