Can large language models fully automate or partially assist paper selection in systematic reviews?

Haichao Chen, Zehua Jiang, Xinyu Liu, Can Can Xue, Samantha Min Er Yew, Bin Sheng, Ying-Feng Zheng, Xiaofei Wang, You Wu, Sobha Sivaprasad, Tien Yin Wong, Varun Chaudhary, Yih Chung Tham

British Journal of Ophthalmology, published 15 January 2025. DOI: 10.1136/bjo-2024-326254
Abstract
Background/aims: Large language models (LLMs) have substantial potential to enhance the efficiency of academic research. The accuracy and performance of LLMs in systematic reviews, a core part of evidence building, have yet to be studied in detail.

Methods: We introduced two LLM-based approaches to systematic review: an LLM-enabled fully automated approach (LLM-FA) utilising three different GPT-4 plugins (Consensus GPT, Scholar GPT and GPT web browsing modes) and an LLM-facilitated semi-automated approach (LLM-SA) using GPT-4's Application Programming Interface (API). We benchmarked these approaches using three published systematic reviews that reported the prevalence of diabetic retinopathy across different populations (general population, pregnant women and children).

Results: The three published reviews consisted of 98 papers in total. Across these three reviews, in the LLM-FA approach, Consensus GPT correctly identified 32.7% (32 out of 98) of papers, while Scholar GPT and GPT-4's web browsing modes identified only 19.4% (19 out of 98) and 6.1% (6 out of 98), respectively. In contrast, the LLM-SA approach not only successfully included 82.7% (81 out of 98) of these papers but also correctly excluded 92.2% of 4497 irrelevant papers.

Conclusions: Our findings suggest LLMs are not yet capable of autonomously identifying and selecting relevant papers in systematic reviews. However, they hold promise as an assistive tool to improve the efficiency of the paper selection process in systematic reviews.

Data availability: all data and code are available upon reasonable request by emailing thamyc@nus.edu.sg.
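The LLM-SA approach described above screens candidate papers one at a time through GPT-4's API, asking the model to apply the review's inclusion criteria to each title and abstract. A minimal sketch of such a screening step is shown below; the prompt wording, criteria text and helper names are illustrative assumptions, not the authors' exact implementation, and the model name is passed as a parameter since the paper does not specify the API configuration.

```python
def build_screening_prompt(title, abstract, criteria):
    """Assemble an include/exclude prompt for one candidate paper.

    The exact instructions the authors used are not published; this wording
    is a hypothetical stand-in that forces a one-word verdict for easy parsing.
    """
    return (
        "You are screening papers for a systematic review.\n"
        f"Inclusion criteria: {criteria}\n\n"
        f"Title: {title}\n"
        f"Abstract: {abstract}\n\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE."
    )


def parse_decision(reply):
    """Map the model's free-text reply to a boolean include flag."""
    return reply.strip().upper().startswith("INCLUDE")


def screen_paper(client, title, abstract, criteria, model="gpt-4"):
    """One API call per paper; returns True if the model votes to include.

    `client` is an OpenAI-style client object (e.g. openai.OpenAI()), injected
    so the prompt-building and parsing logic stays testable without a network call.
    """
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": build_screening_prompt(title, abstract, criteria)}],
    ).choices[0].message.content
    return parse_decision(reply)
```

In a semi-automated workflow like LLM-SA, the returned flag would pre-sort the candidate pool for a human reviewer rather than replace the final eligibility judgment.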
About the journal:
The British Journal of Ophthalmology (BJO) is an international peer-reviewed journal for ophthalmologists and visual science specialists. BJO publishes clinical investigations, clinical observations, and clinically relevant laboratory investigations related to ophthalmology. It also publishes major reviews and manuscripts covering regional issues in a global context.