
Latest Publications: Cochrane Evidence Synthesis and Methods

Using a Large Language Model (ChatGPT-4o) to Assess the Risk of Bias in Randomized Controlled Trials of Medical Interventions: Interrater Agreement With Human Reviewers
Pub Date : 2025-09-10 DOI: 10.1002/cesm.70048
Christopher James Rose, Julia Bidonde, Martin Ringsten, Julie Glanville, Thomas Potrebny, Chris Cooper, Ashley Elizabeth Muller, Hans Bugge Bergsund, Jose F. Meneses-Echavez, Rigmor C. Berg

Background

Risk of bias (RoB) assessment is a highly skilled task that is time-consuming and subject to human error. RoB automation tools have previously used machine learning models built using relatively small task-specific training sets. Large language models (LLMs; e.g., ChatGPT) are complex models built using non-task-specific Internet-scale training sets. They demonstrate human-like abilities and might be able to support tasks like RoB assessment.

Methods

Following a published peer-reviewed protocol, we randomly sampled 100 Cochrane reviews. New or updated reviews that evaluated medical interventions, included ≥ 1 eligible trial, and presented human consensus assessments using Cochrane RoB1 or RoB2 were eligible. We excluded reviews performed under emergency conditions (e.g., COVID-19), and those on public health or welfare. We randomly sampled one trial from each review. Trials using individual- or cluster-randomized designs were eligible. We extracted human consensus RoB assessments of the trials from the reviews, and methods texts from the trials. We used 25 review-trial pairs to develop a ChatGPT prompt to assess RoB using trial methods text. We used the prompt and the remaining 75 review-trial pairs to estimate human-ChatGPT agreement for “Overall RoB” (primary outcome) and “RoB due to the randomization process”, and ChatGPT-ChatGPT (intrarater) agreement for “Overall RoB”. We used ChatGPT-4o (February 2025) throughout.

Results

The 75 reviews were sampled from 35 Cochrane review groups, and all used RoB1. The 75 trials spanned five decades, and all but one were published in English. Human-ChatGPT agreement for “Overall RoB” assessment was 50.7% (95% CI 39.3%–62.0%), substantially higher than expected by chance (p = 0.0015). Human-ChatGPT agreement for “RoB due to the randomization process” was 78.7% (95% CI 69.4%–88.0%; p < 0.001). ChatGPT-ChatGPT agreement was 74.7% (95% CI 64.8%–84.6%; p < 0.001).
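As a worked illustration of the headline estimate, the following minimal Python sketch (not the authors' code) back-calculates the agreement count (38 of 75 ≈ 50.7%), reproduces the reported Wald 95% CI, and runs an exact one-sided binomial test against an illustrative chance-agreement probability of 1/3 (three RoB categories); the paper's actual null model is an assumption here.

    # A minimal sketch, not the authors' code: reproduce the headline agreement
    # estimate from reported counts. 38/75 agreements is back-calculated from
    # the reported 50.7%; the chance probability of 1/3 (three RoB categories)
    # is an illustrative assumption, not necessarily the paper's null model.
    from math import sqrt
    from scipy.stats import binomtest

    agreements, n = 38, 75
    p_hat = agreements / n
    se = sqrt(p_hat * (1 - p_hat) / n)          # Wald standard error
    lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
    print(f"agreement = {p_hat:.1%} (95% CI {lo:.1%}-{hi:.1%})")

    # One-sided exact binomial test against chance agreement
    result = binomtest(agreements, n, p=1/3, alternative="greater")
    print(f"p-value vs. chance = {result.pvalue:.4f}")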

Conclusions

ChatGPT appears to have some ability to assess RoB and is unlikely to be guessing or “hallucinating”. The estimated agreement for “Overall RoB” is well above estimates of agreement reported for some human reviewers, but below the highest estimates. LLM-based systems for assessing RoB may be able to help streamline and improve evidence synthesis production.

Citations: 0
Artificial Intelligence Search Tools for Evidence Synthesis: Comparative Analysis and Implementation Recommendations
Pub Date : 2025-09-08 DOI: 10.1002/cesm.70045
Robin Featherstone, Melissa Walter, Danielle MacDougall, Eric Morenz, Sharon Bailey, Robyn Butcher, Caitlyn Ford, Hannah Loshak, David Kaunelis

To inform implementation recommendations for novel or emerging technologies, Research Information Services at Canada's Drug Agency conducted a multimodal research project involving a literature review, a retrospective comparative analysis, and a focus group on 3 Artificial Intelligence (AI) or automation tools for information retrieval (AI search tools): Lens.org, SpiderCite, and Microsoft Copilot. For the comparative analysis, the customary information retrieval practices used at Canada's Drug Agency served as our reference standard for comparison, and we used the eligible studies of 7 completed projects to measure tool performance. For searches conducted with our usual practice approaches and with each of the 3 tools, we calculated sensitivity/recall, number needed to read (NNR), time to search and screen, unique contributions, and the likely impact of the unique contributions on the projects’ findings. Our investigation confirmed that AI search tools have inconsistent and variable performance for the range of information retrieval tasks performed at Canada's Drug Agency. Implementation recommendations from this study informed a “fit for purpose” approach where Information Specialists leverage AI search tools for specific tasks or project types.
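The performance metrics named above can be computed from simple retrieval counts. The sketch below uses assumed standard definitions and hypothetical numbers, not the agency's code or data; NNR is taken as records screened per relevant record found, i.e., 1/precision.

    # A minimal sketch under assumed standard definitions (not the agency's
    # code): recall and number needed to read (NNR) from retrieval counts.
    def search_metrics(retrieved: int, relevant_retrieved: int, relevant_total: int) -> dict:
        recall = relevant_retrieved / relevant_total      # sensitivity/recall
        precision = relevant_retrieved / retrieved
        nnr = retrieved / relevant_retrieved              # number needed to read
        return {"recall": recall, "precision": precision, "NNR": nnr}

    # Hypothetical example: a tool returns 400 records, 20 of them relevant,
    # against a reference standard of 25 relevant records from usual practice.
    print(search_metrics(retrieved=400, relevant_retrieved=20, relevant_total=25))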

Citations: 0
Exploring the Role of Artificial Intelligence in Evidence Synthesis: Insights From the CORE Information Retrieval Forum 2025
Pub Date : 2025-09-07 DOI: 10.1002/cesm.70049
Claire H. Eastaugh, Madeleine Still, Fiona R. Beyer, Sheila A. Wallace, Hannah O'Keefe

Introduction

Information retrieval is essential for evidence synthesis, but developing search strategies can be labor-intensive and time-consuming. Automating these processes would be of benefit and interest, though it is unclear if Information Specialists (IS) are willing to adopt artificial intelligence (AI) methodologies or how they currently use them. In January 2025, the NIHR Innovation Observatory and NIHR Methodology Incubator for Applied Health and Care Research co-sponsored the inaugural CORE Information Retrieval Forum, where attendees discussed AI's role in information retrieval.

Methods

The CORE Information Retrieval Forum hosted a Knowledge Café. Participation was voluntary, and attendees could choose one of six event-themed discussion tables including AI. To support each discussion, a QR code linking to a virtual collaboration tool (Padlet; padlet.com) and a poster in the exhibition space were available throughout the day for attendee contributions.

Results

The CORE Information Retrieval Forum was attended by 131 IS from nine different types of organizations, with most from the UK and ten countries represented overall. Among the six discussion points available in the Knowledge Café, the AI table was the most popular, receiving the highest number of contributions (n = 49). Following the Forum, contributions to the AI topic were categorized into four themes: critical perception (n = 21), current uses (n = 19), specific tools (n = 2), and training wants/needs (n = 7).

Conclusions

While there are critical perspectives on the integration of AI in the IS space, this is not due to a reluctance to adapt and adopt but from a need for structure, education, training, ethical guidance, and systems to support the responsible use and transparency of AI. There is interest in automating repetitive and time-consuming tasks, but attendees reported a lack of appropriate supporting tools. More work is required to identify the suitability of currently available tools and their potential to complement the work conducted by IS.

Citations: 0
Human Versus Artificial Intelligence: Comparing Cochrane Authors' and ChatGPT's Risk of Bias Assessments
Pub Date : 2025-08-31 DOI: 10.1002/cesm.70044
Petek Eylul Taneri

Introduction

Systematic reviews and meta-analyses synthesize randomized trial data to guide clinical decisions but require significant time and resources. Artificial intelligence (AI) offers a promising solution to streamline evidence synthesis, aiding study selection, data extraction, and risk of bias assessment. This study aims to evaluate the performance of ChatGPT-4o in assessing the risk of bias in randomised controlled trials (RCTs) using the Risk of Bias 2 (RoB 2) tool, comparing its results with those conducted by human reviewers in Cochrane Reviews.

Methods

A sample of Cochrane Reviews utilizing the RoB 2 tool was identified through the Cochrane Database of Systematic Reviews (CDSR). Protocols, qualitative systematic reviews, and reviews employing alternative risk of bias assessment tools were excluded. The study utilized ChatGPT-4o to assess the risk of bias using a structured set of prompts corresponding to the RoB 2 domains. The agreement between ChatGPT-4o and consensus-based human reviewer assessments was evaluated using weighted kappa statistics. Additionally, accuracy, sensitivity, specificity, positive predictive value, and negative predictive value were calculated. All analyses were performed using R Studio (version 4.3.0).
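The study ran its analyses in R; as a rough Python illustration of the same statistics, the sketch below computes a linearly weighted kappa and per-class sensitivity on hypothetical RoB 2 judgments (the data and labels are invented solely to exercise the code).

    # A minimal Python sketch of the statistics described above (the study
    # itself used R); judgments are hypothetical, chosen only to run the code.
    from sklearn.metrics import cohen_kappa_score

    labels = ["low", "some concerns", "high"]   # ordered RoB 2 categories
    human   = ["low", "high", "some concerns", "low", "high", "low"]
    chatgpt = ["low", "some concerns", "some concerns", "low", "high", "low"]

    kappa = cohen_kappa_score(human, chatgpt, labels=labels, weights="linear")
    print(f"linearly weighted kappa = {kappa:.2f}")

    # Per-class sensitivity: of the trials humans put in a class, the share
    # ChatGPT also put in that class (e.g., "high" for high-risk studies).
    def sensitivity(truth, pred, positive):
        hits = sum(t == positive and p == positive for t, p in zip(truth, pred))
        return hits / sum(t == positive for t in truth)

    print(f"sensitivity, high risk = {sensitivity(human, chatgpt, 'high'):.2f}")
    print(f"sensitivity, low risk  = {sensitivity(human, chatgpt, 'low'):.2f}")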

Results

A total of 42 Cochrane Reviews were screened, yielding a final sample of eight eligible reviews comprising 84 RCTs. The primary outcome of each included review was selected for risk of bias assessment. ChatGPT-4o demonstrated moderate agreement with human reviewers for the overall risk of bias judgments (weighted kappa = 0.51, 95% CI: 0.36–0.66). Agreement varied across domains, ranging from fair (κ = 0.20 for selection of the reported results) to moderate (κ = 0.59 for measurement of outcomes). ChatGPT-4o exhibited a sensitivity of 53% for identifying high-risk studies and a specificity of 99% for classifying low-risk studies.

Conclusion

This study shows that ChatGPT-4o can perform risk of bias assessments using RoB 2 with fair to moderate agreement with human reviewers. While AI-assisted risk of bias assessment remains imperfect, advancements in prompt engineering and model refinement may enhance performance. Future research should explore standardised prompts and investigate interrater reliability among human reviewers to provide a more robust comparison.

Citations: 0
Artificial Intelligence and Automation in Evidence Synthesis: An Investigation of Methods Employed in Cochrane, Campbell Collaboration, and Environmental Evidence Reviews
Pub Date : 2025-08-28 DOI: 10.1002/cesm.70046
Kristen L. Scotti, Sarah Young, Melanie A. Gainey, Haoyong Lan

Automation, including Machine Learning (ML), is increasingly being explored to reduce the time and effort involved in evidence syntheses, yet its adoption and reporting practices remain under-examined across disciplines (e.g., health sciences, education, and policy). This review assesses the use of automation, including ML-based techniques, in 2271 evidence syntheses published between 2017 and 2024 in the Cochrane Database of Systematic Reviews, and the journals Campbell Systematic Reviews, and Environmental Evidence. We focus on automation across four review steps: search, screening, data extraction, and analysis/synthesis. We systematically identified eligible studies from the three sources and developed a classification system to distinguish between manual, rules-based, ML-enabled, and ML-embedded tools. We then extracted data on tool use, ML integration, reporting practices, motivations for (and against) ML adoption, and the application of stopping criteria for ML-assisted screening. Only ~5% of studies explicitly reported using ML, with most applications limited to screening tasks. Although ~12% employed ML-enabled tools, ~90% of those did not clarify whether ML functionalities were actually utilized. Living reviews showed higher relative ML integration (~15%), but overall uptake remains limited. Previous work has shown that common barriers to broader adoption included limited guidance, low user awareness, and concerns over reliability. Despite ML's potential to streamline evidence syntheses, its integration remains limited and inconsistently reported. Improved transparency, clearer reporting standards, and greater user training are needed to support responsible adoption. As the research literature grows, automation will become increasingly essential—but only if challenges in usability, reproducibility, and trust are addressed.

Citations: 0
Meta-Analysis Using Time-to-Event Data: A Tutorial
Pub Date : 2025-08-26 DOI: 10.1002/cesm.70041
Ashma Krishan, Kerry Dwan

This tutorial focuses on trials that assess time-to-event outcomes. We explain what hazard ratios are, how to interpret them, and demonstrate how to include time-to-event data in a meta-analysis. Examples are presented to aid understanding. Accompanying the tutorial is a micro learning module in which we demonstrate a few approaches and give you the chance to practice calculating the hazard ratio; it is available at https://links.cochrane.org/cesm/tutorials/time-to-event-data.
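For readers who want to see the arithmetic behind pooling hazard ratios, here is a minimal fixed-effect inverse-variance sketch in Python (the HRs and CIs are illustrative values, not data from the tutorial): log-transform each HR, recover its standard error from the 95% CI, weight by inverse variance, then exponentiate back.

    # A minimal sketch of fixed-effect inverse-variance pooling of hazard
    # ratios on the log scale; illustrative inputs, not tutorial data.
    from math import exp, log, sqrt

    trials = [  # (HR, lower 95% CI, upper 95% CI) per trial
        (0.75, 0.60, 0.94),
        (0.85, 0.70, 1.03),
        (0.64, 0.45, 0.91),
    ]

    total_weight = weighted_sum = 0.0
    for hr, lo, hi in trials:
        se = (log(hi) - log(lo)) / (2 * 1.96)   # SE recovered from the 95% CI
        w = 1 / se**2                           # inverse-variance weight
        total_weight += w
        weighted_sum += w * log(hr)

    pooled = weighted_sum / total_weight
    se_pooled = sqrt(1 / total_weight)
    print(f"pooled HR = {exp(pooled):.2f} "
          f"(95% CI {exp(pooled - 1.96 * se_pooled):.2f}"
          f"-{exp(pooled + 1.96 * se_pooled):.2f})")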

Citations: 0
Lifecycles of Cochrane Systematic Reviews (2003–2024): A Bibliographic Study
Pub Date : 2025-08-17 DOI: 10.1002/cesm.70043
Shiyin Li, Chong Wu, Zichen Zhang, Mengli Xiao, Mohammad Hassan Murad, Lifeng Lin

Background and Objectives

The relevance of Cochrane systematic reviews depends on timely completion and updates. This study aimed to empirically assess the lifecycles of Cochrane reviews published from 2003 to 2024, including transitions from protocol to review, update patterns, and withdrawals.

Methods

We extracted data from Cochrane Library publications between 2003 and 2024. Each review topic was identified using a unique six-digit DOI-based ID. We recorded protocol publication, review publication, updates, and withdrawals (i.e., removed from the Cochrane Library for editorial or procedural reasons), calculating time intervals between stages and conducting subgroup analyses by review type.
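As an illustration of the interval calculation described above, a minimal Python sketch (hypothetical dates, not the study's data or code) computing a median protocol-to-review time in months:

    # A minimal sketch (hypothetical dates, not study data) of computing the
    # median protocol-to-review interval in months from publication dates.
    from datetime import date
    from statistics import median

    pairs = [  # (protocol published, review published)
        (date(2015, 1, 10), date(2017, 3, 2)),
        (date(2018, 6, 5), date(2020, 1, 20)),
        (date(2019, 2, 14), date(2021, 9, 1)),
    ]
    months = [(review - protocol).days / 30.44 for protocol, review in pairs]
    print(f"median protocol-to-review interval = {median(months):.1f} months")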

Results

Of 8137 protocols, 71.9% progressed to reviews (median 25.7 months), 2.4% were updated during the protocol stage, and 10.0% were withdrawn. Among 8477 reviews, 64.3% were never updated by the time of our analysis; for those updated at least once, the median interval between updates was 57.2 months. Withdrawal occurred in 2.5% of reviews (median 67.6 months post-publication). Subgroup analyses showed variation across review types; diagnostic and qualitative reviews tended to have longer protocol-to-review times than other types of reviews.

Conclusions

Cochrane reviews show long development and update intervals, with variation by review type. Greater use of automation and targeted support may improve review efficiency and timeliness.

Citations: 0
Optimizing Research Impact: A Toolkit for Stakeholder-Driven Prioritization of Systematic Review Topics
Pub Date : 2025-08-14 DOI: 10.1002/cesm.70039
Dyon Hoekstra, Stefan K. Lhachimi

Intro

The prioritization of topics for evidence synthesis is crucial for maximizing the relevance and impact of systematic reviews. This article introduces a comprehensive toolkit designed to facilitate a structured, multi-step framework for engaging a broad spectrum of stakeholders in the prioritization process, ensuring the selection of topics that are both relevant and applicable.

Methods

We detail an open-source framework comprising 11 coherent steps, segmented into scoping and Delphi stages, to offer a flexible and resource-efficient approach for stakeholder involvement in research priority setting.

Results

The toolkit provides ready-to-use tools for the development, application, and analysis of the framework, including templates for online surveys developed with free open-source software, ensuring ease of replication and adaptation in various research fields. The framework supports the transparent and systematic development and assessment of systematic review topics, with a particular focus on stakeholder-refined assessment criteria.

Conclusion

Our toolkit enhances the transparency and ease of the priority-setting process. Targeted primarily at organizations and research groups seeking to allocate resources for future research based on stakeholder needs, this toolkit stands as a valuable resource for informed decision-making in research prioritization.

Citations: 0
ChatGPT-4o Compared With Human Researchers in Writing Plain-Language Summaries for Cochrane Reviews: A Blinded, Randomized Non-Inferiority Controlled Trial
Pub Date : 2025-07-28 DOI: 10.1002/cesm.70037
Dagný Halla Ágústsdóttir, Jacob Rosenberg, Jason Joe Baker
Introduction

Plain language summaries in Cochrane reviews are designed to present key information in a way that is understandable to individuals without a medical background. Despite Cochrane's author guidelines, these summaries often fail to achieve their intended purpose. Studies show that they are generally difficult to read and vary in their adherence to the guidelines. Artificial intelligence is increasingly used in medicine and academia, with its potential being tested in various roles. This study aimed to investigate whether ChatGPT-4o could produce plain language summaries that are as good as the already published plain language summaries in Cochrane reviews.

Methods

We conducted a randomized, single-blinded study with a total of 36 plain language summaries: 18 human written and 18 ChatGPT-4o generated summaries, where both versions were for the same Cochrane reviews. The sample size was calculated to be 36, and each summary was evaluated four times: twice by members of a Cochrane editorial group and twice by laypersons. The summaries were assessed in three ways. First, all assessors rated the summaries for informativeness, readability, and level of detail on a Likert scale from 1 to 10; they were also asked whether they would submit the summary and whether they could identify who had written it. Second, members of a Cochrane editorial group assessed the summaries using a checklist based on Cochrane's guidelines for plain language summaries, with scores ranging from 0 to 10. Finally, the readability of the summaries was analyzed using objective tools such as Lix and Flesch-Kincaid scores. Randomization and allocation to either ChatGPT-4o or human written summaries were conducted using random.org's random sequence generator, and assessors were blinded to the authorship of the summaries.

Results

The plain language summaries generated by ChatGPT-4o scored 1 point higher on information (p < .001) and level of detail (p = .004), and 2 points higher on readability (p = .002) compared with human written summaries. Lix and Flesch-Kincaid scores were high for both groups of summaries, though ChatGPT's were slightly easier to read (p < .001). Assessors found it difficult to distinguish between ChatGPT and human written summaries, with only 20% correctly identifying ChatGPT generated text. ChatGPT summaries were preferred for submission over the human written summaries (64% vs. 36%, p < .001).

Conclusion

ChatGPT-4o shows promise for creating plain language summaries for Cochrane reviews, performing at least as well as human authors and in some cases slightly better. This study suggests that ChatGPT-4o can serve as a tool for drafting understandable plain language summaries for Cochrane reviews at a quality approaching or matching that of human authors. Trial registration and the protocol are available at https://osf.io/aq6r5.
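Both readability indices used in this trial follow published formulas; the sketch below implements them under those assumed standard formulas (not the study's own code), with a rough vowel-group syllable heuristic, so scores are approximate.

    # A minimal sketch of the two readability indices named above. The
    # syllable count is a rough vowel-group heuristic, so Flesch-Kincaid
    # grades are approximate. Not the study's own code.
    import re

    def lix(text: str) -> float:
        words = re.findall(r"[A-Za-z]+", text)
        sentences = max(1, len(re.findall(r"[.!?]", text)))
        long_words = sum(len(w) > 6 for w in words)
        return len(words) / sentences + 100 * long_words / len(words)

    def flesch_kincaid_grade(text: str) -> float:
        words = re.findall(r"[A-Za-z]+", text)
        sentences = max(1, len(re.findall(r"[.!?]", text)))
        syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
        return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

    sample = ("This review compared two treatments. "
              "The evidence was of moderate certainty.")
    print(f"Lix = {lix(sample):.1f}, FK grade = {flesch_kincaid_grade(sample):.1f}")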
Citations: 0
Analyzing the Utility of Openalex to Identify Studies for Systematic Reviews: Methods and a Case Study
Pub Date : 2025-07-24 DOI: 10.1002/cesm.70038
Claire Stansfield, Hossein Dehdarirad, James Thomas, Silvy Mathew, Alison O'Mara-Eves

Open access scholarly resources have potential to simplify the literature search process, support more equitable access to research knowledge, and reduce biases from lack of access to relevant literature. OpenAlex is the world's largest open access database of academic research. However, it is not known whether OpenAlex is suitable for comprehensively identifying research for systematic reviews. We present an approach to measure the utility of OpenAlex as part of undertaking a systematic review, and report findings in the context of a systematic map on the implementation of diabetic eye screening. Procedures were developed to investigate OpenAlex's content coverage and capture, focusing on: (1) availability of relevant research records; (2) retrieval of relevant records from a Boolean search of OpenAlex; (3) retrieval of relevant records from combining a PubMed Boolean search with a citations and related-items search of OpenAlex; and (4) efficient estimation of relevant records not identified elsewhere. The searches were conducted in July 2024 and repeated in March 2025 following removal of certain closed access abstracts from the OpenAlex data set. The original systematic review searches yielded 131 relevant records, and 128 (98%) of these are present in OpenAlex. OpenAlex Boolean searches retrieved 126 (96%) of the 131 records, and partial screening yielded two relevant records not previously known to the review team. Retrieval was reduced to 123 (94%) when the searches were repeated in March 2025. However, the volume of records from the OpenAlex Boolean search was considerably greater than assessed for the original systematic map. Combining a Boolean search from PubMed with OpenAlex network graph searches yielded 93% recall. It is feasible and useful to investigate the use of OpenAlex as a key information resource for health topics. This approach can be modified to investigate OpenAlex for other systematic reviews. However, the volume of records obtained from searches is larger than that obtained from conventional sources, something that could be reduced using machine learning. Further investigations are needed, and our approach should be replicated in other reviews.

Citations: 0