
Cochrane Evidence Synthesis and Methods: Latest Publications

Retiring the Term “Weighted Mean Difference” in Contemporary Evidence Synthesis
Pub Date : 2025-09-11 DOI: 10.1002/cesm.70051
Lifeng Lin, Xing Xing, Wenshan Han, Jiayi Tong
Evidence synthesis frequently involves quantitative analyses of continuous outcomes. A cross-sectional study examining Cochrane systematic reviews found that 6672 of 22,453 meta-analyses (29.7%) involved continuous outcomes [1]. The primary effect measures employed in meta-analyses of continuous outcomes are the mean difference (MD) and standardized mean difference (SMD) [2]. The MD is appropriately applied when all included studies measure outcomes using identical scales (e.g., body weight in kilograms). In contrast, the SMD serves as a solution when studies utilize different measurement scales (e.g., varied questionnaire scoring methods). Although alternative measures (e.g., the ratio of means) exist [3], their application remains relatively infrequent.

Despite this conceptual clarity, the term “weighted mean difference” (WMD) appears frequently in the systematic review literature [4], which can lead to confusion about its relationship to the MD. In this article, we first clarify the distinction between MD and WMD, then describe the historical factors underlying the term's adoption and persistence, discuss why contemporary methods render it unnecessary, illustrate examples of misuse, and conclude with practical recommendations for clearer reporting.

The MD represents the straightforward difference between group means (e.g., intervention vs. control) for a continuous outcome. Although the true MD value relates to unknown population-level differences, practical research relies on sample estimates from individual studies. Meta-analysis systematically synthesizes these study-level MD estimates to derive an overall summary effect across studies.

The term WMD emerged historically to emphasize the weighted averaging process of meta-analyses, wherein each study contributes a sample MD weighted by its statistical precision (i.e., inverse variance) [5]. Typically, larger studies with smaller variances or narrower confidence intervals are assigned greater weights. Traditional meta-analytical methods, performed through either fixed-effect (also known as common-effect) or random-effects models, follow this inverse-variance weighting principle. Under fixed-effect models, study weights directly reflect the inverse of their variances, whereas random-effects models incorporate both within-study and between-study variances.

To contextualize the widespread adoption of WMD, we conducted a brief literature search using Google Scholar on June 12, 2025. Using exact-phrase queries in quotation marks, for each calendar year from 1990 to 2024, we recorded the counts for “weighted mean difference” AND “systematic review” and separately for “systematic review,” then calculated the yearly proportion (Figure 1). Google Scholar indexes titles, abstracts, and, when available, full texts, so counts reflect occurrences anywhere in the indexed record, and these counts are approximate.
We did not screen individual records for correct versus incorrect usage, because our goal was to describe the term's prevalence rather than to quantify misuse. We therefore charted the evolution of usage over time as the proportions reported in Figure 1. This analysis shows a marked increase in the use of WMD around 1996, closely following the establishment of the Cochrane Database of Systematic Reviews (CDSR) in April 1995. The prominent role of Cochrane reviews likely contributed greatly to the term's spread. Chapter 6.5 of the latest Cochrane Handbook [6] confirms that the term “weighted mean difference” was prevalent in early versions of the CDSR, and since at least 2008 Handbook versions have carried a caution along these lines: “Analyses based on this effect measure were historically termed [WMD] analyses in the [CDSR]. This name is potentially confusing: although the meta-analysis computes a weighted average of these differences in means, no weighting is involved in the calculation of a statistical summary of a single study. Furthermore, all meta-analyses involve a weighted combination of estimates, yet we do not use the word ‘weighted’ when referring to other methods.” Another factor possibly contributing to the term's continued use is a frequently cited 2020 statement by Andrade that the pooled MD is “more accurately described as a weighted mean difference, or WMD.” Although this explanation is not technically incorrect as a description of the statistical process behind meta-analytic pooling, it may have inadvertently encouraged broader or careless use of the term WMD.

Despite existing cautions about WMD in the literature, Figure 1 illustrates its continued widespread use. Specifically, while the total number of systematic review publications peaked around 2018 and then declined (Figure 1C), the number and proportion of publications mentioning WMD continued to rise through 2024 (Figure 1A,B). Although the term is not misused in every instance, this trend suggests that existing cautions have had limited impact and underscores the value of clearer terminology. These historical and descriptive observations motivate attention to current analytical practice and terminology, as discussed below.

The explicit emphasis on weighting inherent in the term “WMD” can be misleading, because weighting underlies traditional meta-analytical methods regardless of outcome type (continuous, binary, time-to-event, etc.). Yet analogous terms such as “weighted odds ratio” or “weighted risk ratio” are rarely used. More general terms, such as “pooled MD,” “combined MD,” “overall MD,” or “meta-analytic MD,” may therefore be more appropriate and consistent. Moreover, advances in contemporary evidence synthesis methods often go beyond traditional inverse-variance weighting. Modern meta-analyses, including pairwise and network applications, are commonly fitted as one-stage generalized linear mixed or Bayesian hierarchical models, in which treatment effects are estimated jointly from the likelihood [8-10]. In these models, precision is accounted for through the model structure rather than through explicit study-specific inverse-variance weights. Consequently, when outcome scales are identical, summary estimates are more clearly reported as the pooled MD or another explicit descriptor such as the meta-analytic MD; the term “WMD” is unnecessary and may suggest a distinct effect measure. Nevertheless, imprecise usage persists in the current literature, as illustrated below.

Critically, the MD pertains specifically to individual study results, whereas the WMD specifically represents a meta-analytic synthesis. Despite this clear distinction, some systematic reviews incorrectly label individual study effects as WMDs [11-14]. For example, a recent systematic review published in JAMA inaccurately reported “pooled weighted mean differences” in systolic and diastolic blood pressure between screening and control groups. Here, the pooled MD inherently implies weighting, making the addition of “weighted” redundant and misleading. Furthermore, a recent article in the American Journal of Ophthalmology described a forest plot as presenting the “weighted mean difference (WMD) … in each study,” and another applied paper labeled a forest plot “WMD and 95% CI,” both implying study-level WMDs [13]. In addition, a chapter of a methods book explicitly states that “Table 3.4 gives the WMD and 95% confidence interval for each study.” Such misuse has persisted in systematic reviews over time, including many published in various high-impact journals. Labeling study-level effects as “WMDs” blurs the distinction between study-level MDs and the pooled meta-analytic estimate. For example, a figure caption stating “WMD of each study” may suggest that each study yields a WMD rather than an MD, potentially confusing evidence users about what has been pooled. Clearer labeling (e.g., “MD of each study” vs. “pooled MD”) reduces this risk and improves interpretability.

This article highlights the potential inappropriateness of the term “WMD,” particularly its incorrect application to individual studies in evidence synthesis. The term largely stems from early practices of Cochrane systematic reviews and no longer matches the needs and rigor of contemporary methods. We therefore recommend retiring the term “WMD” and adopting clearer terminology, using MD for study-level effects and pooled MD or meta-analytic MD for synthesized estimates, to promote clearer, methodologically sound communication.

Author contributions: Lifeng Lin: conceptualization, funding acquisition, investigation, writing - original draft, visualization, writing - review and editing. Xing Xing: investigation, writing - review and editing. Wenshan Han: data curation, writing - review and editing, visualization. Jiayi Tong: conceptualization, writing - review and editing. The authors declare no conflicts of interest.
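The inverse-variance weighting that the term alludes to is easy to make concrete. Below is a minimal Python sketch, with invented study data, of fixed-effect pooling of mean differences: each study's MD is weighted by the inverse of its squared standard error, and the pooled MD is the weighted average. This illustrates the general principle only; it is not code from the article.

```python
import math

# (mean difference, standard error) for three hypothetical studies
studies = [(-2.0, 0.8), (-1.2, 0.5), (-3.1, 1.1)]

# fixed-effect inverse-variance weights: larger, more precise studies count more
weights = [1.0 / se**2 for _, se in studies]
pooled_md = sum(w * md for (md, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))

print(f"pooled MD = {pooled_md:.2f}")
print(f"95% CI = {pooled_md - 1.96 * pooled_se:.2f} to {pooled_md + 1.96 * pooled_se:.2f}")
```

Note that the weighting happens only at the pooling step; each study's own MD is unweighted, which is exactly the authors' argument against attaching “weighted” to the study-level estimate.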
Citations: 0
Using a Large Language Model (ChatGPT-4o) to Assess the Risk of Bias in Randomized Controlled Trials of Medical Interventions: Interrater Agreement With Human Reviewers
Pub Date : 2025-09-10 DOI: 10.1002/cesm.70048
Christopher James Rose, Julia Bidonde, Martin Ringsten, Julie Glanville, Thomas Potrebny, Chris Cooper, Ashley Elizabeth Muller, Hans Bugge Bergsund, Jose F. Meneses-Echavez, Rigmor C. Berg

Background

Risk of bias (RoB) assessment is a highly skilled task that is time-consuming and subject to human error. RoB automation tools have previously used machine learning models built using relatively small task-specific training sets. Large language models (LLMs; e.g., ChatGPT) are complex models built using non-task-specific Internet-scale training sets. They demonstrate human-like abilities and might be able to support tasks like RoB assessment.

Methods

Following a published peer-reviewed protocol, we randomly sampled 100 Cochrane reviews. New or updated reviews that evaluated medical interventions, included ≥ 1 eligible trial, and presented human consensus assessments using Cochrane RoB1 or RoB2 were eligible. We excluded reviews performed under emergency conditions (e.g., COVID-19), and those on public health or welfare. We randomly sampled one trial from each review. Trials using individual- or cluster-randomized designs were eligible. We extracted human consensus RoB assessments of the trials from the reviews, and methods texts from the trials. We used 25 review-trial pairs to develop a ChatGPT prompt to assess RoB using trial methods text. We used the prompt and the remaining 75 review-trial pairs to estimate human-ChatGPT agreement for “Overall RoB” (primary outcome) and “RoB due to the randomization process”, and ChatGPT-ChatGPT (intrarater) agreement for “Overall RoB”. We used ChatGPT-4o (February 2025) throughout.

Results

The 75 reviews were sampled from 35 Cochrane review groups, and all used RoB1. The 75 trials spanned five decades, and all but one were published in English. Human-ChatGPT agreement for “Overall RoB” assessment was 50.7% (95% CI 39.3%–62.0%), substantially higher than expected by chance (p = 0.0015). Human-ChatGPT agreement for “RoB due to the randomization process” was 78.7% (95% CI 69.4%–88.0%; p < 0.001). ChatGPT-ChatGPT agreement was 74.7% (95% CI 64.8%–84.6%; p < 0.001).

Conclusions

ChatGPT appears to have some ability to assess RoB and is unlikely to be guessing or “hallucinating”. The estimated agreement for “Overall RoB” is well above estimates of agreement reported for some human reviewers, but below the highest estimates. LLM-based systems for assessing RoB may be able to help streamline and improve evidence synthesis production.
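To make the headline statistic concrete, here is a minimal Python sketch of raw percent agreement compared against a chance benchmark with a one-sided exact binomial test. The chance probability of 1/3 (uniform guessing over three overall RoB categories) and the agreement count of 38 of 75 are illustrative assumptions; the paper reports 50.7% agreement, but its exact null model is not described in the abstract.

```python
from math import comb

def agreement_vs_chance(n_agree: int, n_total: int, p_chance: float):
    """Percent agreement plus a one-sided exact binomial p-value: P(X >= n_agree | p_chance)."""
    prop = n_agree / n_total
    p_value = sum(
        comb(n_total, k) * p_chance**k * (1 - p_chance) ** (n_total - k)
        for k in range(n_agree, n_total + 1)
    )
    return prop, p_value

# e.g., 38 of 75 trials with matching "Overall RoB" judgments (~50.7%),
# against uniform guessing over three categories (hypothetical null)
prop, p = agreement_vs_chance(38, 75, 1 / 3)
print(f"agreement = {prop:.1%}, one-sided exact p = {p:.4f}")
```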

Citations: 0
Artificial Intelligence Search Tools for Evidence Synthesis: Comparative Analysis and Implementation Recommendations
Pub Date : 2025-09-08 DOI: 10.1002/cesm.70045
Robin Featherstone, Melissa Walter, Danielle MacDougall, Eric Morenz, Sharon Bailey, Robyn Butcher, Caitlyn Ford, Hannah Loshak, David Kaunelis

To inform implementation recommendations for novel or emerging technologies, Research Information Services at Canada's Drug Agency conducted a multimodal research project involving a literature review, a retrospective comparative analysis, and a focus group on 3 Artificial Intelligence (AI) or automation tools for information retrieval (AI search tools): Lens.org, SpiderCite, and Microsoft Copilot. For the comparative analysis, the customary information retrieval practices used at Canada's Drug Agency served as our reference standard for comparison, and we used the eligible studies of 7 completed projects to measure tool performance. For searches conducted with our usual practice approaches and with each of the 3 tools, we calculated sensitivity/recall, number needed to read (NNR), time to search and screen, unique contributions, and the likely impact of the unique contributions on the projects’ findings. Our investigation confirmed that AI search tools have inconsistent and variable performance for the range of information retrieval tasks performed at Canada's Drug Agency. Implementation recommendations from this study informed a “fit for purpose” approach where Information Specialists leverage AI search tools for specific tasks or project types.
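The performance metrics named above are simple ratios. The sketch below, using hypothetical numbers rather than the agency's data, shows how sensitivity/recall and the number needed to read are typically computed.

```python
def recall(found_eligible: int, total_eligible: int) -> float:
    """Share of the reference standard's eligible studies that the tool retrieved."""
    return found_eligible / total_eligible

def number_needed_to_read(records_screened: int, found_eligible: int) -> float:
    """Records screened per eligible study found; lower means less screening burden."""
    return records_screened / found_eligible

# hypothetical tool run: 600 records screened, 9 of 12 eligible studies found
print(f"recall = {recall(9, 12):.1%}")               # 75.0%
print(f"NNR = {number_needed_to_read(600, 9):.1f}")  # 66.7
```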

Citations: 0
Exploring the Role of Artificial Intelligence in Evidence Synthesis: Insights From the CORE Information Retrieval Forum 2025
Pub Date : 2025-09-07 DOI: 10.1002/cesm.70049
Claire H. Eastaugh, Madeleine Still, Fiona R. Beyer, Sheila A. Wallace, Hannah O'Keefe

Introduction

Information retrieval is essential for evidence synthesis, but developing search strategies can be labor-intensive and time-consuming. Automating these processes would be of benefit and interest, though it is unclear if Information Specialists (IS) are willing to adopt artificial intelligence (AI) methodologies or how they currently use them. In January 2025, the NIHR Innovation Observatory and NIHR Methodology Incubator for Applied Health and Care Research co-sponsored the inaugural CORE Information Retrieval Forum, where attendees discussed AI's role in information retrieval.

Methods

The CORE Information Retrieval Forum hosted a Knowledge Café. Participation was voluntary, and attendees could choose one of six event-themed discussion tables including AI. To support each discussion, a QR code linking to a virtual collaboration tool (Padlet; padlet.com) and a poster in the exhibition space were available throughout the day for attendee contributions.

Results

The CORE Information Retrieval Forum was attended by 131 IS from nine different types of organizations, with most from the UK and ten countries represented overall. Among the six discussion points available in the Knowledge Café, the AI table was the most popular, receiving the highest number of contributions (n = 49). Following the Forum, contributions to the AI topic were categorized into four themes: critical perception (n = 21), current uses (n = 19), specific tools (n = 2), and training wants/needs (n = 7).

Conclusions

While there are critical perspectives on the integration of AI in the IS space, this is not due to a reluctance to adapt and adopt but rather reflects a need for structure, education, training, ethical guidance, and systems to support the responsible use and transparency of AI. There is interest in automating repetitive and time-consuming tasks, but attendees reported a lack of appropriate supporting tools. More work is required to identify the suitability of currently available tools and their potential to complement the work conducted by IS.

Citations: 0
Human Versus Artificial Intelligence: Comparing Cochrane Authors' and ChatGPT's Risk of Bias Assessments
Pub Date : 2025-08-31 DOI: 10.1002/cesm.70044
Petek Eylul Taneri

Introduction

Systematic reviews and meta-analyses synthesize randomized trial data to guide clinical decisions but require significant time and resources. Artificial intelligence (AI) offers a promising solution to streamline evidence synthesis, aiding study selection, data extraction, and risk of bias assessment. This study aims to evaluate the performance of ChatGPT-4o in assessing the risk of bias in randomised controlled trials (RCTs) using the Risk of Bias 2 (RoB 2) tool, comparing its results with assessments conducted by human reviewers in Cochrane Reviews.

Methods

A sample of Cochrane Reviews utilizing the RoB 2 tool was identified through the Cochrane Database of Systematic Reviews (CDSR). Protocols, qualitative systematic reviews, and reviews employing alternative risk of bias assessment tools were excluded. The study utilized ChatGPT-4o to assess the risk of bias using a structured set of prompts corresponding to the RoB 2 domains. The agreement between ChatGPT-4o and consensus-based human reviewer assessments was evaluated using weighted kappa statistics. Additionally, accuracy, sensitivity, specificity, positive predictive value, and negative predictive value were calculated. All analyses were performed using R Studio (version 4.3.0).

Results

A total of 42 Cochrane Reviews were screened, yielding a final sample of eight eligible reviews comprising 84 RCTs. The primary outcome of each included review was selected for risk of bias assessment. ChatGPT-4o demonstrated moderate agreement with human reviewers for the overall risk of bias judgments (weighted kappa = 0.51, 95% CI: 0.36–0.66). Agreement varied across domains, ranging from fair (κ = 0.20 for selection of the reported results) to moderate (κ = 0.59 for measurement of outcomes). ChatGPT-4o exhibited a sensitivity of 53% for identifying high-risk studies and a specificity of 99% for classifying low-risk studies.

Conclusion

This study shows that ChatGPT-4o can perform risk of bias assessments using RoB 2 with fair to moderate agreement with human reviewers. While AI-assisted risk of bias assessment remains imperfect, advancements in prompt engineering and model refinement may enhance performance. Future research should explore standardised prompts and investigate interrater reliability among human reviewers to provide a more robust comparison.
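For readers unfamiliar with the weighted kappa reported above, the following is a minimal Python sketch using linear weights over three ordinal RoB categories (low / some concerns / high). The confusion matrix is invented for illustration (rows: human consensus; columns: ChatGPT-4o); the study's own matrix is not reported in the abstract.

```python
def weighted_kappa(matrix):
    """Linearly weighted kappa: 1 - (observed weighted disagreement / expected)."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_tot = [sum(row) for row in matrix]
    col_tot = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    # linear disagreement weights: 0 on the diagonal, growing with distance
    w = [[abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    observed = sum(w[i][j] * matrix[i][j] / n for i in range(k) for j in range(k))
    expected = sum(w[i][j] * row_tot[i] * col_tot[j] / n**2
                   for i in range(k) for j in range(k))
    return 1 - observed / expected

conf = [[30, 6, 2],   # human "low risk"
        [8, 15, 5],   # human "some concerns"
        [2, 6, 10]]   # human "high risk"   (hypothetical counts)
print(f"weighted kappa = {weighted_kappa(conf):.2f}")
```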

Citations: 0
Artificial Intelligence and Automation in Evidence Synthesis: An Investigation of Methods Employed in Cochrane, Campbell Collaboration, and Environmental Evidence Reviews
Pub Date : 2025-08-28 DOI: 10.1002/cesm.70046
Kristen L. Scotti, Sarah Young, Melanie A. Gainey, Haoyong Lan

Automation, including Machine Learning (ML), is increasingly being explored to reduce the time and effort involved in evidence syntheses, yet its adoption and reporting practices remain under-examined across disciplines (e.g., health sciences, education, and policy). This review assesses the use of automation, including ML-based techniques, in 2271 evidence syntheses published between 2017 and 2024 in the Cochrane Database of Systematic Reviews and the journals Campbell Systematic Reviews and Environmental Evidence. We focus on automation across four review steps: search, screening, data extraction, and analysis/synthesis. We systematically identified eligible studies from the three sources and developed a classification system to distinguish between manual, rules-based, ML-enabled, and ML-embedded tools. We then extracted data on tool use, ML integration, reporting practices, motivations for (and against) ML adoption, and the application of stopping criteria for ML-assisted screening. Only ~5% of studies explicitly reported using ML, with most applications limited to screening tasks. Although ~12% employed ML-enabled tools, ~90% of those did not clarify whether ML functionalities were actually utilized. Living reviews showed higher relative ML integration (~15%), but overall uptake remains limited. Previous work has shown that common barriers to broader adoption included limited guidance, low user awareness, and concerns over reliability. Despite ML's potential to streamline evidence syntheses, its integration remains limited and inconsistently reported. Improved transparency, clearer reporting standards, and greater user training are needed to support responsible adoption. As the research literature grows, automation will become increasingly essential—but only if challenges in usability, reproducibility, and trust are addressed.
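The four-level classification described above lends itself to a simple coding scheme. Below is a minimal Python sketch, with hypothetical coded data, of how reviews might be tallied against it; the category names paraphrase the review's definitions and are not its actual codebook.

```python
from collections import Counter
from enum import Enum

class ToolUse(Enum):
    MANUAL = "manual"
    RULES_BASED = "rules-based"
    ML_ENABLED = "ML-enabled tool, ML use unclear"
    ML_EMBEDDED = "ML explicitly used"

# hypothetical coding of 100 reviews, roughly echoing the reported proportions
coded = ([ToolUse.MANUAL] * 80 + [ToolUse.RULES_BASED] * 3
         + [ToolUse.ML_ENABLED] * 12 + [ToolUse.ML_EMBEDDED] * 5)
print(Counter(t.value for t in coded))
```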

Citations: 0
Meta-Analysis Using Time-to-Event Data: A Tutorial
Pub Date : 2025-08-26 DOI: 10.1002/cesm.70041
Ashma Krishan, Kerry Dwan

This tutorial focuses on trials that assess time-to-event outcomes. We explain what hazard ratios are, how to interpret them and demonstrate how to include time-to-event data in a meta-analysis. Examples are presented to help with understanding. Accompanying the tutorial is a micro learning module, where we demonstrate a few approaches and give you the chance to practice calculating the hazard ratio. The time-to-event micro learning module is available at https://links.cochrane.org/cesm/tutorials/time-to-event-data.
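As a taste of what the tutorial covers, here is a minimal Python sketch, with hypothetical trial results, of one common approach to pooling hazard ratios: work on the log scale, recover each trial's standard error from its confidence interval, weight by inverse variance, and exponentiate back. This fixed-effect sketch stands in for the approaches the module itself demonstrates.

```python
import math

# (hazard ratio, 95% CI lower, 95% CI upper) for three hypothetical trials
trials = [(0.75, 0.60, 0.94), (0.82, 0.65, 1.03), (0.70, 0.52, 0.95)]

log_hrs, weights = [], []
for hr, lo, hi in trials:
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)  # SE recovered from the CI width
    log_hrs.append(math.log(hr))
    weights.append(1.0 / se**2)

pooled = sum(w * x for w, x in zip(weights, log_hrs)) / sum(weights)
se_pooled = math.sqrt(1.0 / sum(weights))
ci_lo = math.exp(pooled - 1.96 * se_pooled)
ci_hi = math.exp(pooled + 1.96 * se_pooled)
print(f"pooled HR = {math.exp(pooled):.2f} (95% CI {ci_lo:.2f} to {ci_hi:.2f})")
```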

Citations: 0
Lifecycles of Cochrane Systematic Reviews (2003–2024): A Bibliographic Study
Pub Date : 2025-08-17 DOI: 10.1002/cesm.70043
Shiyin Li, Chong Wu, Zichen Zhang, Mengli Xiao, Mohammad Hassan Murad, Lifeng Lin

Background and Objectives

The relevance of Cochrane systematic reviews depends on timely completion and updates. This study aimed to empirically assess the lifecycles of Cochrane reviews published from 2003 to 2024, including transitions from protocol to review, update patterns, and withdrawals.

Methods

We extracted data from Cochrane Library publications between 2003 and 2024. Each review topic was identified using a unique six-digit DOI-based ID. We recorded protocol publication, review publication, updates, and withdrawals (i.e., removal from the Cochrane Library for editorial or procedural reasons), calculated time intervals between stages, and conducted subgroup analyses by review type.

Results

Of 8137 protocols, 71.9% progressed to reviews (median 25.7 months), 2.4% were updated during the protocol stage, and 10.0% were withdrawn. Among 8477 reviews, 64.3% were never updated by the time of our analysis; for those updated at least once, the median interval between updates was 57.2 months. Withdrawal occurred in 2.5% of reviews (median 67.6 months post-publication). Subgroup analyses showed variation across review types; diagnostic and qualitative reviews tended to have longer protocol-to-review times than other types of reviews.

Conclusions

Cochrane reviews show long development and update intervals, with variation by review type. Greater use of automation and targeted support may improve review efficiency and timeliness.
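The intervals reported above are straightforward to derive from publication dates. Below is a minimal Python sketch with invented dates; the study's exact date handling is not described in the abstract.

```python
from datetime import date
from statistics import median

# (protocol date, first review date) for three hypothetical review topics
pairs = [
    (date(2015, 3, 1), date(2017, 6, 15)),
    (date(2018, 1, 10), date(2019, 11, 2)),
    (date(2020, 5, 20), date(2024, 2, 28)),
]

# convert each gap to months using an average month length of 30.44 days
months = [(review - protocol).days / 30.44 for protocol, review in pairs]
print(f"median protocol-to-review time = {median(months):.1f} months")
```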

Citations: 0
Optimizing Research Impact: A Toolkit for Stakeholder-Driven Prioritization of Systematic Review Topics
Pub Date : 2025-08-14 DOI: 10.1002/cesm.70039
Dyon Hoekstra, Stefan K. Lhachimi

Introduction

The prioritization of topics for evidence synthesis is crucial for maximizing the relevance and impact of systematic reviews. This article introduces a comprehensive toolkit designed to facilitate a structured, multi-step framework for engaging a broad spectrum of stakeholders in the prioritization process, ensuring the selection of topics that are both relevant and applicable.

Methods

We detail an open-source framework comprising 11 coherent steps, segmented into scoping and Delphi stages, to offer a flexible and resource-efficient approach for stakeholder involvement in research priority setting.

Results

The toolkit provides ready-to-use tools for the development, application, and analysis of the framework, including templates for online surveys developed with free open-source software, ensuring ease of replication and adaptation in various research fields. The framework supports the transparent and systematic development and assessment of systematic review topics, with a particular focus on stakeholder-refined assessment criteria.

Conclusion

Our toolkit enhances the transparency and ease of the priority-setting process. Targeted primarily at organizations and research groups seeking to allocate resources for future research based on stakeholder needs, this toolkit stands as a valuable resource for informed decision-making in research prioritization.

Citations: 0
ChatGPT-4o Compared With Human Researchers in Writing Plain-Language Summaries for Cochrane Reviews: A Blinded, Randomized Non-Inferiority Controlled Trial
Pub Date : 2025-07-28 DOI: 10.1002/cesm.70037
Dagný Halla Ágústsdóttir, Jacob Rosenberg, Jason Joe Baker
Introduction

Plain language summaries in Cochrane reviews are designed to present key information in a way that is understandable to individuals without a medical background. Despite Cochrane's author guidelines, these summaries often fail to achieve their intended purpose. Studies show that they are generally difficult to read and vary in their adherence to the guidelines. Artificial intelligence is increasingly used in medicine and academia, with its potential being tested in various roles. This study aimed to investigate whether ChatGPT-4o could produce plain language summaries that are as good as the already published plain language summaries in Cochrane reviews.

Methods

We conducted a randomized, single-blinded study with a total of 36 plain language summaries: 18 human written and 18 ChatGPT-4o generated summaries, where both versions were for the same Cochrane reviews. The sample size was calculated to be 36, and each summary was evaluated four times: twice by members of a Cochrane editorial group and twice by laypersons. The summaries were assessed in three different ways. First, all assessors evaluated the summaries for informativeness, readability, and level of detail using a Likert scale from 1 to 10; they were also asked whether they would submit the summary and whether they could identify who had written it. Second, members of a Cochrane editorial group assessed the summaries using a checklist based on Cochrane's guidelines for plain language summaries, with scores ranging from 0 to 10. Finally, the readability of the summaries was analyzed using objective tools such as Lix and Flesch-Kincaid scores. Randomization and allocation to either ChatGPT-4o or human written summaries were conducted using random.org's random sequence generator, and assessors were blinded to the authorship of the summaries.

Results

The plain language summaries generated by ChatGPT-4o scored 1 point higher on information (p < .001) and level of detail (p = .004), and 2 points higher on readability (p = .002) compared with human written summaries. Lix and Flesch-Kincaid scores were high for both groups of summaries, though the ChatGPT summaries were slightly easier to read (p < .001). Assessors found it difficult to distinguish between ChatGPT and human written summaries, with only 20% correctly identifying ChatGPT generated text. ChatGPT summaries were preferred for submission over the human written summaries (64% vs. 36%, p < .001).

Conclusion

ChatGPT-4o shows promise for creating plain language summaries for Cochrane reviews that are at least as good as, and in some respects slightly better than, those written by humans. This study suggests that ChatGPT-4o can serve as a tool for drafting understandable plain language summaries for Cochrane reviews, with quality approaching or matching that of human authors. The trial registration and protocol are available at https://osf.io/aq6r5.
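The two readability indices used in the study are simple formulas over word, sentence, and syllable counts. Below is a minimal Python sketch of both; the syllable counter is a crude vowel-group heuristic, and this is not the study's implementation.

```python
import re

def lix(text: str) -> float:
    """Lix = words/sentences + 100 * (words longer than 6 letters) / words."""
    words = re.findall(r"[A-Za-z]+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words) / sentences + 100 * long_words / len(words)

def syllables(word: str) -> int:
    # rough heuristic: count groups of consecutive vowels
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    """FK grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    words = re.findall(r"[A-Za-z]+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syl = sum(syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syl / len(words) - 15.59

sample = "We compared two treatments. The intervention reduced pain considerably."
print(f"Lix = {lix(sample):.1f}, FK grade = {flesch_kincaid_grade(sample):.1f}")
```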
Citations: 0