Impact of matrix-construction assumptions on quantitative overlap assessment in overviews: A meta-research study.
Research Synthesis Methods, 17(2): 348-364 | Pub Date: 2026-03-01 | Epub Date: 2025-11-17 | DOI: 10.1017/rsm.2025.10056 | Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873615/pdf/
Javier Bracchiglione, Nicolás Meza, Dawid Pieper, Carole Lunny, Manuel Vargas-Peirano, Johanna Vicuña, Fernando Briceño, Roberto Garnham Parra, Ignacio Pérez Carrasco, Gerard Urrútia, Xavier Bonfill, Eva Madrid
Overlap of primary studies among multiple systematic reviews (SRs) is a major challenge when conducting overviews. The corrected covered area (CCA) is a metric computed from a matrix of evidence that quantifies overlap. Therefore, the assumptions used to generate the matrix may significantly affect the CCA. We aim to explore how these varying assumptions influence CCA calculations. We searched two databases for intervention-focused overviews published during 2023. Two reviewers conducted study selection and data extraction. We extracted overview characteristics and methods to handle overlap. For seven sampled overviews, we calculated overall and pairwise CCA across 16 scenarios, representing four matrix-construction assumptions. Of 193 included overviews, only 23 (11.9%) adhered to an overview-specific reporting guideline (e.g. PRIOR). Eighty-five (44.0%) did not address overlap; 14 (7.3%) only mentioned it in the discussion; and 94 (48.7%) incorporated it into methods or results (38 using CCA). Among the seven sampled overviews, CCA values varied depending on matrix-construction assumptions, ranging from 1.2% to 13.5% with the overall method and 0.0% to 15.7% with the pairwise method. CCA values may vary depending on the assumptions made during matrix construction, including scope, treatment of structural missingness, and handling of publication threads. This variability calls into question the uncritical use of current CCA thresholds and underscores the need for overview authors to report both overall and pairwise CCA calculations. Our preliminary guidance for transparently reporting matrix-construction assumptions may improve the accuracy and reproducibility of CCA assessments.
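As a concrete illustration of how the CCA is computed from an evidence matrix, here is a minimal sketch (not the authors' code; the toy matrix, function names, and the pairwise handling of structural missingness are illustrative assumptions). It uses the standard formula CCA = (N - r) / (rc - r), where N is the total number of inclusions in the matrix, r the number of index publications (rows), and c the number of reviews (columns):

```python
import numpy as np

def overall_cca(matrix: np.ndarray) -> float:
    """Overall corrected covered area (CCA) for a binary evidence matrix.

    Rows are index (unique) primary studies, columns are systematic reviews;
    a cell is 1 if the review includes that study. CCA = (N - r) / (r*c - r),
    with N the total number of inclusions, r the number of rows, and c the
    number of columns.
    """
    r, c = matrix.shape
    n_total = matrix.sum()
    return (n_total - r) / (r * c - r)

def pairwise_cca(matrix: np.ndarray) -> np.ndarray:
    """Pairwise CCA for every pair of reviews (columns), here restricted to
    studies covered by at least one review of the pair."""
    r, c = matrix.shape
    out = np.full((c, c), np.nan)
    for i in range(c):
        for j in range(i + 1, c):
            pair = matrix[:, [i, j]]
            pair = pair[pair.any(axis=1)]          # drop studies absent from both reviews
            rows = pair.shape[0]
            if rows:
                out[i, j] = out[j, i] = (pair.sum() - rows) / (rows * 2 - rows)
    return out

# Toy example: 5 primary studies across 3 reviews
m = np.array([[1, 1, 0],
              [1, 0, 1],
              [1, 1, 1],
              [0, 1, 0],
              [0, 0, 1]])
print(overall_cca(m))       # overall overlap
print(pairwise_cca(m))      # overlap for each review pair
```

The pairwise variant above drops studies absent from both reviews of a pair; as the abstract notes, choices like this are exactly the matrix-construction assumptions that can shift the resulting CCA.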
{"title":"Impact of matrix-construction assumptions on quantitative overlap assessment in overviews: A meta-research study.","authors":"Javier Bracchiglione, Nicolás Meza, Dawid Pieper, Carole Lunny, Manuel Vargas-Peirano, Johanna Vicuña, Fernando Briceño, Roberto Garnham Parra, Ignacio Pérez Carrasco, Gerard Urrútia, Xavier Bonfill, Eva Madrid","doi":"10.1017/rsm.2025.10056","DOIUrl":"10.1017/rsm.2025.10056","url":null,"abstract":"<p><p>Overlap of primary studies among multiple systematic reviews (SRs) is a major challenge when conducting overviews. The corrected covered area (CCA) is a metric computed from a matrix of evidence that quantifies overlap. Therefore, the assumptions used to generate the matrix may significantly affect the CCA. We aim to explore how these varying assumptions influence CCA calculations. We searched two databases for intervention-focused overviews published during 2023. Two reviewers conducted study selection and data extraction. We extracted overview characteristics and methods to handle overlap. For seven sampled overviews, we calculated overall and pairwise CCA across 16 scenarios, representing four matrix-construction assumptions. Of 193 included overviews, only 23 (11.9%) adhered to an overview-specific reporting guideline (e.g. PRIOR). Eighty-five (44.0%) did not address overlap; 14 (7.3%) only mentioned it in the discussion; and 94 (48.7%) incorporated it into methods or results (38 using CCA). Among the seven sampled overviews, CCA values varied depending on matrix-construction assumptions, ranging from 1.2% to 13.5% with the overall method and 0.0% to 15.7% with the pairwise method. CCA values may vary depending on the assumptions made during matrix construction, including scope, treatment of structural missingness, and handling of publication threads. This variability calls into question the uncritical use of current CCA thresholds and underscores the need for overview authors to report both overall and pairwise CCA calculations. Our preliminary guidance for transparently reporting matrix-construction assumptions may improve the accuracy and reproducibility of CCA assessments.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"17 2","pages":"348-364"},"PeriodicalIF":6.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873615/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146111564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Compact large language models for title and abstract screening in systematic reviews: An assessment of feasibility, accuracy, and workload reduction.
Research Synthesis Methods, 17(2): 332-347 | Pub Date: 2026-03-01 | Epub Date: 2025-11-13 | DOI: 10.1017/rsm.2025.10044 | Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873614/pdf/
Antonio Sciurti, Giuseppe Migliara, Leonardo Maria Siena, Claudia Isonne, Maria Roberta De Blasiis, Alessandra Sinopoli, Jessica Iera, Carolina Marzuillo, Corrado De Vito, Paolo Villari, Valentina Baccolini
Systematic reviews play a critical role in evidence-based research but are labor-intensive, especially during title and abstract screening. Compact large language models (LLMs) offer the potential to automate this process while balancing time/cost requirements and accuracy. The aim of this study is to assess the feasibility, accuracy, and workload reduction achieved by three compact LLMs (GPT-4o mini, Llama 3.1 8B, and Gemma 2 9B) in screening titles and abstracts. Records were sourced from three previously published systematic reviews, and the LLMs were asked to rate each record from 0 to 100 for inclusion using a structured prompt. Predefined rating thresholds (25, 50, and 75) were used to compute performance metrics (balanced accuracy, sensitivity, specificity, positive and negative predictive value, and workload saving). Processing time and costs were recorded. Across the systematic reviews, the LLMs achieved high sensitivity (up to 100%) but low precision (below 10%) for records included at full text. Specificity and workload savings improved at higher thresholds, with the 50- and 75-rating thresholds offering the best trade-offs. GPT-4o mini, accessed via an application programming interface, was the fastest model (~40 minutes max.) and incurred usage costs of $0.14-$1.93 per review. Llama 3.1 8B and Gemma 2 9B were run locally, took longer (~4 hours max.), and were free to use. LLMs were highly sensitive tools for the title/abstract screening process. High specificity values were reached, allowing for substantial workload savings at reasonable cost and processing time. Conversely, we found them to be imprecise. However, high sensitivity and workload reduction are the key factors for their use in the title/abstract screening phase of systematic reviews.
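To make the threshold-based evaluation concrete, here is a minimal sketch of how such metrics can be computed from LLM ratings and full-text inclusion labels (the data, threshold semantics, and function names are illustrative assumptions, not taken from the study):

```python
def screening_metrics(ratings, labels, threshold):
    """Confusion-matrix metrics for LLM-assisted title/abstract screening.

    ratings:   LLM inclusion ratings on a 0-100 scale (one per record)
    labels:    1 if the record was included at full text, else 0
    threshold: records rated >= threshold are flagged for human review
    """
    preds = [1 if r >= threshold else 0 for r in ratings]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    tn = sum((not p) and (not l) for p, l in zip(preds, labels))
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    precision = tp / (tp + fp) if tp + fp else float("nan")
    workload_saving = (tn + fn) / len(labels)   # share of records not sent to humans
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "workload_saving": workload_saving}

# Illustrative data: 8 records, 2 true full-text inclusions
ratings = [90, 75, 60, 55, 40, 30, 20, 10]
labels  = [1, 0, 1, 0, 0, 0, 0, 0]
for t in (25, 50, 75):
    print(t, screening_metrics(ratings, labels, t))
```

Raising the threshold trades sensitivity for specificity and workload saving, which is the trade-off the abstract describes at the 50- and 75-rating cutoffs.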
{"title":"Compact large language models for title and abstract screening in systematic reviews: An assessment of feasibility, accuracy, and workload reduction.","authors":"Antonio Sciurti, Giuseppe Migliara, Leonardo Maria Siena, Claudia Isonne, Maria Roberta De Blasiis, Alessandra Sinopoli, Jessica Iera, Carolina Marzuillo, Corrado De Vito, Paolo Villari, Valentina Baccolini","doi":"10.1017/rsm.2025.10044","DOIUrl":"10.1017/rsm.2025.10044","url":null,"abstract":"<p><p>Systematic reviews play a critical role in evidence-based research but are labor-intensive, especially during title and abstract screening. Compact large language models (LLMs) offer potential to automate this process, balancing time/cost requirements and accuracy. The aim of this study is to assess the feasibility, accuracy, and workload reduction by three compact LLMs (GPT-4o mini, Llama 3.1 8B, and Gemma 2 9B) in screening titles and abstracts. Records were sourced from three previously published systematic reviews and LLMs were requested to rate each record from 0 to 100 for inclusion, using a structured prompt. Predefined 25-, 50-, 75-rating thresholds were used to compute performance metrics (balanced accuracy, sensitivity, specificity, positive and negative predictive value, and workload-saving). Processing time and costs were registered. Across the systematic reviews, LLMs achieved high sensitivity (up to 100%) and low precision (below 10%) for records included by full text. Specificity and workload savings improved at higher thresholds, with the 50- and 75-rating thresholds offering optimal trade-offs. GPT-4o-mini, accessed via application programming interface, was the fastest model (~40 minutes max.) and had usage costs ($0.14-$1.93 per review). Llama 3.1-8B and Gemma 2-9B were run locally in longer times (~4 hours max.) and were free to use. LLMs were highly sensitive tools for the title/abstract screening process. High specificity values were reached, allowing for significant workload savings, at reasonable costs and processing time. Conversely, we found them to be imprecise. However, high sensitivity and workload reduction are key factors for their usage in the title/abstract screening phase of systematic reviews.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"17 2","pages":"332-347"},"PeriodicalIF":6.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873614/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146111628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Beyond human gold standards: A multimodel framework for automated abstract classification and information extraction.
Research Synthesis Methods, 17(2): 365-377 | Pub Date: 2026-03-01 | Epub Date: 2025-11-17 | DOI: 10.1017/rsm.2025.10054 | Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873610/pdf/
Delphine S Courvoisier, Diana Buitrago-Garcia, Clément P Buclin, Nils Bürgisser, Michele Iudici, Denis Mongin
Meta-research and evidence synthesis require considerable resources. Large language models (LLMs) have emerged as promising tools to assist in these processes, yet their performance varies across models, limiting their reliability. Taking advantage of the wide availability of small (<10 billion parameter) open-source LLMs, we implemented an agreement-based framework in which a decision is accepted only if at least a given number of LLMs produce the same response; otherwise, the decision is withheld. This approach was tested on 1020 abstracts of randomized controlled trials in rheumatology, using 2 classic literature review tasks: (1) classifying each intervention as drug or nondrug based on text interpretation and (2) extracting the total number of randomized patients, a task that sometimes required calculations. Re-examining abstracts where at least 4 LLMs disagreed with the human gold standard (dual review with adjudication) allowed us to construct an improved gold standard. Compared to the human gold standard and single large LLMs (>70 billion parameters), our framework demonstrated robust performance: several model combinations achieved accuracies above 95%, exceeding the human gold standard on at least 85% of abstracts (e.g., 3 of 5 models, 4 of 6 models, or 5 of 7 models). Performance variability across individual models was not an issue, as low-performing models contributed fewer accepted decisions. This agreement-based framework offers a scalable solution that can replace human reviewers for most abstracts, reserving human expertise for the more complex cases. Such frameworks could significantly reduce the manual burden in systematic reviews while maintaining high accuracy and reproducibility.
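A minimal sketch of the agreement rule described above, with placeholder model outputs and voting thresholds (the authors' pipeline and prompts are not reproduced here):

```python
from collections import Counter

def agreement_decision(responses, min_agreement):
    """Accept a decision only if at least `min_agreement` models concur.

    responses: list of model outputs for one abstract (e.g., "drug"/"nondrug"
               labels or extracted patient counts); None marks a failed parse.
    Returns (decision, accepted): the majority response and whether the
    agreement threshold was met; if not, the abstract is withheld for humans.
    """
    valid = [r for r in responses if r is not None]
    if not valid:
        return None, False
    decision, votes = Counter(valid).most_common(1)[0]
    return (decision, True) if votes >= min_agreement else (None, False)

# Illustrative "4 of 6" rule for the drug/nondrug classification task
answers = ["drug", "drug", "nondrug", "drug", "drug", None]
print(agreement_decision(answers, min_agreement=4))                     # ('drug', True)
# Withheld case: only 3 models return the same extracted patient count
print(agreement_decision(["120", "120", "118", "120", None, None], 4))  # (None, False)
```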
{"title":"Beyond human gold standards: A multimodel framework for automated abstract classification and information extraction.","authors":"Delphine S Courvoisier, Diana Buitrago-Garcia, Clément P Buclin, Nils Bürgisser, Michele Iudici, Denis Mongin","doi":"10.1017/rsm.2025.10054","DOIUrl":"10.1017/rsm.2025.10054","url":null,"abstract":"<p><p>Meta-research and evidence synthesis require considerable resources. Large language models (LLMs) have emerged as promising tools to assist in these processes, yet their performance varies across models, limiting their reliability. Taking advantage of the large availability of small size (<10 billion parameters) open-source LLMs, we implemented an agreement-based framework in which a decision is taken only if at least a given number of LLMs produce the same response. The decision is otherwise withheld. This approach was tested on 1020 abstracts of randomized controlled trials in rheumatology, using 2 classic literature review tasks: (1) classifying each intervention as drug or nondrug based on text interpretation and (2) extracting the total number of randomized patients, a task that sometimes required calculations. Re-examining abstracts where at least 4 LLMs disagreed with the human gold standard (dual review with adjudication) allowed constructing an improved gold standard. Compared to a human gold standard and single large LLMs (>70 billion parameters), our framework demonstrated robust performance: several model combinations achieved accuracies above 95% exceeding the human gold standard on at least 85% of abstracts (e.g., 3 of 5 models, 4 of 6 models, or 5 of 7 models). Performance variability across individual models was not an issue, as low-performing models contributed fewer accepted decisions. This agreement-based framework offers a scalable solution that can replace human reviewers for most abstracts, reserving human expertise for more complex cases. Such frameworks could significantly reduce the manual burden in systematic reviews while maintaining high accuracy and reproducibility.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"17 2","pages":"365-377"},"PeriodicalIF":6.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873610/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146111549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bayesian workflow for bias-adjustment model in meta-analysis.
Research Synthesis Methods, 17(2): 293-313 | Pub Date: 2026-03-01 | Epub Date: 2025-11-13 | DOI: 10.1017/rsm.2025.10050 | Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873618/pdf/
Juyoung Jung, Ariel M Aloe
Bayesian hierarchical models offer a principled framework for adjusting for study-level bias in meta-analysis, but their complexity and sensitivity to prior specifications necessitate a systematic framework for robust application. This study demonstrates the application of a Bayesian workflow to this challenge, comparing a standard random-effects model to a bias-adjustment model across a real-world dataset and a targeted simulation study. The workflow revealed a high sensitivity of results to the prior on bias probability, showing that while the simpler random-effects model had superior predictive accuracy as measured by the widely applicable information criterion, the bias-adjustment model successfully propagated uncertainty by producing wider, more conservative credible intervals. The simulation confirmed the model's ability to recover true parameters when priors were well-specified. These results establish the Bayesian workflow as a principled framework for diagnosing model sensitivities and ensuring the transparent application of complex bias-adjustment models in evidence synthesis.
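For orientation, the two models being compared can be written schematically as follows. This is a generic bias-adjustment formulation under common assumptions (additive study-level bias switched on with some prior probability), not necessarily the exact specification used in the paper:

```latex
% Standard random-effects model for study estimates y_i with standard errors s_i
y_i \sim \mathcal{N}(\theta_i, s_i^2), \qquad \theta_i \sim \mathcal{N}(\mu, \tau^2)

% Schematic bias-adjustment extension: study i carries an additive bias with
% prior probability \pi -- the kind of prior to which results can be sensitive
\theta_i \sim \mathcal{N}(\mu + \gamma_i \beta_i, \tau^2), \qquad
\gamma_i \sim \mathrm{Bernoulli}(\pi), \qquad
\beta_i \sim \mathcal{N}(\mu_\beta, \sigma_\beta^2)
```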
{"title":"Bayesian workflow for bias-adjustment model in meta-analysis.","authors":"Juyoung Jung, Ariel M Aloe","doi":"10.1017/rsm.2025.10050","DOIUrl":"10.1017/rsm.2025.10050","url":null,"abstract":"<p><p>Bayesian hierarchical models offer a principled framework for adjusting for study-level bias in meta-analysis, but their complexity and sensitivity to prior specifications necessitate a systematic framework for robust application. This study demonstrates the application of a Bayesian workflow to this challenge, comparing a standard random-effects model to a bias-adjustment model across a real-world dataset and a targeted simulation study. The workflow revealed a high sensitivity of results to the prior on bias probability, showing that while the simpler random-effects model had superior predictive accuracy as measured by the widely applicable information criterion, the bias-adjustment model successfully propagated uncertainty by producing wider, more conservative credible intervals. The simulation confirmed the model's ability to recover true parameters when priors were well-specified. These results establish the Bayesian workflow as a principled framework for diagnosing model sensitivities and ensuring the transparent application of complex bias-adjustment models in evidence synthesis.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"17 2","pages":"293-313"},"PeriodicalIF":6.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873618/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146111593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RaCE: A rank-clustering estimation method for network meta-analysis.
Research Synthesis Methods, 17(2): 314-331 | Pub Date: 2026-03-01 | Epub Date: 2025-11-13 | DOI: 10.1017/rsm.2025.10049 | Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873617/pdf/
Michael Pearce, Shouhao Zhou
Ranking multiple interventions is a crucial task in network meta-analysis (NMA) to guide clinical and policy decisions. However, conventional ranking methods often oversimplify treatment distinctions, potentially yielding misleading conclusions due to inherent uncertainty in relative intervention effects. To address these limitations, we propose a novel Bayesian rank-clustering estimation approach, termed rank-clustering estimation (RaCE), specifically developed for NMA. Rather than identifying a single "best" intervention, RaCE enables the probabilistic clustering of interventions with similar effectiveness, offering a more nuanced and parsimonious interpretation. By decoupling the clustering procedure from the NMA modeling process, RaCE is a flexible and broadly applicable approach that can accommodate different types of outcomes (binary, continuous, and survival), modeling approaches (arm-based and contrast-based), and estimation frameworks (frequentist or Bayesian). Simulation studies demonstrate that RaCE effectively captures rank-clusters even under conditions of substantial uncertainty and overlapping intervention effects, providing more reasonable result interpretation than traditional single-ranking methods. We illustrate the practical utility of RaCE through an NMA application to frontline immunochemotherapies for follicular lymphoma, revealing clinically relevant clusters among treatments previously assumed to have distinct ranks. Overall, RaCE provides a valuable tool for researchers to enhance rank estimation and interpretability, facilitating evidence-based decision-making in complex intervention landscapes.
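To see the problem RaCE addresses, consider the conventional step of turning posterior samples of relative effects into rank probabilities. The sketch below uses simulated draws (not data from the paper) and shows how two treatments can split the "best" rank almost evenly, a distinction that a single ranking overstates and that RaCE would instead represent as a cluster:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative posterior draws of relative effects for 4 treatments
# (e.g., log hazard ratios vs. a common comparator); lower = better here.
draws = rng.normal(loc=[-0.30, -0.28, -0.05, 0.00],
                   scale=[0.10, 0.12, 0.10, 0.08],
                   size=(4000, 4))

# Rank treatments within each posterior draw (1 = best)
ranks = draws.argsort(axis=1).argsort(axis=1) + 1

# Probability that each treatment attains each rank (rows: treatments, cols: ranks)
rank_probs = np.stack([(ranks == k).mean(axis=0) for k in range(1, 5)], axis=1)
print(np.round(rank_probs, 2))
# Treatments 1 and 2 split the top rank almost evenly: a single "best" treatment
# summary would overstate a distinction the posterior does not support.
```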
{"title":"RaCE: A rank-clustering estimation method for network meta-analysis.","authors":"Michael Pearce, Shouhao Zhou","doi":"10.1017/rsm.2025.10049","DOIUrl":"10.1017/rsm.2025.10049","url":null,"abstract":"<p><p>Ranking multiple interventions is a crucial task in network meta-analysis (NMA) to guide clinical and policy decisions. However, conventional ranking methods often oversimplify treatment distinctions, potentially yielding misleading conclusions due to inherent uncertainty in relative intervention effects. To address these limitations, we propose a novel Bayesian rank-clustering estimation approach, termed rank-clustering estimation (RaCE), specifically developed for NMA. Rather than identifying a single \"best\" intervention, RaCE enables the probabilistic clustering of interventions with similar effectiveness, offering a more nuanced and parsimonious interpretation. By decoupling the clustering procedure from the NMA modeling process, RaCE is a flexible and broadly applicable approach that can accommodate different types of outcomes (binary, continuous, and survival), modeling approaches (arm-based and contrast-based), and estimation frameworks (frequentist or Bayesian). Simulation studies demonstrate that RaCE effectively captures rank-clusters even under conditions of substantial uncertainty and overlapping intervention effects, providing more reasonable result interpretation than traditional single-ranking methods. We illustrate the practical utility of RaCE through an NMA application to frontline immunochemotherapies for follicular lymphoma, revealing clinically relevant clusters among treatments previously assumed to have distinct ranks. Overall, RaCE provides a valuable tool for researchers to enhance rank estimation and interpretability, facilitating evidence-based decision-making in complex intervention landscapes.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"17 2","pages":"314-331"},"PeriodicalIF":6.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873617/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146111571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The inclusion or exclusion of studies based on critical appraisal results in JBI qualitative systematic reviews: An analysis of practices.
Research Synthesis Methods, 17(2): 277-292 | Pub Date: 2026-03-01 | Epub Date: 2025-10-23 | DOI: 10.1017/rsm.2025.10042 | Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873616/pdf/
Romy Menghao Jia, Cindy Stern
Critical appraisal is a core component of JBI qualitative evidence synthesis, offering insights into the quality of included studies and their potential influence on synthesized findings. However, limited guidance exists on whether, when, and how to exclude studies based on appraisal results. This study examined the methods used in JBI qualitative systematic reviews and the implications for synthesized findings. In this study, a systematic analysis of qualitative reviews published between 2018 and 2022 in JBI Evidence Synthesis was conducted. Data on decisions and their justifications were extracted from reviews and protocols. Descriptive and content analysis explored variations in the reported methods. Forty-five reviews were included. Approaches reported varied widely: 24% of reviews included all studies regardless of quality, while others applied exclusion criteria (36%), cutoff scores (11%), or multiple methods (9%). Limited justifications were provided for the approaches. Few reviews cited methodological references to support their decisions. Review authors reported their approach in various sections of the review, with inconsistencies identified in 18% of the sample. In addition, unclear or ambiguous descriptions were also identified in 18% of the included reviews. No clear differences were observed in ConQual scores between reviews that excluded studies and those that did not. Overall, the variability raises concerns about the credibility, transparency, and reproducibility of JBI qualitative systematic reviews. Decisions regarding the inclusion or exclusion of studies based on critical appraisal need to be clearly justified and consistently reported. Further methodological research is needed to support rigorous decision-making and to improve the reliability of synthesized findings.
{"title":"The inclusion or exclusion of studies based on critical appraisal results in JBI qualitative systematic reviews: An analysis of practices.","authors":"Romy Menghao Jia, Cindy Stern","doi":"10.1017/rsm.2025.10042","DOIUrl":"10.1017/rsm.2025.10042","url":null,"abstract":"<p><p>Critical appraisal is a core component of JBI qualitative evidence synthesis, offering insights into the quality of included studies and their potential influence on synthesized findings. However, limited guidance exists on whether, when, and how to exclude studies based on appraisal results. This study examined the methods used in JBI qualitative systematic reviews and the implications for synthesized findings. In this study, a systematic analysis of qualitative reviews published between 2018 and 2022 in <i>JBI Evidence Synthesis</i> was conducted. Data on decisions and their justifications were extracted from reviews and protocols. Descriptive and content analysis explored variations in the reported methods. Forty-five reviews were included. Approaches reported varied widely: 24% of reviews included all studies regardless of quality, while others applied exclusion criteria (36%), cutoff scores (11%), or multiple methods (9%). Limited justifications were provided for the approaches. Few reviews cited methodological references to support their decisions. Review authors reported their approach in various sections of the review, with inconsistencies identified in 18% of the sample. In addition, unclear or ambiguous descriptions were also identified in 18% of the included reviews. No clear differences were observed in ConQual scores between reviews that excluded studies and those that did not. Overall, the variability raises concerns about the credibility, transparency, and reproducibility of JBI qualitative systematic reviews. Decisions regarding the inclusion or exclusion of studies based on critical appraisal need to be clearly justified and consistently reported. Further methodological research is needed to support rigorous decision-making and to improve the reliability of synthesized findings.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"17 2","pages":"277-292"},"PeriodicalIF":6.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873616/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146111558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The application of ROBINS-I guidance in systematic reviews of non-randomised studies: A descriptive study.
Research Synthesis Methods, 17(2): 265-276 | Pub Date: 2026-03-01 | Epub Date: 2025-10-22 | DOI: 10.1017/rsm.2025.10048 | Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873613/pdf/
Zipporah Iheozor-Ejiofor, Jelena Savović, Russell J Bowater, Julian P T Higgins
The ROBINS-I tool is commonly used to assess risk of bias in non-randomised studies of interventions (NRSI) included in systematic reviews. The reporting of ROBINS-I results is important for decision-makers using systematic reviews to understand the weaknesses of the evidence. In particular, systematic review authors should apply the tool according to the guidance provided. This study aims to describe how ROBINS-I guidance is currently applied by review authors. In January 2023, we undertook a citation search and screened titles and abstracts of records published in the previous 6 months. We included systematic reviews of non-randomised studies of interventions in which ROBINS-I had been used for risk-of-bias assessment. Based on 10 criteria, we summarised the diverse ways in which reviews deviated from or reported the use of ROBINS-I. In total, 492 reviews met our inclusion criteria. Only one review met all the expectations of the ROBINS-I guidance. A small proportion of reviews deviated from the seven standard domains (3%), from the judgements (13%), or in other ways (1%). Of the 476 (97%) reviews that reported some ROBINS-I results, only 57 (12%) reported ROBINS-I results at the outcome level, compared with 203 that reported results at the study level alone. Most systematic reviews of NRSI do not fully apply the ROBINS-I guidance. This raises concerns about the validity of the reported ROBINS-I results and the use of evidence from these reviews in decision-making.
{"title":"The application of ROBINS-I guidance in systematic reviews of non-randomised studies: A descriptive study.","authors":"Zipporah Iheozor-Ejiofor, Jelena Savović, Russell J Bowater, Julian P T Higgins","doi":"10.1017/rsm.2025.10048","DOIUrl":"10.1017/rsm.2025.10048","url":null,"abstract":"<p><p>The ROBINS-I tool is a commonly used tool to assess risk of bias in non-randomised studies of interventions (NRSI) included in systematic reviews. The reporting of ROBINS-I results is important for decision-makers using systematic reviews to understand the weaknesses of the evidence. In particular, systematic review authors should apply the tool according to the guidance provided. This study aims to describe how ROBINS-I guidance is currently applied by review authors. In January 2023, we undertook a citation search and screened titles and abstracts of records published in the previous 6 months. We included systematic reviews of non-randomised studies of intervention where ROBINS-I had been used for risk-of-bias assessment. Based on 10 criteria, we summarised the diverse ways in which reviews deviated from or reported the use of ROBINS-I. In total, 492 reviews met our inclusion criteria. Only one review met all the expectations of the ROBINS-I guidance. A small proportion of reviews deviated from the seven standard domains (3%), judgements (13%), or in other ways (1%). Of the 476 (97%) reviews that reported some ROBINS-I results, only 57 (12%) reviews reported ROBINS-I results at the outcome level compared with 203 reviews that reported ROBINS-I results at the study level alone. Most systematic reviews of NRSIs do not fully apply the ROBINS-I guidance. This raises concerns around the validity of the ROBINS-I results reported and the use of the evidence from these reviews in decision-making.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"17 2","pages":"265-276"},"PeriodicalIF":6.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873613/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146111555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conducting evidence synthesis and developing evidence-based advice in public health and beyond: A scoping review and map of methods guidance.
Research Synthesis Methods, 17(2): 240-264 | Pub Date: 2026-03-01 | Epub Date: 2025-11-18 | DOI: 10.1017/rsm.2025.10051 | Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873621/pdf/
Ani Movsisyan, Kolahta Asres Ioab, Jan William Himmels, Gina Loretta Bantle, Andreea Dobrescu, Signe Flottorp, Frode Forland, Arianna Gadinger, Christina Koscher-Kien, Irma Klerings, Joerg J Meerpohl, Barbara Nussbaumer-Streit, Brigitte Strahwald, Eva A Rehfuess
Effective public health decision-making relies on rigorous evidence synthesis and transparent processes to facilitate its use. However, existing methods guidance has primarily been developed within clinical medicine and may not sufficiently address the complexities of public health, such as population-level considerations, multiple evidence streams, and time-sensitive decision-making. This work contributes to the European Centre for Disease Prevention and Control initiative on methods guidance development for evidence synthesis and evidence-based public health advice by systematically identifying and mapping guidance from health and health-related disciplines. Structured searches were conducted across multiple scientific databases and websites of key institutions, followed by screening and data coding. Of the 17,386 records identified, 247 documents were classified as 'guidance products' providing a set of principles or recommendations on the overall process of developing evidence synthesis and evidence-based advice. While many were classified as 'generic' in scope, a majority originated from clinical medicine and focused on systematic reviews of intervention effects. Only 41 documents explicitly addressed public health. Key gaps included approaches for rapid evidence synthesis and decision-making and methods for synthesising evidence from laboratory research, disease burden, and prevalence studies. The findings highlight a need for methodological development that aligns with the realities of public health practice, particularly in emergency contexts. This review provides a key repository for methodologists, researchers, and decision-makers in public health, as well as clinical medicine and health care in Europe and worldwide, supporting the evolution of more inclusive and adaptable approaches to public health evidence synthesis and decision-making.
{"title":"Conducting evidence synthesis and developing evidence-based advice in public health and beyond: A scoping review and map of methods guidance.","authors":"Ani Movsisyan, Kolahta Asres Ioab, Jan William Himmels, Gina Loretta Bantle, Andreea Dobrescu, Signe Flottorp, Frode Forland, Arianna Gadinger, Christina Koscher-Kien, Irma Klerings, Joerg J Meerpohl, Barbara Nussbaumer-Streit, Brigitte Strahwald, Eva A Rehfuess","doi":"10.1017/rsm.2025.10051","DOIUrl":"10.1017/rsm.2025.10051","url":null,"abstract":"<p><p>Effective public health decision-making relies on rigorous evidence synthesis and transparent processes to facilitate its use. However, existing methods guidance has primarily been developed within clinical medicine and may not sufficiently address the complexities of public health, such as population-level considerations, multiple evidence streams, and time-sensitive decision-making. This work contributes to the European Centre for Disease Prevention and Control initiative on methods guidance development for evidence synthesis and evidence-based public health advice by systematically identifying and mapping guidance from health and health-related disciplines.Structured searches were conducted across multiple scientific databases and websites of key institutions, followed by screening and data coding. Of the 17,386 records identified, 247 documents were classified as 'guidance products' providing a set of principles or recommendations on the overall process of developing evidence synthesis and evidence-based advice. While many were classified as 'generic' in scope, a majority originated from clinical medicine and focused on systematic reviews of intervention effects. Only 41 documents explicitly addressed public health. Key gaps included approaches for rapid evidence synthesis and decision-making and methods for synthesising evidence from laboratory research, disease burden, and prevalence studies.The findings highlight a need for methodological development that aligns with the realities of public health practice, particularly in emergency contexts. This review provides a key repository for methodologists, researchers, and decision-makers in public health, as well as clinical medicine and health care in Europe and worldwide, supporting the evolution of more inclusive and adaptable approaches to public health evidence synthesis and decision-making.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"17 2","pages":"240-264"},"PeriodicalIF":6.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873621/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146111538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Guidance for manuscript submissions testing the use of generative AI for systematic review and meta-analysis.
Research Synthesis Methods, 17(2): 237-239 | Pub Date: 2026-03-01 | Epub Date: 2025-12-11 | DOI: 10.1017/rsm.2025.10058 | Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873612/pdf/
Oluwaseun Farotimi, Adam Dunn, Caspar J Van Lissa, Joshua Richard Polanin, Dimitris Mavridis, Terri D Pigott
{"title":"Guidance for manuscript submissions testing the use of generative AI for systematic review and meta-analysis.","authors":"Oluwaseun Farotimi, Adam Dunn, Caspar J Van Lissa, Joshua Richard Polanin, Dimitris Mavridis, Terri D Pigott","doi":"10.1017/rsm.2025.10058","DOIUrl":"10.1017/rsm.2025.10058","url":null,"abstract":"","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"17 2","pages":"237-239"},"PeriodicalIF":6.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873612/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146111552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shiny-MAGEC: A Bayesian R shiny application for meta-analysis of censored adverse events.
Research Synthesis Methods, 17(2): 378-388 | Pub Date: 2026-03-01 | Epub Date: 2025-11-24 | DOI: 10.1017/rsm.2025.10052 | Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873611/pdf/
Zihan Zhou, Zizhong Tian, Christine Peterson, Le Bao, Shouhao Zhou
Accurate assessment of adverse event (AE) incidence is critical in clinical research for drug safety. While meta-analysis serves as an essential tool to comprehensively synthesize the evidence across multiple studies, incomplete AE reporting in clinical trials remains a persistent challenge. In particular, AEs occurring below study-specific reporting thresholds are often omitted from publications, leading to left-censored data. Failure to account for these censored AE counts can result in biased AE incidence estimates. We present an R Shiny application that implements a Bayesian meta-analysis model specifically designed to incorporate censored AE data into the estimation process. This interactive tool provides a user-friendly interface for researchers to conduct AE meta-analyses and estimate the AE incidence probability using an unbiased approach. It also enables direct comparisons between models that either incorporate or ignore censoring, highlighting the biases introduced by conventional approaches. This tutorial demonstrates the Shiny application's functionality through an illustrative example on meta-analysis of PD-1/PD-L1 inhibitor safety and highlights the importance of this tool in improving AE risk assessment. Ultimately, the new Shiny app facilitates more accurate and transparent drug safety evaluations. The Shiny-MAGEC app is available at: https://zihanzhou98.shinyapps.io/Shiny-MAGEC/.
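The modelling idea behind handling left-censored AE counts can be sketched as follows: when an AE is omitted because it fell below a study's reporting threshold, its likelihood contribution is the probability of a count below that threshold rather than a hard zero. The sketch uses a generic binomial model with a single shared incidence probability and illustrative data; it is not the hierarchical likelihood implemented in Shiny-MAGEC:

```python
from scipy.stats import binom

def censored_ae_loglik(p, n_patients, ae_counts, thresholds):
    """Log-likelihood for AE counts with study-specific reporting thresholds.

    p:          assumed AE incidence probability (shared across studies here)
    n_patients: patients at risk in each study
    ae_counts:  reported AE count, or None if the AE was omitted (censored)
    thresholds: minimum count a study had to observe before reporting the AE
    """
    loglik = 0.0
    for n, x, c in zip(n_patients, ae_counts, thresholds):
        if x is None:
            # Left-censored: all we know is that the count was below c,
            # i.e., P(X <= c - 1) under the binomial model
            loglik += binom.logcdf(c - 1, n, p)
        else:
            loglik += binom.logpmf(x, n, p)
    return loglik

# Illustrative data: the third study omitted the AE (below its threshold of 5)
print(censored_ae_loglik(0.04, [120, 200, 80], [6, 9, None], [3, 3, 5]))
```

Treating the omitted count as zero instead of censored would pull the incidence estimate downward, which is the bias the application is designed to expose and correct.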
{"title":"Shiny-MAGEC: A Bayesian R shiny application for meta-analysis of censored adverse events.","authors":"Zihan Zhou, Zizhong Tian, Christine Peterson, Le Bao, Shouhao Zhou","doi":"10.1017/rsm.2025.10052","DOIUrl":"10.1017/rsm.2025.10052","url":null,"abstract":"<p><p>Accurate assessment of adverse event (AE) incidence is critical in clinical research for drug safety. While meta-analysis serves as an essential tool to comprehensively synthesize the evidence across multiple studies, incomplete AE reporting in clinical trials remains a persistent challenge. In particular, AEs occurring below study-specific reporting thresholds are often omitted from publications, leading to left-censored data. Failure to account for these censored AE counts can result in biased AE incidence estimates. We present an R Shiny application that implements a Bayesian meta-analysis model specifically designed to incorporate censored AE data into the estimation process. This interactive tool provides a user-friendly interface for researchers to conduct AE meta-analyses and estimate the AE incidence probability using an unbiased approach. It also enables direct comparisons between models that either incorporate or ignore censoring, highlighting the biases introduced by conventional approaches. This tutorial demonstrates the Shiny application's functionality through an illustrative example on meta-analysis of PD-1/PD-L1 inhibitor safety and highlights the importance of this tool in improving AE risk assessment. Ultimately, the new Shiny app facilitates more accurate and transparent drug safety evaluations. The Shiny-MAGEC app is available at: https://zihanzhou98.shinyapps.io/Shiny-MAGEC/.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"17 2","pages":"378-388"},"PeriodicalIF":6.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873611/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146111634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}