Meta-analytic-predictive (MAP) priors have been proposed as a generic approach to deriving informative prior distributions, where external empirical data are processed to learn about certain parameter distributions. The use of MAP priors is also closely related to shrinkage estimation (sometimes referred to as dynamic borrowing). A potentially odd situation arises when the external data consist of only a single study. Conceptually, this is not a problem; it only implies that certain prior assumptions gain in importance and need to be specified with particular care. We outline this important and not uncommon special case and demonstrate its implementation and interpretation based on the normal-normal hierarchical model. The approach is illustrated using example applications in clinical medicine.
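A minimal Monte Carlo sketch of a single-study MAP prior under the normal-normal hierarchical model. All numbers are illustrative assumptions, not taken from the paper: observed estimate y = 0.2 with standard error s = 0.1, an improper flat prior on the overall mean mu, and a half-normal(0.5) prior on the heterogeneity tau (the prior whose specification the abstract says becomes especially important with one study).

```python
import numpy as np

rng = np.random.default_rng(0)

def map_prior_samples(y, s, tau_scale=0.5, n=100_000):
    """Monte Carlo draws from a MAP prior based on a single study.

    Normal-normal hierarchical model with a flat prior on the overall mean mu
    and a half-normal(tau_scale) prior on the heterogeneity tau.  Given tau,
    mu | y ~ N(y, s^2 + tau^2), and the predicted effect in a new study is
    theta_new | mu, tau ~ N(mu, tau^2).
    """
    tau = np.abs(rng.normal(0.0, tau_scale, size=n))        # half-normal draws
    mu = y + np.sqrt(s**2 + tau**2) * rng.normal(size=n)    # mu | y, tau
    theta_new = mu + tau * rng.normal(size=n)               # predictive draw
    return theta_new

draws = map_prior_samples(y=0.2, s=0.1)
```

The spread of `draws` reflects both the single study's standard error and the assumed heterogeneity prior, making explicit how heavily the tau prior drives the resulting MAP prior.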
Meta-analytic-predictive priors based on a single study. Christian Röver, Tim Friede. Research Synthesis Methods, pp. 1-19. DOI: 10.1017/rsm.2026.10081. Published 2026-03-24.
Raphaël Bentegeac, Bastien Le Guellec, Victor Leblanc, Rémi Lenain, Luc Dauchet, Victoria Gauthier, Erwin Gerard, Emmanuel Chazard, Philippe Amouyel, Estelle Aymes, Aghilès Hamroun
The exponential growth of scientific literature poses increasing challenges for evidence synthesis. Systematic reviews (SRs) usually rely on keyword-based database searches, which are limited by inconsistent terminology and indexing delays. Citation searching (identifying studies that cite or are cited by known relevant articles) offers a complementary route to uncover additional evidence but remains poorly automated and poorly integrated into screening workflows. We developed BibliZap, an open-source, fully automated citation-searching tool built on Lens.org data, performing multi-level forward and backward citation searches with relevance-based ranking. Its performance was evaluated across 66 published SRs, comparing five approaches: (1) PubMed-only searches; (2) PubMed followed by BibliZap restricted to the top 500 ranked results; (3) PubMed followed by full BibliZap screening; and (4-5) two exploratory early-stop strategies in which BibliZap was initiated after identifying the first or the first three relevant PubMed records. The primary outcome was sensitivity, with secondary assessments of screening workload and precision. When used after PubMed screening, BibliZap increased mean sensitivity from 75% to 97%, achieving complete recall in over half of the reviews. Screening only the top 500 outputs still allowed over 90% of reviews to reach or exceed 80% recall. BibliZap recovered a median of three additional included articles per review not retrieved by PubMed, while adding a median of 6,450 additional records to screen. Citation searching via BibliZap enhances the completeness of evidence retrieval in SRs based on restricted database searches and supports transparent, scalable workflows adaptable to rapid and exploratory review contexts.
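Multi-level forward and backward citation searching can be sketched as a breadth-first expansion over a citation graph. The graph, paper identifiers, and two-level depth below are hypothetical toy data; this is not BibliZap's actual implementation or Lens.org's API.

```python
# Hypothetical citation graph: paper -> set of papers it cites.
CITES = {
    "seed": {"a", "b"},
    "a": {"c"},
    "b": set(),
    "c": set(),
    "d": {"seed"},   # d cites the seed article
    "e": {"a"},
}

def snowball(seeds, levels=2):
    """Multi-level backward (references) and forward (citing papers)
    citation search over a toy graph, in the spirit of snowballing tools."""
    # Invert the graph once to answer "who cites p?" queries.
    cited_by = {}
    for paper, refs in CITES.items():
        for r in refs:
            cited_by.setdefault(r, set()).add(paper)
    found, frontier = set(seeds), set(seeds)
    for _ in range(levels):
        nxt = set()
        for p in frontier:
            nxt |= CITES.get(p, set())      # backward: references of p
            nxt |= cited_by.get(p, set())   # forward: papers citing p
        frontier = nxt - found
        found |= nxt
    return found - set(seeds)

results = snowball({"seed"})
```

A relevance-based ranking, as in BibliZap, could then order `results` by how many links connect each candidate back to the seed set.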
BibliZap: An exploratory evaluation of an automated multi-level citation searching tool for systematic and rapid reviews. Research Synthesis Methods, pp. 1-14. DOI: 10.1017/rsm.2026.10079. Published 2026-03-24.
Tianqi Yu, Silvia Metelli, Theodoros Papakonstantinou, Anna Chaimani
Network meta-analysis (NMA) is a vital methodology for synthesizing evidence across multiple treatments and informing medical decision-making. However, effective visualization and interpretation of results from large networks of interventions remain challenging, particularly for non-specialists. NMAstudio 2.0 is an innovative, interactive web application designed to address these difficulties by streamlining NMA workflows and enhancing result visualization. Developed using Python and R, NMAstudio 2.0 seamlessly integrates with established NMA frameworks. Our exemplar application of NMAstudio 2.0 using a Cochrane Review comparing several treatments for chronic plaque psoriasis demonstrates its capacity to facilitate all crucial steps of an NMA. The application features an intuitive interface for uploading data, automating analyses, and generating interactive visualizations such as network diagrams, forest plots, and ranking plots, as well as unique outputs such as boxplots for transitivity checks and bidimensional forest plots. Most outputs are dynamically linked with the network diagram, enabling users to interactively explore evidence networks, apply advanced filtering, and highlight specific features by selecting nodes or edges within the diagram. While NMAstudio 2.0 aims to simplify NMAs, it also incorporates steps during the data upload process to mitigate the risk of producing poorly reported NMAs. NMAstudio 2.0 represents a significant step forward in improving the usability and accessibility of NMA, offering researchers a robust, versatile platform for evidence synthesis. Its integration of advanced features with an emphasis on user experience positions it as a valuable resource for enhancing decision-making and promoting evidence-based practice across diverse contexts.
NMAstudio 2.0: An interactive tool for network meta-analysis to enhance understanding, interpretation, and communication of the findings. Research Synthesis Methods, pp. 1-14. DOI: 10.1017/rsm.2026.10074. Published 2026-03-06.
Dazheng Zhang, Bingyu Zhang, Lu Li, Haitao Chu, Yong Chen
Meta-analysis synthesizes evidence from multiple randomized clinical trials and informs evidence-based practices across various medical domains. Recently, causally interpretable meta-analysis has been proposed and applied to treatment evaluations for target populations, requiring individual participant data (IPD). Standard meta-analysis assumes transportability or exchangeability of a (conditional) relative effect (such as a relative risk or odds ratio), which may be violated when the relative effects are correlated with the baseline risks across clinical trials. In addition, the weighted average of some study-specific effect measures, such as (log) odds ratios or (log) hazard ratios, is non-collapsible and does not correspond to any target population. Furthermore, when the randomization ratios between treated and untreated arms vary across trials, confounding bias may occur. To address these challenges, we propose a causal meta-analysis (CMA) framework using only aggregated data, enabling causally interpretable and accurate estimation for different target populations. The CMA adjusts its weights to target treatment effects in various populations, including the average treatment effect (ATE), the ATE on the treated (ATT), the ATE on the controls (ATC), and the ATE in the overlap population (ATO). Mathematically, we establish the connection between traditional meta-analysis estimators and CMAs. For example, Mantel-Haenszel weighted meta-analysis is equivalent to the CMA targeting the ATO.
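The claimed Mantel-Haenszel/ATO equivalence can be checked numerically for risk differences: with within-trial randomization probability e = n1/N, the overlap (ATO) weight e(1-e)N equals the Mantel-Haenszel weight n1*n0/N. A sketch with hypothetical 2x2 trial data (not from the paper):

```python
# Per-trial 2x2 data: (events_treated, n_treated, events_control, n_control),
# with deliberately unequal randomization ratios across trials.
trials = [(12, 100, 8, 50), (30, 200, 40, 400), (5, 80, 6, 60)]

def mh_risk_difference(data):
    """Mantel-Haenszel pooled risk difference (weight n1*n0/N per trial)."""
    num = den = 0.0
    for a, n1, c, n0 in data:
        w = n1 * n0 / (n1 + n0)
        num += w * (a / n1 - c / n0)
        den += w
    return num / den

def ato_risk_difference(data):
    """Overlap-weighted (ATO) average of trial-level risk differences.
    With randomization ratio e = n1/N, the ATO weight is e*(1-e)*N = n1*n0/N.
    """
    num = den = 0.0
    for a, n1, c, n0 in data:
        N = n1 + n0
        e = n1 / N
        w = e * (1 - e) * N
        num += w * (a / n1 - c / n0)
        den += w
    return num / den

# The two weighting schemes coincide term by term.
assert abs(mh_risk_difference(trials) - ato_risk_difference(trials)) < 1e-12
```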
A causal meta-analysis framework for clinical trials with unequal randomization ratios. Research Synthesis Methods, pp. 1-12. DOI: 10.1017/rsm.2025.10069. Published 2026-03-05.
Pub Date: 2026-03-01. Epub Date: 2025-11-17. DOI: 10.1017/rsm.2025.10056
Javier Bracchiglione, Nicolás Meza, Dawid Pieper, Carole Lunny, Manuel Vargas-Peirano, Johanna Vicuña, Fernando Briceño, Roberto Garnham Parra, Ignacio Pérez Carrasco, Gerard Urrútia, Xavier Bonfill, Eva Madrid
Overlap of primary studies among multiple systematic reviews (SRs) is a major challenge when conducting overviews. The corrected covered area (CCA) is a metric computed from a matrix of evidence that quantifies overlap. Therefore, the assumptions used to generate the matrix may significantly affect the CCA. We aim to explore how these varying assumptions influence CCA calculations. We searched two databases for intervention-focused overviews published during 2023. Two reviewers conducted study selection and data extraction. We extracted overview characteristics and methods to handle overlap. For seven sampled overviews, we calculated overall and pairwise CCA across 16 scenarios, representing four matrix-construction assumptions. Of 193 included overviews, only 23 (11.9%) adhered to an overview-specific reporting guideline (e.g. PRIOR). Eighty-five (44.0%) did not address overlap; 14 (7.3%) only mentioned it in the discussion; and 94 (48.7%) incorporated it into methods or results (38 using CCA). Among the seven sampled overviews, CCA values varied depending on matrix-construction assumptions, ranging from 1.2% to 13.5% with the overall method and 0.0% to 15.7% with the pairwise method. CCA values may vary depending on the assumptions made during matrix construction, including scope, treatment of structural missingness, and handling of publication threads. This variability calls into question the uncritical use of current CCA thresholds and underscores the need for overview authors to report both overall and pairwise CCA calculations. Our preliminary guidance for transparently reporting matrix-construction assumptions may improve the accuracy and reproducibility of CCA assessments.
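The CCA itself follows the standard formula CCA = (N - r) / (rc - r), where N is the total number of inclusions in the evidence matrix, r the number of index (primary) studies, and c the number of reviews. A minimal sketch with a hypothetical 4-study-by-3-review inclusion matrix:

```python
def corrected_covered_area(matrix):
    """Corrected covered area from a study-by-review inclusion matrix:
    CCA = (N - r) / (r*c - r), with N = total inclusions, r = rows (primary
    studies), c = columns (reviews)."""
    r = len(matrix)
    c = len(matrix[0])
    N = sum(sum(row) for row in matrix)
    return (N - r) / (r * c - r)

# Hypothetical matrix: 1 = primary study (row) included in review (column).
m = [[1, 1, 0],
     [1, 0, 0],
     [0, 1, 1],
     [1, 0, 1]]
cca = corrected_covered_area(m)
```

The matrix-construction assumptions the paper examines (scope, structural missingness, publication threads) all change the entries of `m`, and hence the CCA, before this formula is ever applied.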
Impact of matrix-construction assumptions on quantitative overlap assessment in overviews: A meta-research study. Research Synthesis Methods 17(2), pp. 348-364. DOI: 10.1017/rsm.2025.10056. Published 2026-03-01. Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873615/pdf/
Pub Date: 2026-03-01. Epub Date: 2025-11-13. DOI: 10.1017/rsm.2025.10044
Antonio Sciurti, Giuseppe Migliara, Leonardo Maria Siena, Claudia Isonne, Maria Roberta De Blasiis, Alessandra Sinopoli, Jessica Iera, Carolina Marzuillo, Corrado De Vito, Paolo Villari, Valentina Baccolini
Systematic reviews play a critical role in evidence-based research but are labor-intensive, especially during title and abstract screening. Compact large language models (LLMs) offer the potential to automate this process, balancing time and cost requirements against accuracy. The aim of this study is to assess the feasibility, accuracy, and workload reduction of three compact LLMs (GPT-4o mini, Llama 3.1 8B, and Gemma 2 9B) in screening titles and abstracts. Records were sourced from three previously published systematic reviews, and the LLMs were asked to rate each record from 0 to 100 for inclusion using a structured prompt. Predefined rating thresholds of 25, 50, and 75 were used to compute performance metrics (balanced accuracy, sensitivity, specificity, positive and negative predictive value, and workload saving). Processing time and costs were recorded. Across the systematic reviews, the LLMs achieved high sensitivity (up to 100%) but low precision (below 10%) for records included at full text. Specificity and workload savings improved at higher thresholds, with the 50- and 75-rating thresholds offering the best trade-offs. GPT-4o mini, accessed via an application programming interface, was the fastest model (up to ~40 minutes) and incurred usage costs of $0.14-$1.93 per review. Llama 3.1 8B and Gemma 2 9B were run locally with longer processing times (up to ~4 hours) and were free to use. The LLMs were highly sensitive tools for the title/abstract screening process. High specificity values were reached, allowing for significant workload savings at reasonable cost and processing time. Conversely, we found them to be imprecise. However, high sensitivity and workload reduction are the key factors for their use in the title/abstract screening phase of systematic reviews.
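The threshold-based evaluation can be sketched as follows. The ratings, labels, and the workload-saving definition used here (fraction of records the screener never has to read by hand) are illustrative assumptions, not the study's exact operationalization:

```python
def screening_metrics(ratings, labels, threshold):
    """Confusion-matrix metrics for LLM title/abstract screening:
    a record is predicted 'include' if its 0-100 rating >= threshold."""
    tp = fp = tn = fn = 0
    for r, y in zip(ratings, labels):
        pred = r >= threshold
        if pred and y:
            tp += 1
        elif pred and not y:
            fp += 1
        elif not pred and not y:
            tn += 1
        else:
            fn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    workload_saving = (tn + fn) / len(labels)  # records excluded automatically
    return sensitivity, specificity, precision, workload_saving

# Hypothetical LLM ratings and human full-text inclusion labels (1 = included).
ratings = [90, 80, 60, 40, 30, 20, 10, 5]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
sens, spec, prec, saved = screening_metrics(ratings, labels, threshold=50)
```

Raising the threshold trades sensitivity for specificity and workload saving, which is the trade-off the abstract describes at the 50- and 75-rating cutoffs.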
Compact large language models for title and abstract screening in systematic reviews: An assessment of feasibility, accuracy, and workload reduction. Research Synthesis Methods 17(2), pp. 332-347. DOI: 10.1017/rsm.2025.10044. Published 2026-03-01. Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873614/pdf/
Pub Date: 2026-03-01. Epub Date: 2025-11-17. DOI: 10.1017/rsm.2025.10054
Delphine S Courvoisier, Diana Buitrago-Garcia, Clément P Buclin, Nils Bürgisser, Michele Iudici, Denis Mongin
Meta-research and evidence synthesis require considerable resources. Large language models (LLMs) have emerged as promising tools to assist in these processes, yet their performance varies across models, limiting their reliability. Taking advantage of the wide availability of small (<10 billion parameter) open-source LLMs, we implemented an agreement-based framework in which a decision is taken only if at least a given number of LLMs produce the same response; otherwise, the decision is withheld. This approach was tested on 1,020 abstracts of randomized controlled trials in rheumatology, using two classic literature review tasks: (1) classifying each intervention as drug or nondrug based on text interpretation and (2) extracting the total number of randomized patients, a task that sometimes required calculations. Re-examining abstracts where at least four LLMs disagreed with the human gold standard (dual review with adjudication) allowed us to construct an improved gold standard. Compared to the human gold standard and single large LLMs (>70 billion parameters), our framework demonstrated robust performance: several model combinations (e.g., 3 of 5, 4 of 6, or 5 of 7 models) achieved accuracies above 95%, exceeding the human gold standard, on at least 85% of abstracts. Performance variability across individual models was not an issue, as low-performing models contributed fewer accepted decisions. This agreement-based framework offers a scalable solution that can replace human reviewers for most abstracts, reserving human expertise for more complex cases. Such frameworks could significantly reduce the manual burden of systematic reviews while maintaining high accuracy and reproducibility.
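The agreement rule itself can be sketched in a few lines; the model responses below are hypothetical examples, not the study's data:

```python
from collections import Counter

def agreement_decision(responses, min_agree):
    """Accept a decision only if at least `min_agree` models give the same
    response; otherwise withhold it (return None) for human review."""
    answer, n = Counter(responses).most_common(1)[0]
    return answer if n >= min_agree else None

# E.g. a 3-of-5 rule on one intervention-classification task:
assert agreement_decision(["drug", "drug", "drug", "nondrug", "drug"], 3) == "drug"
# Without sufficient agreement the abstract is escalated to a human:
assert agreement_decision(["drug", "drug", "nondrug", "nondrug", "drug"], 4) is None
```

Under such a rule, a weak model rarely forms part of a winning majority, which matches the abstract's observation that low-performing models contribute fewer accepted decisions.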
Beyond human gold standards: A multimodel framework for automated abstract classification and information extraction. Research Synthesis Methods 17(2), pp. 365-377. DOI: 10.1017/rsm.2025.10054. Published 2026-03-01. Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873610/pdf/
Pub Date: 2026-03-01. Epub Date: 2025-11-13. DOI: 10.1017/rsm.2025.10050
Juyoung Jung, Ariel M Aloe
Bayesian hierarchical models offer a principled framework for adjusting for study-level bias in meta-analysis, but their complexity and sensitivity to prior specifications necessitate a systematic framework for robust application. This study demonstrates the application of a Bayesian workflow to this challenge, comparing a standard random-effects model to a bias-adjustment model across a real-world dataset and a targeted simulation study. The workflow revealed a high sensitivity of results to the prior on bias probability, showing that while the simpler random-effects model had superior predictive accuracy as measured by the widely applicable information criterion, the bias-adjustment model successfully propagated uncertainty by producing wider, more conservative credible intervals. The simulation confirmed the model's ability to recover true parameters when priors were well-specified. These results establish the Bayesian workflow as a principled framework for diagnosing model sensitivities and ensuring the transparent application of complex bias-adjustment models in evidence synthesis.
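The uncertainty-propagation mechanism can be illustrated with a much simpler stand-in than the paper's hierarchical model: inflating each study's variance by a bias term necessarily widens the pooled interval. This is a schematic analogue with made-up numbers, not the authors' bias-adjustment model:

```python
import math

def pooled_mean_sd(estimates, ses, extra_var=0.0):
    """Inverse-variance pooled mean and its standard deviation, optionally
    inflating each study's variance by `extra_var` to propagate possible
    study-level bias into the pooled uncertainty."""
    weights = [1.0 / (se**2 + extra_var) for se in ses]
    mean = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    sd = math.sqrt(1.0 / sum(weights))
    return mean, sd

# Hypothetical study estimates and standard errors.
y = [0.3, 0.1, 0.25]
s = [0.1, 0.15, 0.12]
m0, sd0 = pooled_mean_sd(y, s)                   # unadjusted pooling
m1, sd1 = pooled_mean_sd(y, s, extra_var=0.05)   # bias-adjusted: wider interval
```

The bias-adjusted `sd1` exceeds `sd0`, mirroring the paper's finding that the bias-adjustment model yields wider, more conservative credible intervals than the plain random-effects model.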
Bayesian workflow for bias-adjustment model in meta-analysis. Research Synthesis Methods 17(2), pp. 293-313. DOI: 10.1017/rsm.2025.10050. Published 2026-03-01. Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873618/pdf/
Pub Date: 2026-03-01. Epub Date: 2025-11-13. DOI: 10.1017/rsm.2025.10049
Michael Pearce, Shouhao Zhou
Ranking multiple interventions is a crucial task in network meta-analysis (NMA) to guide clinical and policy decisions. However, conventional ranking methods often oversimplify treatment distinctions, potentially yielding misleading conclusions due to inherent uncertainty in relative intervention effects. To address these limitations, we propose a novel Bayesian rank-clustering estimation approach, termed rank-clustering estimation (RaCE), specifically developed for NMA. Rather than identifying a single "best" intervention, RaCE enables the probabilistic clustering of interventions with similar effectiveness, offering a more nuanced and parsimonious interpretation. By decoupling the clustering procedure from the NMA modeling process, RaCE is a flexible and broadly applicable approach that can accommodate different types of outcomes (binary, continuous, and survival), modeling approaches (arm-based and contrast-based), and estimation frameworks (frequentist or Bayesian). Simulation studies demonstrate that RaCE effectively captures rank-clusters even under conditions of substantial uncertainty and overlapping intervention effects, providing more reasonable result interpretation than traditional single-ranking methods. We illustrate the practical utility of RaCE through an NMA application to frontline immunochemotherapies for follicular lymphoma, revealing clinically relevant clusters among treatments previously assumed to have distinct ranks. Overall, RaCE provides a valuable tool for researchers to enhance rank estimation and interpretability, facilitating evidence-based decision-making in complex intervention landscapes.
"RaCE: A rank-clustering estimation method for network meta-analysis." Research Synthesis Methods 17(2): 314-331. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873617/pdf/
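The idea of clustering interventions with overlapping effects can be illustrated with a toy Monte Carlo computation. This is not the RaCE algorithm: the four treatments, their posterior means and standard deviations, and the 0.35-0.65 overlap threshold below are all hypothetical, and RaCE performs model-based probabilistic rank-clustering rather than this simple pairwise heuristic.

```python
import random

random.seed(1)

# Hypothetical posterior summaries for four treatments (higher = better).
means = {"A": 0.50, "B": 0.48, "C": 0.20, "D": 0.18}
sds = {t: 0.10 for t in means}
n = 4000
draws = {t: [random.gauss(means[t], sds[t]) for _ in range(n)] for t in means}

# Probability that each treatment ranks first across posterior draws.
p_best = {t: 0.0 for t in means}
for i in range(n):
    best = max(means, key=lambda t: draws[t][i])
    p_best[best] += 1 / n

# Crude pairwise heuristic: call two treatments rank-clustered when
# P(one beats the other) is close to 1/2 (here: between 0.35 and 0.65).
def p_better(t, u):
    return sum(draws[t][i] > draws[u][i] for i in range(n)) / n

pairs = [(t, u) for t in means for u in means if t < u]
clustered = [(t, u) for t, u in pairs if 0.35 < p_better(t, u) < 0.65]
print(p_best)
print(clustered)  # A/B and C/D overlap heavily; the two groups do not
```

Note how A and B split the "best" probability almost evenly even though one of them always tops a naive single ranking, which is exactly the kind of spurious distinction that rank-clustering avoids.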
Pub Date: 2026-03-01. Epub Date: 2025-10-23. DOI: 10.1017/rsm.2025.10042
Romy Menghao Jia, Cindy Stern
Critical appraisal is a core component of JBI qualitative evidence synthesis, offering insights into the quality of included studies and their potential influence on synthesized findings. However, limited guidance exists on whether, when, and how to exclude studies based on appraisal results. This study examined the methods used in JBI qualitative systematic reviews and the implications for synthesized findings. A systematic analysis of qualitative reviews published between 2018 and 2022 in JBI Evidence Synthesis was conducted. Data on decisions and their justifications were extracted from reviews and protocols. Descriptive and content analysis explored variations in the reported methods. Forty-five reviews were included. Reported approaches varied widely: 24% of reviews included all studies regardless of quality, while others applied exclusion criteria (36%), cutoff scores (11%), or multiple methods (9%). Limited justifications were provided for the approaches, and few reviews cited methodological references to support their decisions. Review authors reported their approach in various sections of the review, with inconsistencies identified in 18% of the sample. Unclear or ambiguous descriptions were also identified in 18% of the included reviews. No clear differences were observed in ConQual scores between reviews that excluded studies and those that did not. Overall, the variability raises concerns about the credibility, transparency, and reproducibility of JBI qualitative systematic reviews. Decisions regarding the inclusion or exclusion of studies based on critical appraisal need to be clearly justified and consistently reported. Further methodological research is needed to support rigorous decision-making and to improve the reliability of synthesized findings.
"The inclusion or exclusion of studies based on critical appraisal results in JBI qualitative systematic reviews: An analysis of practices." Research Synthesis Methods 17(2): 277-292. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873616/pdf/