Pub Date: 2026-01-01 | Epub Date: 2025-10-27 | DOI: 10.1017/rsm.2025.10032
Carole Lunny, Nityanand Jain, Tina Nazari, Melodi Kosaner-Kließ, Lucas Santos, Ian Goodman, Alaa A M Osman, Stefano Berrone, Mohammad Najm Dadam, Connor T A Brenna, Heba Hussein, Gioia Dahdal, Diana Cespedes A, Nicola Ferri, Salmaan Kanji, Yuan Chi, Dawid Pieper, Beverly Shea, Amanda Parker, Dipika Neupane, Paul A Khan, Daniella Rangira, Kat Kolaski, Ben Ridley, Amina Berour, Kevin Sun, Radin Hamidi Rad, Zihui Ouyang, Emma K Reid, Iván Pérez-Neri, Sanabel O Barakat, Silvia Bargeri, Silvia Gianola, Greta Castellini, Sera Whitelaw, Adrienne Stevens, Shailesh B Kolekar, Kristy Wong, Paityn Major, Ebrahim Bagheri, Andrea C Tricco
AMSTAR-2 (A Measurement Tool to Assess Systematic Reviews, version 2) and ROBIS are tools used to assess the methodological quality of and risk of bias in a systematic review (SR). We applied AMSTAR-2 and ROBIS to a sample of 200 published SRs. We investigated the overlap in their methodological constructs, responses by item and overall, percentage agreement, direction of effect, and timing of assessments. AMSTAR-2 contains 16 items and ROBIS 24 items. Three items in AMSTAR-2 and nine in ROBIS did not overlap in construct. Of the 200 SRs, 73% were rated low or critically low quality using AMSTAR-2, and 81% were rated as having a high risk of bias using ROBIS. The median time to complete AMSTAR-2 and ROBIS was 51 and 64 minutes, respectively. When assessment times were calibrated to the number of items in each tool, items took an average of 3.2 minutes for AMSTAR-2 compared with 2.7 minutes for ROBIS. Nine percent of SRs had opposing ratings (i.e., high quality on AMSTAR-2 but high risk of bias on ROBIS). In both tools, three-quarters of items showed more than 70% agreement between raters after extensive training and piloting. AMSTAR-2 and ROBIS provide complementary rather than interchangeable assessments of systematic reviews. AMSTAR-2 may be preferable when efficiency is prioritized and methodological rigour is the focus, whereas ROBIS offers a deeper examination of potential biases and external validity. Given the widespread reliance on systematic reviews for policy and practice, selecting the appropriate appraisal tool remains crucial. Future research should explore strategies to integrate the strengths of both instruments while minimizing the burden on assessors.
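The per-item times quoted above follow directly from the reported medians and item counts; a minimal arithmetic sketch using only the figures stated in the abstract:

```python
# Per-item assessment time: median completion time divided by item count,
# using only the numbers reported in the abstract above.
amstar2_items, robis_items = 16, 24
amstar2_median_min, robis_median_min = 51, 64

print(round(amstar2_median_min / amstar2_items, 1))  # 3.2 minutes per AMSTAR-2 item
print(round(robis_median_min / robis_items, 1))      # 2.7 minutes per ROBIS item
```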
{"title":"Exploring the methodological quality and risk of bias in 200 systematic reviews: A comparative study of ROBIS and AMSTAR-2 tools.","authors":"Carole Lunny, Nityanand Jain, Tina Nazari, Melodi Kosaner-Kließ, Lucas Santos, Ian Goodman, Alaa A M Osman, Stefano Berrone, Mohammad Najm Dadam, Connor T A Brenna, Heba Hussein, Gioia Dahdal, Diana Cespedes A, Nicola Ferri, Salmaan Kanji, Yuan Chi, Dawid Pieper, Beverly Shea, Amanda Parker, Dipika Neupane, Paul A Khan, Daniella Rangira, Kat Kolaski, Ben Ridley, Amina Berour, Kevin Sun, Radin Hamidi Rad, Zihui Ouyang, Emma K Reid, Iván Pérez-Neri, Sanabel O Barakat, Silvia Bargeri, Silvia Gianola, Greta Castellini, Sera Whitelaw, Adrienne Stevens, Shailesh B Kolekar, Kristy Wong, Paityn Major, Ebrahim Bagheri, Andrea C Tricco","doi":"10.1017/rsm.2025.10032","DOIUrl":"10.1017/rsm.2025.10032","url":null,"abstract":"<p><p>AMSTAR-2 (A Measurement Tool to Assess Systematic Reviews, version 2) and ROBIS are tools used to assess the methodological quality and the risk of bias in a systematic review (SR). We applied AMSTAR-2 and ROBIS to a sample of 200 published SRs. We investigated the overlap in their methodological constructs, responses by item, and overall, percentage agreement, direction of effect, and timing of assessments. AMSTAR-2 contains 16 items and ROBIS 24 items. Three items in AMSTAR-2 and nine in ROBIS did not overlap in construct. Of the 200 SRs, 73% were low or critically low quality using AMSTAR-2, and 81% had a high risk of bias using ROBIS. The median time to complete AMSTAR-2 and ROBIS was 51 and 64 minutes, respectively. When assessment times were calibrated to the number of items in each tool, each item took an average of 3.2 minutes per item for AMSTAR-2 compared to 2.7 minutes for ROBIS. Nine percent of SRs had opposing ratings (i.e., AMSTAR-2 was high quality while ROBIS was high risk). In both tools, three-quarters of items showed more than 70% agreement between raters after extensive training and piloting. AMSTAR-2 and ROBIS provide complementary rather than interchangeable assessments of systematic reviews. AMSTAR-2 may be preferable when efficiency is prioritized and methodological rigour is the focus, whereas ROBIS offers a deeper examination of potential biases and external validity. Given the widespread reliance on systematic reviews for policy and practice, selecting the appropriate appraisal tool remains crucial. Future research should explore strategies to integrate the strengths of both instruments while minimizing the burden on assessors.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"17 1","pages":"63-92"},"PeriodicalIF":6.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12823211/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-01 | Epub Date: 2025-09-17 | DOI: 10.1017/rsm.2025.10030
Yuki Kataoka, Tomohiro Takayama, Keisuke Yoshimura, Ryuhei So, Yasushi Tsujimoto, Yosuke Yamagishi, Shiro Takagi, Yuki Furukawa, Masatsugu Sakata, Đorđe Bašić, Andrea Cipriani, Pim Cuijpers, Eirini Karyotaki, Mathias Harrer, Stefan Leucht, Ava Homiar, Edoardo G Ostinelli, Clara Miguel, Alessandro Rodolico, Toshi A Furukawa
Large language models have shown promise for automating data extraction (DE) in systematic reviews (SRs), but most existing approaches require manual interaction. We developed an open-source system using GPT-4o to extract data automatically, with no human intervention during the extraction process. We developed the system on a dataset of 290 randomized controlled trials (RCTs) from a published SR of cognitive behavioral therapy for insomnia. We evaluated the system on two other datasets: 5 RCTs from an updated search for the same review and 10 RCTs used in a separate published study that had also evaluated automated DE. We selected the best-performing approach across all variables in the development dataset using GPT-4o. Performance in the updated-search dataset using o3 was 74.9% sensitivity, 76.7% specificity, 75.7% precision, 93.5% variable detection comprehensiveness, and 75.3% accuracy. In both datasets, accuracy was higher for string variables (e.g., country, study design, drug names, and outcome definitions) than for numeric variables. In the third, external validation dataset, GPT-4o showed lower performance, with a mean accuracy of 84.4%, compared with the previous study. However, by adjusting our DE method while maintaining the same prompting technique, we achieved a mean accuracy of 96.3%, comparable to the previous manual extraction study. Our system shows potential for assisting the DE of string variables alongside a human reviewer. However, it cannot yet replace humans for numeric DE. Further evaluation across diverse review contexts is needed to establish broader applicability.
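For orientation, a minimal sketch of how the evaluation metrics reported above relate to extraction counts; the confusion-matrix counts here are hypothetical and not taken from the study:

```python
# Hypothetical confusion-matrix counts for extracted data items.
tp, fp, fn, tn = 150, 45, 50, 155

sensitivity = tp / (tp + fn)                # share of true items the system recovered
specificity = tn / (tn + fp)                # share of absent items correctly not extracted
precision = tp / (tp + fp)                  # share of extracted items that were correct
accuracy = (tp + tn) / (tp + fp + fn + tn)  # overall share of correct decisions

print(f"{sensitivity:.1%} {specificity:.1%} {precision:.1%} {accuracy:.1%}")
```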
{"title":"Automating the data extraction process for systematic reviews using GPT-4o and o3.","authors":"Yuki Kataoka, Tomohiro Takayama, Keisuke Yoshimura, Ryuhei So, Yasushi Tsujimoto, Yosuke Yamagishi, Shiro Takagi, Yuki Furukawa, Masatsugu Sakata, Đorđe Bašić, Andrea Cipriani, Pim Cuijpers, Eirini Karyotaki, Mathias Harrer, Stefan Leucht, Ava Homiar, Edoardo G Ostinelli, Clara Miguel, Alessandro Rodolico, Toshi A Furukawa","doi":"10.1017/rsm.2025.10030","DOIUrl":"10.1017/rsm.2025.10030","url":null,"abstract":"<p><p>Large language models have shown promise for automating data extraction (DE) in systematic reviews (SRs), but most existing approaches require manual interaction. We developed an open-source system using GPT-4o to automatically extract data with no human intervention during the extraction process. We developed the system on a dataset of 290 randomized controlled trials (RCTs) from a published SR about cognitive behavioral therapy for insomnia. We evaluated the system on two other datasets: 5 RCTs from an updated search for the same review and 10 RCTs used in a separate published study that had also evaluated automated DE. We developed the best approach across all variables in the development dataset using GPT-4o. The performance in the updated-search dataset using o3 was 74.9% sensitivity, 76.7% specificity, 75.7 precision, 93.5% variable detection comprehensiveness, and 75.3% accuracy. In both datasets, accuracy was higher for string variables (e.g., country, study design, drug names, and outcome definitions) compared with numeric variables. In the third external validation dataset, GPT-4o showed a lower performance with a mean accuracy of 84.4% compared with the previous study. However, by adjusting our DE method, while maintaining the same prompting technique, we achieved a mean accuracy of 96.3%, which was comparable to the previous manual extraction study. Our system shows potential for assisting the DE of string variables alongside a human reviewer. However, it cannot yet replace humans for numeric DE. Further evaluation across diverse review contexts is needed to establish broader applicability.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"17 1","pages":"42-62"},"PeriodicalIF":6.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12823200/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-01 | Epub Date: 2025-10-02 | DOI: 10.1017/rsm.2025.10035
Weilun Wu, Jianhua Duan, W Robert Reed, Elizabeth Tipton
This study analyzes 1,000 meta-analyses drawn from 10 disciplines, including medicine, psychology, education, biology, and economics, to document and compare methodological practices across fields. We find large differences in the size of meta-analyses, the number of effect sizes per study, and the types of effect sizes used. Disciplines also vary in their use of unpublished studies, the frequency and type of tests for publication bias, and whether they attempt to correct for it. Notably, many meta-analyses include multiple effect sizes from the same study, yet fail to account for statistical dependence in their analyses. We document the limited use of advanced methods, such as multilevel models and cluster-adjusted standard errors, that can accommodate dependent data structures. Correlations are frequently used as effect sizes in some disciplines, yet researchers often fail to address the methodological issues this introduces, including biased weighting and misleading tests for publication bias. We also find that meta-regression is underutilized, even when sample sizes are large enough to support it. This work serves as a resource for researchers conducting their first meta-analyses, as a benchmark for researchers designing simulation experiments, and as a reference for applied meta-analysts aiming to improve their methodological practices.
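To illustrate why ignoring within-study dependence matters, here is a small simulation (with assumed values, not data from the 1,000 meta-analyses) comparing a naive standard error that treats every effect size as independent with a cluster-aware one based on study means:

```python
import numpy as np

rng = np.random.default_rng(0)

# 30 studies, each contributing 5 effect sizes that share a study-level effect.
n_studies, k_per_study = 30, 5
study_effects = rng.normal(0.3, 0.2, n_studies)
es = study_effects[:, None] + rng.normal(0, 0.1, (n_studies, k_per_study))

naive_se = es.std(ddof=1) / np.sqrt(es.size)                   # treats all 150 estimates as independent
cluster_se = es.mean(axis=1).std(ddof=1) / np.sqrt(n_studies)  # one aggregated value per study

# The naive SE is markedly smaller than the cluster-aware SE, overstating precision.
print(round(naive_se, 3), round(cluster_se, 3))
```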
{"title":"What can we learn from 1,000 meta-analyses across 10 different disciplines?","authors":"Weilun Wu, Jianhua Duan, W Robert Reed, Elizabeth Tipton","doi":"10.1017/rsm.2025.10035","DOIUrl":"10.1017/rsm.2025.10035","url":null,"abstract":"<p><p>This study analyzes 1,000 meta-analyses drawn from 10 disciplines-including medicine, psychology, education, biology, and economics-to document and compare methodological practices across fields. We find large differences in the size of meta-analyses, the number of effect sizes per study, and the types of effect sizes used. Disciplines also vary in their use of unpublished studies, the frequency and type of tests for publication bias, and whether they attempt to correct for it. Notably, many meta-analyses include multiple effect sizes from the same study, yet fail to account for statistical dependence in their analyses. We document the limited use of advanced methods-such as multilevel models and cluster-adjusted standard errors-that can accommodate dependent data structures. Correlations are frequently used as effect sizes in some disciplines, yet researchers often fail to address the methodological issues this introduces, including biased weighting and misleading tests for publication bias. We also find that meta-regression is underutilized, even when sample sizes are large enough to support it. This work serves as a resource for researchers conducting their first meta-analyses, as a benchmark for researchers designing simulation experiments, and as a reference for applied meta-analysts aiming to improve their methodological practices.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"17 1","pages":"123-156"},"PeriodicalIF":6.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12823205/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-01 | Epub Date: 2025-10-15 | DOI: 10.1017/rsm.2025.10038
Keith Chan, Sarah Goring, Kabirraaj Toor, Murat Kurt, Andriy Moshyk, Jeroen Jansen
In many areas of oncology, cancer drugs are now associated with long-term survivorship, and mixture cure models (MCMs) are increasingly being used for survival analysis. The objective of this article was to propose a methodology for conducting network meta-analysis (NMA) of MCMs. The method is illustrated through a case study evaluating recurrence-free survival (RFS) with adjuvant therapy for stage III/IV resected melanoma. For the case study, the MCM NMA was conducted by (1) fitting MCMs to each trial included within the network of evidence and (2) incorporating the parameters of the MCMs into a multivariate NMA. Outputs included relative effect estimates for the MCM NMA as well as absolute estimates of survival (RFS), modeled within the Bayesian multivariate NMA by incorporating absolute baseline effects of the reference treatment. The case study is intended to illustrate the MCM NMA methodology and is not meant for clinical interpretation. It demonstrated the feasibility of conducting an MCM NMA and highlighted key issues and considerations when conducting such analyses, including the plausibility of cure, maturity of data, the process for model selection, and the presentation and interpretation of results. MCM NMA provides a method for comparative survival analysis that acknowledges the long-term survival benefit newer treatments may confer on a subset of patients and reflects that survival in extrapolation. In the future, this method may provide an additional metric for comparing treatments that is of value to patients.
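As a minimal sketch of the mixture cure idea, assuming an exponential survival model for the uncured fraction (only one of several possible parametric choices, and not necessarily the one used in the case study):

```python
import numpy as np

def mcm_survival(t, cure_fraction, rate):
    """Mixture cure survival: S(t) = pi + (1 - pi) * exp(-rate * t)."""
    return cure_fraction + (1.0 - cure_fraction) * np.exp(-rate * t)

t = np.linspace(0, 60, 121)                          # months (illustrative grid)
s = mcm_survival(t, cure_fraction=0.35, rate=0.08)   # hypothetical parameter values
# The curve plateaus near the cure fraction as t grows, which is the long-term
# survivorship behaviour the MCM NMA approach is designed to capture.
```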
{"title":"Incorporating the possibility of cure into network meta-analyses: A case study from resected Stage III/IV melanoma.","authors":"Keith Chan, Sarah Goring, Kabirraaj Toor, Murat Kurt, Andriy Moshyk, Jeroen Jansen","doi":"10.1017/rsm.2025.10038","DOIUrl":"10.1017/rsm.2025.10038","url":null,"abstract":"<p><p>In many areas of oncology, cancer drugs are now associated with long-term survivorship and mixture cure models (MCM) are increasingly being used for survival analysis. The objective of this article was to propose a methodology for conducting network meta-analysis (NMA) of MCM. This method was illustrated through a case study evaluating recurrence-free survival (RFS) with adjuvant therapy for stage III/IV resected melanoma. For the case study, the MCM NMA was conducted by: (1) fitting MCMs to each trial included within the network of evidence; and (2) incorporating the parameters of the MCMs into a multivariate NMA. Outputs included relative effect estimates for the MCM NMA as well as absolute estimates of survival (RFS), modeled within the Bayesian multivariate NMA, by incorporating absolute baseline effects of the reference treatment. The case study was intended for illustrative purposes of the MCM NMA methodology and is not meant for clinical interpretation. The case study demonstrated the feasibility of conducting an MCM NMA and highlighted key issues and considerations when conducting such analyses, including plausibility of cure, maturity of data, process for model selection, and the presentation and interpretation of results. MCM NMA provides a method of comparative survival that acknowledges the benefit newer treatments may confer on a subset of patients, resulting in long-term survival and reflection of this survival in extrapolation. In the future, this method may provide an additional metric to compare treatments that is of value to patients.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"17 1","pages":"157-169"},"PeriodicalIF":6.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12823198/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-01 | Epub Date: 2025-09-11 | DOI: 10.1017/rsm.2025.10031
Simona Emilova Doneva, Shirin de Viragh, Hanna Hubarava, Stefan Schandelmaier, Matthias Briel, Benjamin Victor Ineichen
Screening, a labor-intensive aspect of systematic reviews, is increasingly challenging due to the rising volume of scientific publications. Recent advances suggest that generative large language models like generative pre-trained transformer (GPT) could aid this process by classifying references into study types such as randomized controlled trials (RCTs) or animal studies prior to abstract screening. However, it is unknown how these GPT models perform in classifying such scientific study types in the biomedical field. Additionally, their performance has not been directly compared with earlier transformer-based models like bidirectional encoder representations from transformers (BERT). To address this, we developed a human-annotated corpus of 2,645 PubMed titles and abstracts, annotated for 14 study types, including different types of RCTs and animal studies, systematic reviews, study protocols, case reports, and in vitro studies. Using this corpus, we compared the performance of GPT-3.5 and GPT-4 in automatically classifying these study types against established BERT models. Our results show that fine-tuned pretrained BERT models consistently outperformed GPT models, achieving F1-scores above 0.8, compared to approximately 0.6 for GPT models. Advanced prompting strategies did not substantially boost GPT performance. In conclusion, these findings highlight that, even though GPT models benefit from advanced capabilities and extensive training data, their performance in niche tasks like multi-class classification of scientific study types is inferior to smaller fine-tuned models. Nevertheless, the use of automated methods remains promising for reducing the volume of records, making the screening of large reference libraries more feasible. Our corpus is openly available and can be used to develop and evaluate other natural language processing (NLP) approaches.
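The abstract reports F1-scores for a 14-way classification task; a minimal sketch of a macro-averaged F1 on hypothetical labels (whether the authors used macro averaging is an assumption here, and only a few study types are shown for brevity):

```python
from sklearn.metrics import f1_score

# Hypothetical gold-standard and predicted study types for a few references.
y_true = ["rct", "animal", "review", "rct", "case_report", "animal"]
y_pred = ["rct", "review", "review", "rct", "case_report", "animal"]

# Macro averaging weights every study type equally, regardless of how often it occurs.
print(f1_score(y_true, y_pred, average="macro"))
```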
{"title":"StudyTypeTeller-Large language models to automatically classify research study types for systematic reviews.","authors":"Simona Emilova Doneva, Shirin de Viragh, Hanna Hubarava, Stefan Schandelmaier, Matthias Briel, Benjamin Victor Ineichen","doi":"10.1017/rsm.2025.10031","DOIUrl":"10.1017/rsm.2025.10031","url":null,"abstract":"<p><p>screening, a labor-intensive aspect of systematic review, is increasingly challenging due to the rising volume of scientific publications. Recent advances suggest that generative large language models like generative pre-trained transformer (GPT) could aid this process by classifying references into study types such as randomized-controlled trials (RCTs) or animal studies prior to abstract screening. However, it is unknown how these GPT models perform in classifying such scientific study types in the biomedical field. Additionally, their performance has not been directly compared with earlier transformer-based models like bidirectional encoder representations from transformers (BERT). To address this, we developed a human-annotated corpus of 2,645 PubMed titles and abstracts, annotated for 14 study types, including different types of RCTs and animal studies, systematic reviews, study protocols, case reports, as well as in vitro studies. Using this corpus, we compared the performance of GPT-3.5 and GPT-4 in automatically classifying these study types against established BERT models. Our results show that fine-tuned pretrained BERT models consistently outperformed GPT models, achieving F1-scores above 0.8, compared to approximately 0.6 for GPT models. Advanced prompting strategies did not substantially boost GPT performance. In conclusion, these findings highlight that, even though GPT models benefit from advanced capabilities and extensive training data, their performance in niche tasks like scientific multi-class study classification is inferior to smaller fine-tuned models. Nevertheless, the use of automated methods remains promising for reducing the volume of records, making the screening of large reference libraries more feasible. Our corpus is openly available and can be used to harness other natural language processing (NLP) approaches.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 6","pages":"1005-1024"},"PeriodicalIF":6.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657658/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-01 | Epub Date: 2025-06-23 | DOI: 10.1017/rsm.2025.10014
Takehiko Oami, Yohei Okada, Taka-Aki Nakada
Recent studies highlight the potential of large language models (LLMs) in citation screening for systematic reviews; however, the efficiency of individual LLMs for this application remains unclear. This study aimed to compare accuracy, time-related efficiency, cost, and consistency across four LLMs (GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Llama 3.3 70B) for literature screening tasks. The models screened citations for clinical questions from the Japanese Clinical Practice Guidelines for the Management of Sepsis and Septic Shock 2024. Sensitivity and specificity were calculated for each model against conventional citation screening results for qualitative assessment. We also recorded the time and cost of screening and assessed consistency to verify reproducibility. A post hoc analysis explored whether integrating outputs from multiple models could enhance screening accuracy. GPT-4o and Llama 3.3 70B achieved high specificity but lower sensitivity, while Gemini 1.5 Pro and Claude 3.5 Sonnet exhibited higher sensitivity at the cost of lower specificity. Citation screening times and costs varied, with GPT-4o being the fastest and Llama 3.3 70B the most cost-effective. Consistency was comparable among the models. An ensemble approach combining model outputs improved sensitivity but increased the number of false positives, requiring additional review effort. Each model demonstrated distinct strengths, effectively streamlining citation screening by saving time and reducing workload. However, reviewing false positives remains a challenge. Combining models may enhance sensitivity, indicating the potential of LLMs to optimize systematic review workflows.
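A minimal sketch of a union-style ensemble of the kind described above, using hypothetical screening decisions: a citation is retained if any model flags it, so sensitivity can only rise while false positives may accumulate. (The study's actual combination rule is not detailed in the abstract, so this is an assumed illustration.)

```python
# Hypothetical human reference standard and per-model inclusion decisions.
truth   = [True,  True,  False, True,  False]
model_a = [True,  False, False, True,  False]
model_b = [False, True,  False, True,  True]

union = [a or b for a, b in zip(model_a, model_b)]

def sensitivity(pred, ref):
    true_pos = sum(p and r for p, r in zip(pred, ref))
    return true_pos / sum(ref)

# Each model alone recovers 2 of the 3 relevant citations; their union recovers
# all 3, at the cost of the extra false positive contributed by model_b.
print(sensitivity(model_a, truth), sensitivity(model_b, truth), sensitivity(union, truth))
```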
{"title":"Optimal large language models to screen citations for systematic reviews.","authors":"Takehiko Oami, Yohei Okada, Taka-Aki Nakada","doi":"10.1017/rsm.2025.10014","DOIUrl":"10.1017/rsm.2025.10014","url":null,"abstract":"<p><p>Recent studies highlight the potential of large language models (LLMs) in citation screening for systematic reviews; however, the efficiency of individual LLMs for this application remains unclear. This study aimed to compare accuracy, time-related efficiency, cost, and consistency across four LLMs-GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Llama 3.3 70B-for literature screening tasks. The models screened for clinical questions from the Japanese Clinical Practice Guidelines for the Management of Sepsis and Septic Shock 2024. Sensitivity and specificity were calculated for each model based on conventional citation screening results for qualitative assessment. We also recorded the time and cost of screening and assessed consistency to verify reproducibility. A <i>post hoc</i> analysis explored whether integrating outputs from multiple models could enhance screening accuracy. GPT-4o and Llama 3.3 70B achieved high specificity but lower sensitivity, while Gemini 1.5 Pro and Claude 3.5 Sonnet exhibited higher sensitivity at the cost of lower specificity. Citation screening times and costs varied, with GPT-4o being the fastest and Llama 3.3 70B the most cost-effective. Consistency was comparable among the models. An ensemble approach combining model outputs improved sensitivity but increased the number of false positives, requiring additional review effort. Each model demonstrated distinct strengths, effectively streamlining citation screening by saving time and reducing workload. However, reviewing false positives remains a challenge. Combining models may enhance sensitivity, indicating the potential of LLMs to optimize systematic review workflows.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 6","pages":"859-875"},"PeriodicalIF":6.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657656/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Network meta-analysis (NMA) is becoming increasingly important, especially in the field of medicine, as it allows for comparisons across multiple trials with different interventions. For time-to-event data, that is, survival data, traditional NMA based on the proportional hazards (PH) assumption simply synthesizes reported hazard ratios (HRs). Novel methods for NMA based on non-PH assumptions have been proposed and implemented using R software. However, these methods often involve complex methodologies and require advanced programming skills, creating a barrier for many researchers. Therefore, we developed an R Shiny tool, NMAsurv (https://psurvivala.shinyapps.io/NMAsurv/). NMAsurv allows users with little or no background in R to conduct survival-data-based NMA effortlessly. The tool supports functions such as drawing network plots, testing the PH assumption, and building NMA models. Users can input either reconstructed pseudo-individual participant data or aggregated data. NMAsurv offers a user-friendly interface for extracting parameter estimates from various NMA models, including fractional polynomial models, piecewise exponential models, parametric survival models, the Cox PH model, and the generalized gamma model. Additionally, it enables users to create survival and HR plots effortlessly. All operations can be performed through an intuitive "point-and-click" interface. In this study, we introduce all the functionalities and features of NMAsurv and demonstrate its application using a real-world NMA example.
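For orientation, a minimal sketch of one of the model families listed above, a first-order fractional-polynomial log-hazard; NMAsurv's exact parameterization is not described here, so treat the form and values below as assumptions for illustration:

```python
import numpy as np

def fp1_log_hazard(t, b0, b1, power):
    """First-order fractional polynomial: ln h(t) = b0 + b1 * t**power,
    with the usual convention that power 0 means ln(t)."""
    basis = np.log(t) if power == 0 else t ** power
    return b0 + b1 * basis

t = np.linspace(0.5, 36, 72)                                   # months (illustrative)
hazard = np.exp(fp1_log_hazard(t, b0=-3.0, b1=0.4, power=0))   # hypothetical parameters
```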
{"title":"NMAsurv: An R Shiny application for network meta-analysis based on survival data.","authors":"Taihang Shao, Mingye Zhao, Fenghao Shi, Mingjun Rui, Wenxi Tang","doi":"10.1017/rsm.2025.10020","DOIUrl":"10.1017/rsm.2025.10020","url":null,"abstract":"<p><p>Network meta-analysis (NMA) is becoming increasingly important, especially in the field of medicine, as it allows for comparisons across multiple trials with different interventions. For time-to-event data, that is, survival data, traditional NMA based on the proportional hazards (PH) assumption simply synthesizes reported hazard ratios (HRs). Novel methods for NMA based on the non-PH assumption have been proposed and implemented using R software. However, these methods often involve complex methodologies and require advanced programming skills, creating a barrier for many researchers. Therefore, we developed an R Shiny tool, NMAsurv (https://psurvivala.shinyapps.io/NMAsurv/). NMAsurv allows users with little or zero background in R to conduct survival-data-based NMA effortlessly. The tool supports various functions such as drawing network plots, testing the PH assumption, and building NMA models. Users can input either reconstructed pseudo-individual participant data or aggregated data. NMAsurv offers a user-friendly interface for extracting parameter estimations from various NMA models, including fractional polynomial, piecewise exponential models, parametric survival models, Cox PH model, and generalized gamma model. Additionally, it enables users to effortlessly create survival and HR plots. All operations can be performed by an intuitive \"point-and-click\" interface. In this study, we introduce all the functionalities and features of NMAsurv and demonstrate its application using a real-world NMA example.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 6","pages":"1042-1056"},"PeriodicalIF":6.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657653/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-01 | Epub Date: 2025-09-05 | DOI: 10.1017/rsm.2025.10033
Marwin Weber, Simon Lewin, Joerg J Meerpohl, Heather Menzies Munthe-Kaas, Rigmor Berg, Andrew Booth, Claire Glenton, Jane Noyes, Ingrid Toews
Qualitative research addresses important healthcare questions, including patients' experiences with interventions. Qualitative evidence syntheses combine findings from individual studies and are increasingly used to inform health guidelines. However, dissemination bias, the selective non-dissemination of studies or findings, may distort the body of evidence. This study examined reasons for the non-dissemination of qualitative studies. We identified conference abstracts reporting qualitative, health-related studies. We invited authors to answer a survey containing quantitative and qualitative questions. We performed descriptive analyses on the quantitative data and inductive thematic analysis on the qualitative data. Most of the 142 respondents were female, established researchers. About a third reported that their study had not been published in full after their conference presentation. The main reasons were time constraints, career changes, and a lack of interest. Few indicated non-publication due to the nature of the study findings. Decisions not to publish were largely made by author teams. Half of the 72% who published their study reported that all findings were included in the publication. This study highlights researchers' reasons for non-dissemination of qualitative research. One-third of studies presented as conference abstracts remained unpublished, but non-dissemination was rarely linked to the study findings. Further research is needed to understand the systematic non-dissemination of qualitative studies.
{"title":"What happens to qualitative studies initially presented as conference abstracts: A survey among study authors.","authors":"Marwin Weber, Simon Lewin, Joerg J Meerpohl, Heather Menzies Munthe-Kaas, Rigmor Berg, Andrew Booth, Claire Glenton, Jane Noyes, Ingrid Toews","doi":"10.1017/rsm.2025.10033","DOIUrl":"10.1017/rsm.2025.10033","url":null,"abstract":"<p><p>Qualitative research addresses important healthcare questions, including patients' experiences with interventions. Qualitative evidence syntheses combine findings from individual studies and are increasingly used to inform health guidelines. However, dissemination bias-selective non-dissemination of studies or findings-may distort the body of evidence. This study examined reasons for the non-dissemination of qualitative studies. We identified conference abstracts reporting qualitative, health-related studies. We invited authors to answer a survey containing quantitative and qualitative questions. We performed descriptive analyses on the quantitative data and inductive thematic analysis on the qualitative data. Most of the 142 respondents were female, established researchers. About a third reported that their study had not been published in full after their conference presentation. The main reasons were time constraints, career changes, and a lack of interest. Few indicated non-publication due to the nature of the study findings. Decisions not to publish were largely made by author teams. Half of the 72% who published their study reported that all findings were included in the publication. This study highlights researchers' reasons for non-dissemination of qualitative research. One-third of studies presented as conference abstracts remained unpublished, but non-dissemination was rarely linked to the study findings. Further research is needed to understand the systematic non-dissemination of qualitative studies.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 6","pages":"1025-1034"},"PeriodicalIF":6.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657646/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-01 | Epub Date: 2025-08-01 | DOI: 10.1017/rsm.2025.10026
Gerta Rücker, Guido Schwarzer
For network meta-analysis (NMA), we usually assume that the treatment arms are independent within each included trial. This assumption is justified for parallel design trials and leads to a property we call consistency of variances for both multi-arm trials and NMA estimates. However, the assumption is violated for trials with correlated arms, for example, split-body trials. For multi-arm trials with correlated arms, the variance of a contrast is not the sum of the arm-based variances, but comes with a correlation term. This may lead to violations of variance consistency, and the inconsistency of variances may even propagate to the NMA estimates. We explain this using a geometric analogy where three-arm trials correspond to triangles and four-arm trials correspond to tetrahedrons. We also investigate which information has to be extracted for a multi-arm trial with correlated arms and provide an algorithm to analyze NMAs including such trials.
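A minimal numeric sketch of the correlation term described above, with illustrative arm-level variances and a positive within-trial correlation:

```python
import math

v1, v2 = 0.04, 0.05   # arm-level variances (illustrative)
rho = 0.6             # within-trial correlation between arms (e.g., split-body design)

var_parallel   = v1 + v2                                   # independent-arms assumption
var_correlated = v1 + v2 - 2 * rho * math.sqrt(v1 * v2)    # Var(y2 - y1) with correlated arms

# A positive correlation shrinks the contrast variance below the naive sum, so
# treating the arms as independent would misstate the trial's precision.
print(var_parallel, var_correlated)
```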
{"title":"Trials and triangles: Network meta-analysis of multi-arm trials with correlated arms.","authors":"Gerta Rücker, Guido Schwarzer","doi":"10.1017/rsm.2025.10026","DOIUrl":"10.1017/rsm.2025.10026","url":null,"abstract":"<p><p>For network meta-analysis (NMA), we usually assume that the treatment arms are independent within each included trial. This assumption is justified for parallel design trials and leads to a property we call consistency of variances for both multi-arm trials and NMA estimates. However, the assumption is violated for trials with correlated arms, for example, split-body trials. For multi-arm trials with correlated arms, the variance of a contrast is not the sum of the arm-based variances, but comes with a correlation term. This may lead to violations of variance consistency, and the inconsistency of variances may even propagate to the NMA estimates. We explain this using a geometric analogy where three-arm trials correspond to triangles and four-arm trials correspond to tetrahedrons. We also investigate which information has to be extracted for a multi-arm trial with correlated arms and provide an algorithm to analyze NMAs including such trials.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 6","pages":"961-974"},"PeriodicalIF":6.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657662/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-01 | Epub Date: 2025-07-10 | DOI: 10.1017/rsm.2025.10018
Barbara Nussbaumer-Streit, Dominic Ledinger, Christina Kien, Irma Klerings, Emma Persad, Andrea Chapman, Claus Nowak, Arianna Gadinger, Lisa Affengruber, Maureen Smith, Gerald Gartlehner, Ursula Griebler
Background: Involving knowledge users (KUs) such as patients, clinicians, or health policymakers is particularly relevant when conducting rapid reviews (RRs), as RRs should be tailored to decision-makers' needs. However, little is known about how common KU involvement currently is in RRs.
Objectives: We aimed to assess the proportion of recently published RRs (2021 onwards) that reported KU involvement, which groups of KUs were involved in each phase of the RR process and to what extent, and which factors were associated with KU involvement in RRs.
Methods: We conducted a meta-research cross-sectional study. A systematic literature search in Ovid MEDLINE and Epistemonikos in January 2024 identified 2,493 unique records. We dually screened the identified records (partly with assistance from an artificial intelligence (AI)-based application) until we reached the a priori calculated sample size of 104 RRs. We dually extracted data and analyzed it descriptively.
Results: The proportion of RRs that reported KU involvement was 19% (95% confidence interval [CI]: 12%-28%). Most often, KUs were involved during the initial preparation of the RR, the systematic searches, and the interpretation and dissemination of results. Researchers/content experts and public/patient partners were the KU groups most often involved. KU involvement was more common in RRs that focused on patient involvement/shared decision-making, had a published protocol, or were commissioned.
Conclusions: Reporting KU involvement in published RRs is uncommon and often vague. Future research should explore barriers and facilitators for KU involvement and its reporting in RRs. Guidance regarding reporting on KU involvement in RRs is needed.
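The 19% (95% CI 12%-28%) figure in the Results above corresponds to roughly 20 of 104 RRs; a minimal sketch of a Wilson score interval for that proportion (both the count of 20 and the choice of interval method are assumptions, as the abstract does not state them):

```python
import math

count, n, z = 20, 104, 1.96   # assumed count; z for a 95% interval
p = count / n

center = (p + z**2 / (2 * n)) / (1 + z**2 / n)
half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)

# Prints roughly 12.8 and 27.8, in line with the reported 12%-28% interval.
print(round(100 * (center - half), 1), round(100 * (center + half), 1))
```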
{"title":"Knowledge user involvement is still uncommon in published rapid reviews-a meta-research cross-sectional study.","authors":"Barbara Nussbaumer-Streit, Dominic Ledinger, Christina Kien, Irma Klerings, Emma Persad, Andrea Chapman, Claus Nowak, Arianna Gadinger, Lisa Affengruber, Maureen Smith, Gerald Gartlehner, Ursula Griebler","doi":"10.1017/rsm.2025.10018","DOIUrl":"10.1017/rsm.2025.10018","url":null,"abstract":"<p><strong>Background: </strong>Involving knowledge users (KUs) such as patients, clinicians, or health policymakers is particularly relevant when conducting rapid reviews (RRs), as they should be tailored to decision-makers' needs. However, little is known about how common KU involvement currently is in RRs.</p><p><strong>Objectives: </strong>We wanted to assess the proportion of KU involvement reported in recently published RRs (2021 onwards), which groups of KUs were involved in each phase of the RR process, to what extent, and which factors were associated with KU involvement in RRs.</p><p><strong>Methods: </strong>We conducted a meta-research cross-sectional study. A systematic literature search in Ovid MEDLINE and Epistemonikos in January 2024 identified 2,493 unique records. We dually screened the identified records (partly with assistance from an artificial intelligence (AI)-based application) until we reached the a priori calculated sample size of 104 RRs. We dually extracted data and analyzed it descriptively.</p><p><strong>Results: </strong>The proportion of RRs that reported KU involvement was 19% (95% confidence interval [CI]: 12%-28%). Most often, KUs were involved during the initial preparation of the RR, the systematic searches, and the interpretation and dissemination of results. Researchers/content experts and public/patient partners were the KU groups most often involved. KU involvement was more common in RRs focusing on patient involvement/shared decision-making, having a published protocol, and being commissioned.</p><p><strong>Conclusions: </strong>Reporting KU involvement in published RRs is uncommon and often vague. Future research should explore barriers and facilitators for KU involvement and its reporting in RRs. Guidance regarding reporting on KU involvement in RRs is needed.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 6","pages":"876-899"},"PeriodicalIF":6.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657652/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}