Pub Date: 2025-11-01 | Epub Date: 2025-09-11 | DOI: 10.1017/rsm.2025.10031
Simona Emilova Doneva, Shirin de Viragh, Hanna Hubarava, Stefan Schandelmaier, Matthias Briel, Benjamin Victor Ineichen
Screening, a labor-intensive aspect of systematic reviews, is increasingly challenging due to the rising volume of scientific publications. Recent advances suggest that generative large language models such as the generative pre-trained transformer (GPT) could aid this process by classifying references into study types such as randomized controlled trials (RCTs) or animal studies prior to abstract screening. However, it is unknown how well GPT models classify such scientific study types in the biomedical field. Additionally, their performance has not been directly compared with earlier transformer-based models such as bidirectional encoder representations from transformers (BERT). To address this, we developed a human-annotated corpus of 2,645 PubMed titles and abstracts, annotated for 14 study types, including different types of RCTs and animal studies, systematic reviews, study protocols, case reports, and in vitro studies. Using this corpus, we compared the performance of GPT-3.5 and GPT-4 in automatically classifying these study types against established BERT models. Our results show that fine-tuned pretrained BERT models consistently outperformed GPT models, achieving F1-scores above 0.8, compared to approximately 0.6 for GPT models. Advanced prompting strategies did not substantially boost GPT performance. In conclusion, these findings highlight that, even though GPT models benefit from advanced capabilities and extensive training data, their performance in niche tasks such as multi-class classification of scientific study types is inferior to that of smaller fine-tuned models. Nevertheless, automated methods remain promising for reducing the volume of records, making the screening of large reference libraries more feasible. Our corpus is openly available and can be used to harness other natural language processing (NLP) approaches.
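As a rough illustration of how F1-scores are computed for a multi-class task like this, here is a minimal macro-averaged F1 sketch in Python. The labels and predictions are invented toy data, not the paper's corpus, and macro averaging is only one plausible choice; the abstract does not state which averaging scheme was used.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 scores averaged with equal class weight."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy labels standing in for study types (hypothetical, not the paper's data)
y_true = ["rct", "rct", "animal", "review", "animal", "case_report"]
y_pred = ["rct", "animal", "animal", "review", "animal", "rct"]
print(round(macro_f1(y_true, y_pred), 3))  # -> 0.575
```

Macro averaging penalizes poor performance on rare classes equally with common ones, which matters for a 14-class corpus where some study types are scarce.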
StudyTypeTeller: Large language models to automatically classify research study types for systematic reviews. Research Synthesis Methods. 2025;16(6):1005-1024. doi:10.1017/rsm.2025.10031. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657658/pdf/
Pub Date: 2025-11-01 | Epub Date: 2025-06-23 | DOI: 10.1017/rsm.2025.10014
Takehiko Oami, Yohei Okada, Taka-Aki Nakada
Recent studies highlight the potential of large language models (LLMs) in citation screening for systematic reviews; however, the efficiency of individual LLMs for this application remains unclear. This study aimed to compare accuracy, time-related efficiency, cost, and consistency across four LLMs (GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Llama 3.3 70B) for literature screening tasks. The models screened for clinical questions from the Japanese Clinical Practice Guidelines for the Management of Sepsis and Septic Shock 2024. Sensitivity and specificity were calculated for each model based on conventional citation screening results for qualitative assessment. We also recorded the time and cost of screening and assessed consistency to verify reproducibility. A post hoc analysis explored whether integrating outputs from multiple models could enhance screening accuracy. GPT-4o and Llama 3.3 70B achieved high specificity but lower sensitivity, while Gemini 1.5 Pro and Claude 3.5 Sonnet exhibited higher sensitivity at the cost of lower specificity. Citation screening times and costs varied, with GPT-4o being the fastest and Llama 3.3 70B the most cost-effective. Consistency was comparable among the models. An ensemble approach combining model outputs improved sensitivity but increased the number of false positives, requiring additional review effort. Each model demonstrated distinct strengths, effectively streamlining citation screening by saving time and reducing workload. However, reviewing false positives remains a challenge. Combining models may enhance sensitivity, indicating the potential of LLMs to optimize systematic review workflows.
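The sensitivity/specificity comparison and the ensemble idea can be sketched as follows. The record-level decisions are invented toy data, and the union rule ("include if any model includes") is one simple combination strategy; the paper's exact integration method may differ.

```python
def sens_spec(decisions, truth):
    """Sensitivity and specificity of include/exclude decisions vs. the reference screen."""
    tp = sum(d and t for d, t in zip(decisions, truth))
    tn = sum((not d) and (not t) for d, t in zip(decisions, truth))
    fn = sum((not d) and t for d, t in zip(decisions, truth))
    fp = sum(d and (not t) for d, t in zip(decisions, truth))
    return tp / (tp + fn), tn / (tn + fp)

def ensemble_union(*model_decisions):
    """Include a record if ANY model includes it: raises sensitivity,
    but may add false positives that need manual review."""
    return [any(votes) for votes in zip(*model_decisions)]

# Toy decisions for two hypothetical models over 6 records (True = include)
truth   = [True, True, True, False, False, False]
model_a = [True, False, True, False, False, False]  # specific, misses one relevant record
model_b = [True, True, False, True, False, False]   # sensitive, one false positive

combined = ensemble_union(model_a, model_b)
print(sens_spec(model_a, truth))    # model A alone: perfect specificity, imperfect sensitivity
print(sens_spec(combined, truth))   # union: full sensitivity, specificity drops
```

In this toy example the union recovers all three relevant records but flags one extra irrelevant record, mirroring the trade-off the abstract reports.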
Optimal large language models to screen citations for systematic reviews. Research Synthesis Methods. 2025;16(6):859-875. doi:10.1017/rsm.2025.10014. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657656/pdf/
Network meta-analysis (NMA) is becoming increasingly important, especially in the field of medicine, as it allows for comparisons across multiple trials with different interventions. For time-to-event data, that is, survival data, traditional NMA based on the proportional hazards (PH) assumption simply synthesizes reported hazard ratios (HRs). Novel methods for NMA based on the non-PH assumption have been proposed and implemented using R software. However, these methods often involve complex methodologies and require advanced programming skills, creating a barrier for many researchers. Therefore, we developed an R Shiny tool, NMAsurv (https://psurvivala.shinyapps.io/NMAsurv/). NMAsurv allows users with little or no background in R to conduct survival-data-based NMA effortlessly. The tool supports various functions such as drawing network plots, testing the PH assumption, and building NMA models. Users can input either reconstructed pseudo-individual participant data or aggregated data. NMAsurv offers a user-friendly interface for extracting parameter estimates from various NMA models, including fractional polynomial models, piecewise exponential models, parametric survival models, the Cox PH model, and the generalized gamma model. Additionally, it enables users to effortlessly create survival and HR plots. All operations can be performed by an intuitive "point-and-click" interface. In this study, we introduce all the functionalities and features of NMAsurv and demonstrate its application using a real-world NMA example.
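To illustrate why the PH assumption matters for model families like the piecewise exponential models NMAsurv supports, here is a minimal conceptual sketch with hypothetical hazards (NMAsurv itself is an R Shiny tool; this Python snippet is only an illustration of time-varying hazard ratios).

```python
def hazard(t, breakpoints, rates):
    """Piecewise-constant hazard: rates[i] applies from breakpoints[i] onward."""
    for b, r in zip(reversed(breakpoints), reversed(rates)):
        if t >= b:
            return r
    return rates[0]

# Hypothetical two-arm trial, hazards in events per person-year
breaks    = [0.0, 1.0]    # one change-point at t = 1 year
control   = [0.40, 0.40]  # constant hazard
treatment = [0.20, 0.40]  # early benefit that wanes

for t in (0.5, 1.5):
    hr = hazard(t, breaks, treatment) / hazard(t, breaks, control)
    print(f"HR at t={t}: {hr:.2f}")
# The hazard ratio is 0.50 before the change-point and 1.00 after it,
# so a single proportional-hazards HR would misrepresent both periods.
```

This is exactly the situation where synthesizing one reported HR per trial is misleading and non-PH NMA methods are needed.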
Taihang Shao, Mingye Zhao, Fenghao Shi, Mingjun Rui, Wenxi Tang. NMAsurv: An R Shiny application for network meta-analysis based on survival data. Research Synthesis Methods. 2025;16(6):1042-1056. doi:10.1017/rsm.2025.10020. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657653/pdf/
Pub Date: 2025-11-01 | Epub Date: 2025-09-05 | DOI: 10.1017/rsm.2025.10033
Marwin Weber, Simon Lewin, Joerg J Meerpohl, Heather Menzies Munthe-Kaas, Rigmor Berg, Andrew Booth, Claire Glenton, Jane Noyes, Ingrid Toews
Qualitative research addresses important healthcare questions, including patients' experiences with interventions. Qualitative evidence syntheses combine findings from individual studies and are increasingly used to inform health guidelines. However, dissemination bias (the selective non-dissemination of studies or findings) may distort the body of evidence. This study examined reasons for the non-dissemination of qualitative studies. We identified conference abstracts reporting qualitative, health-related studies. We invited authors to answer a survey containing quantitative and qualitative questions. We performed descriptive analyses on the quantitative data and inductive thematic analysis on the qualitative data. Most of the 142 respondents were female, established researchers. About a third reported that their study had not been published in full after their conference presentation. The main reasons were time constraints, career changes, and a lack of interest. Few indicated non-publication due to the nature of the study findings. Decisions not to publish were largely made by author teams. Half of the 72% who published their study reported that all findings were included in the publication. This study highlights researchers' reasons for non-dissemination of qualitative research. One-third of studies presented as conference abstracts remained unpublished, but non-dissemination was rarely linked to the study findings. Further research is needed to understand the systematic non-dissemination of qualitative studies.
What happens to qualitative studies initially presented as conference abstracts: A survey among study authors. Research Synthesis Methods. 2025;16(6):1025-1034. doi:10.1017/rsm.2025.10033. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657646/pdf/
Pub Date: 2025-11-01 | Epub Date: 2025-08-01 | DOI: 10.1017/rsm.2025.10026
Gerta Rücker, Guido Schwarzer
For network meta-analysis (NMA), we usually assume that the treatment arms are independent within each included trial. This assumption is justified for parallel design trials and leads to a property we call consistency of variances for both multi-arm trials and NMA estimates. However, the assumption is violated for trials with correlated arms, for example, split-body trials. For multi-arm trials with correlated arms, the variance of a contrast is not the sum of the arm-based variances, but comes with a correlation term. This may lead to violations of variance consistency, and the inconsistency of variances may even propagate to the NMA estimates. We explain this using a geometric analogy where three-arm trials correspond to triangles and four-arm trials correspond to tetrahedrons. We also investigate which information has to be extracted for a multi-arm trial with correlated arms and provide an algorithm to analyze NMAs including such trials.
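The correlation term described above can be sketched directly: for arms A and B with arm-level variances and correlation rho, Var(A - B) = Var(A) + Var(B) - 2 * rho * sqrt(Var(A) * Var(B)). A minimal illustration with invented numbers (not from the paper):

```python
import math

def contrast_variance(var_a, var_b, correlation=0.0):
    """Variance of the treatment contrast A - B.
    Independent (parallel-design) arms: var_a + var_b.
    Correlated arms (e.g., split-body trials) subtract a covariance term."""
    cov = correlation * math.sqrt(var_a * var_b)
    return var_a + var_b - 2 * cov

# Hypothetical arm-level variances in a three-arm trial
vA, vB = 0.10, 0.15

print(contrast_variance(vA, vB))        # independent arms: sum of arm variances
print(contrast_variance(vA, vB, 0.5))   # positive correlation shrinks the contrast variance
```

With independent arms, contrast variances decompose additively into arm variances, which is what makes the variances of a multi-arm trial mutually consistent; a nonzero correlation breaks that decomposition, matching the violations of variance consistency described in the abstract.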
Trials and triangles: Network meta-analysis of multi-arm trials with correlated arms. Research Synthesis Methods. 2025;16(6):961-974. doi:10.1017/rsm.2025.10026. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657662/pdf/
Pub Date: 2025-11-01 | Epub Date: 2025-07-10 | DOI: 10.1017/rsm.2025.10018
Barbara Nussbaumer-Streit, Dominic Ledinger, Christina Kien, Irma Klerings, Emma Persad, Andrea Chapman, Claus Nowak, Arianna Gadinger, Lisa Affengruber, Maureen Smith, Gerald Gartlehner, Ursula Griebler
Background: Involving knowledge users (KUs) such as patients, clinicians, or health policymakers is particularly relevant when conducting rapid reviews (RRs), as they should be tailored to decision-makers' needs. However, little is known about how common KU involvement currently is in RRs.
Objectives: We aimed to assess the proportion of recently published RRs (2021 onwards) that reported KU involvement; which groups of KUs were involved in each phase of the RR process, and to what extent; and which factors were associated with KU involvement in RRs.
Methods: We conducted a meta-research cross-sectional study. A systematic literature search in Ovid MEDLINE and Epistemonikos in January 2024 identified 2,493 unique records. We dually screened the identified records (partly with assistance from an artificial intelligence (AI)-based application) until we reached the a priori calculated sample size of 104 RRs. We dually extracted data and analyzed it descriptively.
Results: The proportion of RRs that reported KU involvement was 19% (95% confidence interval [CI]: 12%-28%). Most often, KUs were involved during the initial preparation of the RR, the systematic searches, and the interpretation and dissemination of results. Researchers/content experts and public/patient partners were the KU groups most often involved. KU involvement was more common in RRs that focused on patient involvement/shared decision-making, had a published protocol, or were commissioned.
Conclusions: Reporting KU involvement in published RRs is uncommon and often vague. Future research should explore barriers and facilitators for KU involvement and its reporting in RRs. Guidance regarding reporting on KU involvement in RRs is needed.
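A confidence interval like the one reported above can be computed with a Wilson score interval. The counts below (20 of 104) are illustrative, chosen only to match the reported 19% on the stated sample size of 104 RRs; the paper's exact counts and CI method are not given here, which may explain the slight difference from its 12%-28% interval.

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical counts: 20 KU-involving reviews out of 104 sampled RRs
lo, hi = wilson_ci(20, 104)
print(f"{20/104:.0%} (95% CI: {lo:.0%}-{hi:.0%})")  # -> 19% (95% CI: 13%-28%)
```

The Wilson interval is a common choice for proportions at this sample size because, unlike the normal approximation, it behaves sensibly near 0% and 100%.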
Knowledge user involvement is still uncommon in published rapid reviews: a meta-research cross-sectional study. Research Synthesis Methods. 2025;16(6):876-899. doi:10.1017/rsm.2025.10018. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657652/pdf/
Pub Date: 2025-11-01 | Epub Date: 2025-07-22 | DOI: 10.1017/rsm.2025.10025
Isa Spiero, Artuur M Leeuwenberg, Karel G M Moons, Lotty Hooft, Johanna A A Damen
Systematic reviews (SRs) synthesize evidence through a rigorous, labor-intensive, and costly process. To accelerate the title-abstract screening phase of SRs, several artificial intelligence (AI)-based semi-automated screening tools have been developed to reduce workload by prioritizing relevant records. However, their performance is primarily evaluated for SRs of intervention studies, which generally have well-structured abstracts. Here, we evaluate whether screening tool performance is equally effective for SRs of prognosis studies that have larger heterogeneity between abstracts. We conducted retrospective simulations on prognosis and intervention reviews using a screening tool (ASReview). We also evaluated the effects of review scope (i.e., breadth of the research question), number of (relevant) records, and modeling methods within the tool. Performance was assessed in terms of recall (i.e., sensitivity), precision at 95% recall (i.e., positive predictive value at 95% recall), and workload reduction (work saved over sampling at 95% recall [WSS@95%]). The WSS@95% was slightly worse for prognosis reviews (range: 0.324-0.597) than for intervention reviews (range: 0.613-0.895). The precision was higher for prognosis (range: 0.115-0.400) compared to intervention reviews (range: 0.024-0.057). These differences were primarily due to the larger number of relevant records in the prognosis reviews. The modeling methods and the scope of the prognosis review did not significantly impact tool performance. We conclude that the larger abstract heterogeneity of prognosis studies does not substantially affect the effectiveness of screening tools for SRs of prognosis. Further evaluation studies including a standardized evaluation framework are needed to enable prospective decisions on the reliable use of screening tools.
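The two headline metrics, precision at 95% recall and WSS@95%, can be computed from a ranked screening order as follows. The ranking below is invented toy data, and WSS@r is taken here in its usual form (TN + FN)/N - (1 - r); the paper's exact simulation setup in ASReview may differ.

```python
import math

def screening_metrics(ranked_relevant, target_recall=0.95):
    """Precision and work saved over sampling (WSS) at a target recall,
    given booleans ordered by the tool's ranking (True = relevant record)."""
    n = len(ranked_relevant)
    total_rel = sum(ranked_relevant)
    needed = math.ceil(target_recall * total_rel)  # records required to hit target recall
    found = 0
    for screened, rel in enumerate(ranked_relevant, start=1):
        found += rel
        if found >= needed:
            precision = found / screened
            wss = (n - screened) / n - (1 - target_recall)
            return precision, wss
    return None

# Hypothetical ranking of 20 records with 4 relevant ones near the top
ranking = [True, True, False, True, False, True] + [False] * 14
precision, wss = screening_metrics(ranking)
print(precision, wss)
```

Here the reviewer stops after 6 of 20 records, having found all 4 relevant ones, so WSS@95% is 0.65: 70% of the screening is skipped, minus the 5% recall conceded by the 95% target.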
Evaluation of semi-automated record screening methods for systematic reviews of prognosis studies and intervention studies. Research Synthesis Methods. 2025;16(6):975-989. doi:10.1017/rsm.2025.10025. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657655/pdf/
Pub Date: 2025-11-01 | Epub Date: 2025-08-28 | DOI: 10.1017/rsm.2025.10023
Klas Moberg, Carl Gornitzki
Our objective was to evaluate the recall and number needed to read (NNR) of the Cochrane RCT Classifier compared to, and in combination with, established search filters developed for Ovid MEDLINE and Embase.com. A gold standard set of 1,103 randomized controlled trials (RCTs) was created to calculate recall for the Cochrane RCT Classifier in Covidence, the Cochrane sensitivity-maximizing RCT filter in Ovid MEDLINE, and the Cochrane Embase RCT filter for Embase.com. In addition, the classifier and the filters were validated in three case studies using reports from the Swedish Agency for Health Technology Assessment and Assessment of Social Services to assess impact on search results and NNR. The Cochrane RCT Classifier had the highest recall, at 99.64%, followed by the Cochrane sensitivity-maximizing RCT filter in Ovid MEDLINE at 98.73% and the Cochrane Embase RCT filter at 98.46%. However, the Cochrane RCT Classifier had a higher NNR than the RCT filters in all case studies. Combining the RCT filters with the Cochrane RCT Classifier reduced NNR compared to using the RCT filters alone while achieving a recall of 98.46% for the Ovid MEDLINE/RCT Classifier combination and 98.28% for the Embase/RCT Classifier combination. In conclusion, we found that the Cochrane RCT Classifier in Covidence has a higher recall than established search filters but also a higher NNR. Thus, using the Cochrane RCT Classifier instead of current state-of-the-art RCT filters would lead to an increased workload in the screening process. A viable option with a lower NNR than RCT filters, at the cost of a slight decrease in recall, is to combine the Cochrane RCT Classifier with RCT filters in database searches.
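Recall, NNR, and the effect of requiring both the filter and the classifier to flag a record can be sketched as follows. The record-level decisions are toy data, and NNR is computed as records flagged per relevant record found, i.e., 1/precision.

```python
def recall_and_nnr(flagged, truth):
    """Recall against a gold standard, and number needed to read:
    records screened per relevant record retrieved (1 / precision)."""
    tp = sum(f and t for f, t in zip(flagged, truth))
    recall = tp / sum(truth)
    nnr = sum(flagged) / tp
    return recall, nnr

# Toy data: 10 records, of which 4 are true RCTs
truth      = [True, True, True, True, False, False, False, False, False, False]
filter_hit = [True, True, True, True, True,  True,  False, False, False, False]
classifier = [True, True, True, True, True,  False, True,  False, False, False]

combined = [f and c for f, c in zip(filter_hit, classifier)]  # require both to flag
print(recall_and_nnr(filter_hit, truth))
print(recall_and_nnr(combined, truth))  # fewer records to read; same recall in this toy case
```

In this toy example the combination keeps recall intact while trimming the screening load; in the study, the real combination traded a slight recall loss for a lower NNR.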
Combining search filters for randomized controlled trials with the Cochrane RCT Classifier in Covidence: a methodological validation study.
Research Synthesis Methods, 16(6), 953-960.
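Recall and NNR as used in this study are simple ratios over the gold-standard set. A minimal sketch (function and variable names are my own, not from the study; the retrieved counts are the ones implied by the reported recall percentages against the 1,103-RCT gold standard):

```python
def recall(relevant_retrieved: int, relevant_total: int) -> float:
    """Proportion of gold-standard records that a filter retrieves."""
    return relevant_retrieved / relevant_total

def nnr(records_screened: int, relevant_retrieved: int) -> float:
    """Number needed to read: records screened per relevant record found."""
    return records_screened / relevant_retrieved

# Gold standard: 1,103 RCTs. Retrieved counts implied by the reported recall:
print(round(recall(1099, 1103) * 100, 2))  # 99.64 - Cochrane RCT Classifier
print(round(recall(1089, 1103) * 100, 2))  # 98.73 - Ovid MEDLINE RCT filter
print(round(recall(1086, 1103) * 100, 2))  # 98.46 - Embase RCT filter
```

The trade-off the study reports is visible directly in these two metrics: a filter can raise recall (fewer missed RCTs) while worsening NNR (more non-relevant records to screen per hit).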
In meta-analyses of survival rates, precision information (i.e., standard errors (SEs) or confidence intervals) is often missing from clinical studies. In current practice, such studies are often excluded from the synthesis analyses. However, the naïve deletion of these incomplete data can produce serious biases and loss of precision in pooled estimators. To address these issues, we developed a simple but effective method to impute precision information using commonly available statistics from individual studies, such as sample size, number of events, and risk-set size at a time point of interest. By applying this new method, we can effectively circumvent the deletion of incomplete data and the resulting biases and losses of precision. In extensive simulation studies, the developed method markedly improved the accuracy and precision of the pooled estimators compared to naïve analyses that delete studies with missing precision. Furthermore, the performance of the proposed method was not significantly inferior to the ideal case in which no precision information was missing. However, for studies for which the risk-set size at the time of interest was not available, the proposed method runs the risk of overestimating the SE. Although the proposed method is a single-imputation method, the simulations show no underestimation bias in the SE, even though the method does not account for the uncertainty of the missing values. To demonstrate its robustness, the proposed method was applied in a systematic review of radiotherapy data. An R package was developed to implement the proposed procedure.
Simple imputation method for meta-analysis of survival rates when precision information is missing.
Kazushi Maruo, Yusuke Yamaguchi, Ryota Ishii, Hisashi Noma, Masahiko Gosho. DOI: 10.1017/rsm.2025.10024
Research Synthesis Methods, 16(6), 937-952.
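The abstract does not state the imputation formula itself, so the following is an illustration only: two textbook approximations for the SE of a survival proportion S(t) that use exactly the inputs the abstract mentions (risk-set size, or sample size as a fallback). Peto's approximation is my assumption for illustration, not necessarily the authors' formula.

```python
import math

def impute_se_peto(surv: float, n_risk: int) -> float:
    """Peto's approximation: SE(S(t)) = S(t) * sqrt((1 - S(t)) / n_risk),
    using the risk-set size at the time point of interest."""
    return surv * math.sqrt((1.0 - surv) / n_risk)

def impute_se_binomial(surv: float, n: int) -> float:
    """Naive binomial fallback using total sample size when the risk-set
    size is unavailable (the setting in which the abstract warns the
    imputed SE may be overestimated)."""
    return math.sqrt(surv * (1.0 - surv) / n)

# Hypothetical study: survival 0.8 at the time of interest, 50 at risk.
print(round(impute_se_peto(0.8, 50), 4))
print(round(impute_se_binomial(0.8, 100), 4))
```

With an SE imputed for each otherwise-incomplete study, the study can enter a standard inverse-variance meta-analysis instead of being deleted.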
Pub Date: 2025-11-01. Epub Date: 2025-10-10. DOI: 10.1017/rsm.2025.10034
Viet-Thi Tran, Carolina Grana Possamai, Isabelle Boutron, Philippe Ravaud
A critical step in systematic reviews is the definition of a search strategy, with keywords and Boolean logic, to filter electronic databases. We hypothesize that large language models (LLMs) can be used to screen articles in electronic databases as an alternative to search equations. To investigate this, we compared two methods of identifying randomized controlled trials (RCTs) in electronic databases: filtering databases using the Cochrane highly sensitive search, and assessment by an LLM. We retrieved studies indexed in PubMed with a publication date between September 1 and September 30, 2024 using the sole keyword "diabetes." We compared the performance of the Cochrane highly sensitive search and the assessment of all titles and abstracts extracted directly from the database by GPT-4o-mini at identifying RCTs. The reference standard was the manual screening of retrieved articles by two independent reviewers. The search retrieved 6,377 records, of which 210 (3.5%) were primary reports of RCTs. The Cochrane highly sensitive search filtered 2,197 records and missed one RCT (sensitivity 99.5%, 95% CI 97.4% to 100%; specificity 67.8%, 95% CI 66.6% to 68.9%). Assessment of all titles and abstracts from the electronic database by GPT filtered 1,080 records and included all 210 primary reports of RCTs (sensitivity 100%, 95% CI 98.3% to 100%; specificity 85.9%, 95% CI 85.0% to 86.8%). LLMs can screen all articles in electronic databases to identify RCTs as an alternative to the Cochrane highly sensitive search. This calls for the evaluation of LLMs as an alternative to rigid search strategies.
Using large language models to directly screen electronic databases as an alternative to traditional search strategies such as the Cochrane highly sensitive search for filtering randomized controlled trials in systematic reviews.
Research Synthesis Methods, 16(6), 1035-1041.
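The reported sensitivity and specificity point estimates follow arithmetically from the screening counts in the abstract (6,377 records, 210 true RCTs, and the number of records each method kept). A small sketch (function and variable names are my own) that reproduces them:

```python
def sens_spec(total: int, flagged: int, relevant: int,
              relevant_flagged: int) -> tuple[float, float]:
    """Sensitivity and specificity of a screening filter.
    total: records retrieved; flagged: records the filter kept;
    relevant: true RCTs in the set; relevant_flagged: true RCTs kept."""
    tp = relevant_flagged
    fn = relevant - relevant_flagged
    fp = flagged - relevant_flagged
    tn = (total - relevant) - fp
    return tp / (tp + fn), tn / (tn + fp)

# Cochrane highly sensitive search: kept 2,197 records, missed one RCT.
print([round(x * 100, 1) for x in sens_spec(6377, 2197, 210, 209)])  # [99.5, 67.8]
# GPT-4o-mini: kept 1,080 records, included all 210 RCTs.
print([round(x * 100, 1) for x in sens_spec(6377, 1080, 210, 210)])  # [100.0, 85.9]
```

The comparison in the abstract is visible here: the LLM kept roughly half as many records as the Boolean filter (higher specificity) without losing any RCTs.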