Pub Date: 2025-03-01 | Epub Date: 2025-03-21 | DOI: 10.1017/rsm.2025.3
Xuan Qin, Minghong Yao, Xiaochao Luo, Jiali Liu, Yu Ma, Yanmei Liu, Hao Li, Ke Deng, Kang Zou, Ling Li, Xin Sun
Machine learning (ML) models have been developed to identify randomised controlled trials (RCTs) to accelerate systematic reviews (SRs). However, their use has been limited due to concerns about their performance and practical benefits. We developed a high-recall ensemble learning model using Cochrane RCT data to enhance the identification of RCTs for rapid title and abstract screening in SRs and evaluated the model externally with our annotated RCT datasets. Additionally, we assessed the practical impact in terms of labour time savings and recall improvement under two scenarios: ML-assisted double screening (where ML and one reviewer screened all citations in parallel) and ML-assisted stepwise screening (where ML flagged all potential RCTs, and at least two reviewers subsequently filtered the flagged citations). Our model achieved twice the precision of the existing SVM model while maintaining a recall of 0.99 in both internal and external tests. In a practical evaluation with ML-assisted double screening, our model led to significant labour time savings (average 45.4%) and improved recall (average 0.998 compared to 0.919 for a single reviewer). In ML-assisted stepwise screening, the model performed similarly to standard manual screening but with average labour time savings of 74.4%. In conclusion, compared with existing methods, the proposed model can reduce workload while maintaining comparable recall when identifying RCTs during the title and abstract screening stages, thereby accelerating SRs. We propose practical recommendations to effectively apply ML-assisted manual screening when conducting SRs, depending on reviewer availability (ML-assisted double screening) or time constraints (ML-assisted stepwise screening).
{"title":"Machine learning for identifying randomised controlled trials when conducting systematic reviews: Development and evaluation of its impact on practice.","authors":"Xuan Qin, Minghong Yao, Xiaochao Luo, Jiali Liu, Yu Ma, Yanmei Liu, Hao Li, Ke Deng, Kang Zou, Ling Li, Xin Sun","doi":"10.1017/rsm.2025.3","DOIUrl":"10.1017/rsm.2025.3","url":null,"abstract":"<p><p>Machine learning (ML) models have been developed to identify randomised controlled trials (RCTs) to accelerate systematic reviews (SRs). However, their use has been limited due to concerns about their performance and practical benefits. We developed a high-recall ensemble learning model using Cochrane RCT data to enhance the identification of RCTs for rapid title and abstract screening in SRs and evaluated the model externally with our annotated RCT datasets. Additionally, we assessed the practical impact in terms of labour time savings and recall improvement under two scenarios: ML-assisted double screening (where ML and one reviewer screened all citations in parallel) and ML-assisted stepwise screening (where ML flagged all potential RCTs, and at least two reviewers subsequently filtered the flagged citations). Our model achieved twice the precision compared to the existing SVM model while maintaining a recall of 0.99 in both internal and external tests. In a practical evaluation with ML-assisted double screening, our model led to significant labour time savings (average 45.4%) and improved recall (average 0.998 compared to 0.919 for a single reviewer). In ML-assisted stepwise screening, the model performed similarly to standard manual screening but with average labour time savings of 74.4%. In conclusion, compared with existing methods, the proposed model can reduce workload while maintaining comparable recall when identifying RCTs during the title and abstract screening stages, thereby accelerating SRs. We propose practical recommendations to effectively apply ML-assisted manual screening when conducting SRs, depending on reviewer availability (ML-assisted double screening) or time constraints (ML-assisted stepwise screening).</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 2","pages":"350-363"},"PeriodicalIF":6.1,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12527483/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-01 | Epub Date: 2025-03-10 | DOI: 10.1017/rsm.2025.2
David M Phillippo, Antonio Remiro-Azócar, Anna Heath, Gianluca Baio, Sofia Dias, A E Ades, Nicky J Welton
Effect modification occurs when a covariate alters the relative effectiveness of treatment compared to control. It is widely understood that, when effect modification is present, treatment recommendations may vary by population and by subgroups within the population. Population-adjustment methods are increasingly used to adjust for differences in effect modifiers between study populations and to produce population-adjusted estimates in a relevant target population for decision-making. It is also widely understood that marginal and conditional estimands for non-collapsible effect measures, such as odds ratios or hazard ratios, do not in general coincide even without effect modification. However, the consequences of both non-collapsibility and effect modification together are little-discussed in the literature. In this article, we set out the definitions of conditional and marginal estimands, illustrate their properties when effect modification is present, and discuss the implications for decision-making. In particular, we show that effect modification can result in conflicting treatment rankings between conditional and marginal estimates. This is because conditional and marginal estimands correspond to different decision questions that are no longer aligned when effect modification is present. For time-to-event outcomes, the presence of covariates implies that marginal hazard ratios are time-varying, and effect modification can cause marginal hazard curves to cross. We conclude with practical recommendations for decision-making in the presence of effect modification, based on pragmatic comparisons of both conditional and marginal estimates in the decision target population. Currently, multilevel network meta-regression is the only population-adjustment method capable of producing both conditional and marginal estimates, in any decision target population.
{"title":"Effect modification and non-collapsibility together may lead to conflicting treatment decisions: A review of marginal and conditional estimands and recommendations for decision-making.","authors":"David M Phillippo, Antonio Remiro-Azócar, Anna Heath, Gianluca Baio, Sofia Dias, A E Ades, Nicky J Welton","doi":"10.1017/rsm.2025.2","DOIUrl":"10.1017/rsm.2025.2","url":null,"abstract":"<p><p>Effect modification occurs when a covariate alters the relative effectiveness of treatment compared to control. It is widely understood that, when effect modification is present, treatment recommendations may vary by population and by subgroups within the population. Population-adjustment methods are increasingly used to adjust for differences in effect modifiers between study populations and to produce population-adjusted estimates in a relevant target population for decision-making. It is also widely understood that marginal and conditional estimands for non-collapsible effect measures, such as odds ratios or hazard ratios, do not in general coincide even without effect modification. However, the consequences of both non-collapsibility and effect modification together are little-discussed in the literature.In this article, we set out the definitions of conditional and marginal estimands, illustrate their properties when effect modification is present, and discuss the implications for decision-making. In particular, we show that effect modification can result in conflicting treatment rankings between conditional and marginal estimates. This is because conditional and marginal estimands correspond to different decision questions that are no longer aligned when effect modification is present. For time-to-event outcomes, the presence of covariates implies that marginal hazard ratios are time-varying, and effect modification can cause marginal hazard curves to cross. We conclude with practical recommendations for decision-making in the presence of effect modification, based on pragmatic comparisons of both conditional and marginal estimates in the decision target population. Currently, multilevel network meta-regression is the only population-adjustment method capable of producing both conditional and marginal estimates, in any decision target population.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 2","pages":"323-349"},"PeriodicalIF":6.1,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12527544/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CausalMetaR: An R package for performing causally interpretable meta-analyses - ERRATUM.","authors":"Guanbo Wang, Sean McGrath, Yi Lian","doi":"10.1017/rsm.2025.22","DOIUrl":"10.1017/rsm.2025.22","url":null,"abstract":"","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 2","pages":"441"},"PeriodicalIF":6.1,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12527490/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-01 | Epub Date: 2025-03-10 | DOI: 10.1017/rsm.2025.4
Kaitlyn G Fitzgerald, David Khella, Avery Charles, Elizabeth Tipton
Results of meta-analyses are of interest not only to researchers but often to policy-makers and other decision-makers (e.g., in education and medicine), and visualizations play an important role in communicating data and statistical evidence to the broader public. Therefore, the potential audience of meta-analytic visualizations is broad. However, the most common meta-analytic visualization - the forest plot - uses non-optimal design principles that do not align with data visualization best practices and relies on statistical knowledge and conventions not likely to be familiar to a broad audience. Previously, the Meta-Analytic Rain Cloud (MARC) plot has been shown to be an effective alternative to a forest plot when communicating the results of a small meta-analysis to education practitioners. However, the original MARC plot design was not well-suited for meta-analyses with large numbers of effect sizes as is common across the social sciences. This paper presents an extension of the MARC plot, intended for effective communication of moderate to large meta-analyses (k = 10, 20, 50, 100 studies). We discuss the design principles of the MARC plot, grounded in the data visualization and cognitive science literature. We then present the methods and results of a randomized survey experiment to evaluate the revised MARC plot in comparison to the original MARC plot, the forest plot, and a bar plot. We find that the revised MARC plot is more effective for communicating moderate to large meta-analyses to non-research audiences, offering a 0.30, 0.34, and 1.07 standard deviation improvement in chart users' scores compared to the original MARC plot, forest plot, and bar plot, respectively.
{"title":"Meta-analytic rain cloud plots: Improving evidence communication through data visualization design principles.","authors":"Kaitlyn G Fitzgerald, David Khella, Avery Charles, Elizabeth Tipton","doi":"10.1017/rsm.2025.4","DOIUrl":"10.1017/rsm.2025.4","url":null,"abstract":"<p><p>Results of meta-analyses are of interest not only to researchers but often to policy-makers and other decision-makers (e.g., in education and medicine), and visualizations play an important role in communicating data and statistical evidence to the broader public. Therefore, the potential audience of meta-analytic visualizations is broad. However, the most common meta-analytic visualization - the forest plot - uses non-optimal design principles that do not align with data visualization best practices and relies on statistical knowledge and conventions not likely to be familiar to a broad audience. Previously, the Meta-Analytic Rain Cloud (MARC) plot has been shown to be an effective alternative to a forest plot when communicating the results of a small meta-analysis to education practitioners. However, the original MARC plot design was not well-suited for meta-analyses with large numbers of effect sizes as is common across the social sciences. This paper presents an extension of the MARC plot, intended for effective communication of moderate to large meta-analyses (<i>k</i> = 10, 20, 50, 100 studies). We discuss the design principles of the MARC plot, grounded in the data visualization and cognitive science literature. We then present the methods and results of a randomized survey experiment to evaluate the revised MARC plot in comparison to the original MARC plot, the forest plot, and a bar plot. We find that the revised MARC plot is more effective for communicating moderate to large meta-analyses to non-research audiences, offering a 0.30, 0.34, and 1.07 standard deviation improvement in chart users' scores compared to the original MARC plot, forest plot, and bar plot, respectively.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 2","pages":"364-382"},"PeriodicalIF":6.1,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12527513/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-01 | Epub Date: 2025-03-10 | DOI: 10.1017/rsm.2024.17
Georgios F Nikolaidis, Beth Woods, Stephen Palmer, Sylwia Bujkiewicz, Marta O Soares
Limited evidence on relative effectiveness is common in Health Technology Assessment (HTA), often due to sparse evidence on the population of interest or study-design constraints. When evidence directly relating to the policy decision is limited, the evidence base could be extended to incorporate indirectly related evidence. For instance, a sparse evidence base in children could borrow strength from evidence in adults to improve estimation and reduce uncertainty. In HTA, indirect evidence has typically been either disregarded ('splitting'; no information-sharing) or included without considering any differences ('lumping'; full information-sharing). However, sophisticated methods that impose moderate degrees of information-sharing have been proposed. We describe and implement multiple information-sharing methods in a case-study evaluating the effectiveness, cost-effectiveness and value of further research of intravenous immunoglobulin for severe sepsis and septic shock. We also provide metrics to determine the degree of information-sharing. Results indicate that method choice can have a significant impact. Across information-sharing models, odds ratio estimates ranged between 0.55 and 0.90 and incremental cost-effectiveness ratios between £16,000 and £52,000 per quality-adjusted life year gained. The need for a future trial also differed by information-sharing model. Heterogeneity in the indirect evidence should also be carefully considered, as it may significantly impact estimates. We conclude that when indirect evidence is relevant to an assessment of effectiveness, the full range of information-sharing methods should be considered. The final selection should be based on a deliberative process that considers not only the plausibility of the methods' assumptions but also the imposed degree of information-sharing.
{"title":"Methods for information-sharing in network meta-analysis: Implications for inference and policy.","authors":"Georgios F Nikolaidis, Beth Woods, Stephen Palmer, Sylwia Bujkiewicz, Marta O Soares","doi":"10.1017/rsm.2024.17","DOIUrl":"10.1017/rsm.2024.17","url":null,"abstract":"<p><p>Limited evidence on relative effectiveness is common in Health Technology Assessment (HTA), often due to sparse evidence on the population of interest or study-design constraints. When evidence directly relating to the policy decision is limited, the evidence base could be extended to incorporate indirectly related evidence. For instance, a sparse evidence base in children could borrow strength from evidence in adults to improve estimation and reduce uncertainty. In HTA, indirect evidence has typically been either disregarded ('splitting'; no information-sharing) or included without considering any differences ('lumping'; full information-sharing). However, sophisticated methods that impose moderate degrees of information-sharing have been proposed. We describe and implement multiple information-sharing methods in a case-study evaluating the effectiveness, cost-effectiveness and value of further research of intravenous immunoglobulin for severe sepsis and septic shock. We also provide metrics to determine the degree of information-sharing. Results indicate that method choice can have significant impact. Across information-sharing models, odds ratio estimates ranged between 0.55 and 0.90 and incremental cost-effectiveness ratios between £16,000-52,000 per quality-adjusted life year gained. The need for a future trial also differed by information-sharing model. Heterogeneity in the indirect evidence should also be carefully considered, as it may significantly impact estimates. We conclude that when indirect evidence is relevant to an assessment of effectiveness, the full range of information-sharing methods should be considered. The final selection should be based on a deliberative process that considers not only the plausibility of the methods' assumptions but also the imposed degree of information-sharing.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 2","pages":"291-307"},"PeriodicalIF":6.1,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12527489/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-01 | Epub Date: 2025-04-01 | DOI: 10.1017/rsm.2024.9
Yanfei Li, Elizabeth Ghogomu, Xu Hui, E Fenfen, Fiona Campbell, Hanan Khalil, Xiuxia Li, Marie Gaarder, Promise M Nduku, Howard White, Liangying Hou, Nan Chen, Shenggang Xu, Ning Ma, Xiaoye Hu, Xian Liu, Vivian Welch, Kehu Yang
Mapping reviews (MRs) are crucial for identifying research gaps and enhancing evidence utilization. Despite their increasing use in health and social sciences, inconsistencies persist in both their conceptualization and reporting. This study aims to clarify the conceptual framework and gather reporting items from existing guidance and methodological studies. A comprehensive search was conducted across nine databases and 11 institutional websites, including documents up to January 2024. A total of 68 documents were included, addressing 24 MR terms and 55 definitions, with 39 documents discussing distinctions and overlaps among these terms. From the documents included, 28 reporting items were identified, covering all the steps of the process. Seven documents mentioned reporting on the title, four on the abstract, and 14 on the background. Ten methods-related items appeared in 56 documents, with the median number of documents supporting each item being 34 (interquartile range [IQR]: 27, 39). Four results-related items were mentioned in 18 documents (median: 14.5, IQR: 11.5, 16), and four discussion-related items appeared in 25 documents (median: 5.5, IQR: 3, 13). There was very little guidance about reporting conclusions, acknowledgments, author contributions, declarations of interest, and funding sources. This study proposes a draft 28-item reporting checklist for MRs and has identified terminologies and concepts used to describe MRs. These findings will first be used to inform a Delphi consensus process to develop reporting guidelines for MRs. Additionally, the checklist and definitions could be used to guide researchers in reporting high-quality MRs.
{"title":"Key concepts and reporting recommendations for mapping reviews: A scoping review of 68 guidance and methodological studies.","authors":"Yanfei Li, Elizabeth Ghogomu, Xu Hui, E Fenfen, Fiona Campbell, Hanan Khalil, Xiuxia Li, Marie Gaarder, Promise M Nduku, Howard White, Liangying Hou, Nan Chen, Shenggang Xu, Ning Ma, Xiaoye Hu, Xian Liu, Vivian Welch, Kehu Yang","doi":"10.1017/rsm.2024.9","DOIUrl":"10.1017/rsm.2024.9","url":null,"abstract":"<p><p>Mapping reviews (MRs) are crucial for identifying research gaps and enhancing evidence utilization. Despite their increasing use in health and social sciences, inconsistencies persist in both their conceptualization and reporting. This study aims to clarify the conceptual framework and gather reporting items from existing guidance and methodological studies. A comprehensive search was conducted across nine databases and 11 institutional websites, including documents up to January 2024. A total of 68 documents were included, addressing 24 MR terms and 55 definitions, with 39 documents discussing distinctions and overlaps among these terms. From the documents included, 28 reporting items were identified, covering all the steps of the process. Seven documents mentioned reporting on the title, four on the abstract, and 14 on the background. Ten methods-related items appeared in 56 documents, with the median number of documents supporting each item being 34 (interquartile range [IQR]: 27, 39). Four results-related items were mentioned in 18 documents (median: 14.5, IQR: 11.5, 16), and four discussion-related items appeared in 25 documents (median: 5.5, IQR: 3, 13). There was very little guidance about reporting conclusions, acknowledgments, author contributions, declarations of interest, and funding sources. This study proposes a draft 28-item reporting checklist for MRs and has identified terminologies and concepts used to describe MRs. These findings will first be used to inform a Delphi consensus process to develop reporting guidelines for MRs. Additionally, the checklist and definitions could be used to guide researchers in reporting high-quality MRs.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 1","pages":"157-174"},"PeriodicalIF":6.1,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12631146/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-01 | Epub Date: 2025-03-12 | DOI: 10.1017/rsm.2024.2
Maya B Mathur
In small meta-analyses (e.g., up to 20 studies), the best-performing frequentist methods can yield very wide confidence intervals for the meta-analytic mean, as well as biased and imprecise estimates of the heterogeneity. We investigate the frequentist performance of alternative Bayesian methods that use the invariant Jeffreys prior. This prior has the usual Bayesian motivation, but also has a purely frequentist motivation: the resulting posterior modes correspond to the established Firth bias correction of the maximum likelihood estimator. We consider two forms of the Jeffreys prior for random-effects meta-analysis: the previously established "Jeffreys1" prior treats the heterogeneity as a nuisance parameter, whereas the "Jeffreys2" prior treats both the mean and the heterogeneity as estimands of interest. In a large simulation study, we assess the performance of both Jeffreys priors, considering different types of Bayesian estimates and intervals. We assess point and interval estimation for both the mean and the heterogeneity parameters, comparing to the best-performing frequentist methods. For small meta-analyses of binary outcomes, the Jeffreys2 prior may offer advantages over standard frequentist methods for point and interval estimation of the mean parameter. In these cases, Jeffreys2 can substantially improve efficiency while more often showing nominal frequentist coverage. However, for small meta-analyses of continuous outcomes, standard frequentist methods seem to remain the best choices. The best-performing method for estimating the heterogeneity varied according to the heterogeneity itself. Röver & Friede's R package bayesmeta implements both Jeffreys priors. We also generalize the Jeffreys2 prior to the case of meta-regression.
{"title":"Meta-analysis with Jeffreys priors: Empirical frequentist properties.","authors":"Maya B Mathur","doi":"10.1017/rsm.2024.2","DOIUrl":"10.1017/rsm.2024.2","url":null,"abstract":"<p><p>In small meta-analyses (e.g., up to 20 studies), the best-performing frequentist methods can yield very wide confidence intervals for the meta-analytic mean, as well as biased and imprecise estimates of the heterogeneity. We investigate the frequentist performance of alternative Bayesian methods that use the invariant Jeffreys prior. This prior has the usual Bayesian motivation, but also has a purely frequentist motivation: the resulting posterior modes correspond to the established Firth bias correction of the maximum likelihood estimator. We consider two forms of the Jeffreys prior for random-effects meta-analysis: the previously established \"Jeffreys1\" prior treats the heterogeneity as a nuisance parameter, whereas the \"Jeffreys2\" prior treats both the mean and the heterogeneity as estimands of interest. In a large simulation study, we assess the performance of both Jeffreys priors, considering different types of Bayesian estimates and intervals. We assess point and interval estimation for both the mean and the heterogeneity parameters, comparing to the best-performing frequentist methods. For small meta-analyses of binary outcomes, the Jeffreys2 prior may offer advantages over standard frequentist methods for point and interval estimation of the mean parameter. In these cases, Jeffreys2 can substantially improve efficiency while more often showing nominal frequentist coverage. However, for small meta-analyses of continuous outcomes, standard frequentist methods seem to remain the best choices. The best-performing method for estimating the heterogeneity varied according to the heterogeneity itself. Röver & Friede's R package bayesmeta implements both Jeffreys priors. We also generalize the Jeffreys2 prior to the case of meta-regression.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 1","pages":"87-122"},"PeriodicalIF":6.1,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12621536/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-01 | Epub Date: 2025-03-10 | DOI: 10.1017/rsm.2024.16
Farhan Ali, Amanda Swee-Ching Tan, Serena Jun-Wei Wang
Systematic reviews play important roles, but manual screening efforts can be time-consuming given a growing literature. There is a need to use and evaluate automated strategies to accelerate systematic reviews. Here, we comprehensively tested machine learning (ML) models from classical and deep learning model families. We also assessed the performance of prompt engineering via few-shot learning of GPT-3.5 and GPT-4 large language models (LLMs). We further attempted to understand when ML models can help automate screening. These ML models were applied to actual datasets of systematic reviews in education. Results showed that the performance of classical and deep ML models varied widely across datasets, ranging from 1.2 to 75.6% of work saved at 95% recall. LLM prompt engineering produced similarly wide performance variation. We searched for various indicators of whether and how ML screening can help. We discovered that the separability of clusters of relevant versus irrelevant articles in high-dimensional embedding space can strongly predict whether ML screening can help (overall R = 0.81). This simple and generalizable heuristic applied well across datasets and different ML model families. In conclusion, ML screening performance varies tremendously, but researchers and software developers can consider using our cluster separability heuristic in various ways in an ML-assisted screening pipeline.
{"title":"Can machine learning help accelerate article screening for systematic reviews? Yes, when article separability in embedding space is high.","authors":"Farhan Ali, Amanda Swee-Ching Tan, Serena Jun-Wei Wang","doi":"10.1017/rsm.2024.16","DOIUrl":"10.1017/rsm.2024.16","url":null,"abstract":"<p><p>Systematic reviews play important roles but manual efforts can be time-consuming given a growing literature. There is a need to use and evaluate automated strategies to accelerate systematic reviews. Here, we comprehensively tested machine learning (ML) models from classical and deep learning model families. We also assessed the performance of prompt engineering via few-shot learning of GPT-3.5 and GPT-4 large language models (LLMs). We further attempted to understand when ML models can help automate screening. These ML models were applied to actual datasets of systematic reviews in education. Results showed that the performance of classical and deep ML models varied widely across datasets, ranging from 1.2 to 75.6% of work saved at 95% recall. LLM prompt engineering produced similarly wide performance variation. We searched for various indicators of whether and how ML screening can help. We discovered that the separability of clusters of relevant versus irrelevant articles in high-dimensional embedding space can strongly predict whether ML screening can help (overall <i>R</i> = 0.81). This simple and generalizable heuristic applied well across datasets and different ML model families. In conclusion, ML screening performance varies tremendously, but researchers and software developers can consider using our cluster separability heuristic in various ways in an ML-assisted screening pipeline.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 1","pages":"194-210"},"PeriodicalIF":6.1,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12621506/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-01 | Epub Date: 2025-03-07 | DOI: 10.1017/rsm.2024.6
Malgorzata Lagisz, Yefeng Yang, Sarah Young, Shinichi Nakagawa
Systematic searches of published literature are a vital component of systematic reviews. When search strings are not "sensitive," they may miss many relevant studies, limiting, or even biasing, the range of evidence available for synthesis. Concerningly, conducting and reporting evaluations (validations) of the sensitivity of the search strings used is rare, according to our survey of published systematic reviews and protocols. Potential reasons may involve a lack of familiarity or inaccessibility of complex sensitivity evaluation approaches. We first clarify the main concepts and principles of search string evaluation. We then present a simple procedure for estimating the relative recall of a search string. It is based on a pre-defined set of "benchmark" publications. The relative recall, that is, the sensitivity of the search string, is the retrieval overlap between the evaluated search string and a search string that captures only the benchmark publications. If there is little overlap (i.e., low recall or sensitivity), the evaluated search string should be improved to ensure that most of the relevant literature can be captured. The presented benchmarking approach can be applied to one or more online databases or search platforms. It is illustrated by five accessible, hands-on tutorials for commonly used online literature sources. Overall, our work provides an assessment of the current state of search string evaluations in published systematic reviews and protocols. It also paves the way to improve evaluation and reporting practices to make evidence synthesis more transparent and robust.
{"title":"A practical guide to evaluating sensitivity of literature search strings for systematic reviews using relative recall.","authors":"Malgorzata Lagisz, Yefeng Yang, Sarah Young, Shinichi Nakagawa","doi":"10.1017/rsm.2024.6","DOIUrl":"10.1017/rsm.2024.6","url":null,"abstract":"<p><p>Systematic searches of published literature are a vital component of systematic reviews. When search strings are not \"sensitive,\" they may miss many relevant studies limiting, or even biasing, the range of evidence available for synthesis. Concerningly, conducting and reporting evaluations (validations) of the sensitivity of the used search strings is rare, according to our survey of published systematic reviews and protocols. Potential reasons may involve a lack of familiarity or inaccessibility of complex sensitivity evaluation approaches. We first clarify the main concepts and principles of search string evaluation. We then present a simple procedure for estimating a relative recall of a search string. It is based on a pre-defined set of \"benchmark\" publications. The relative recall, that is, the sensitivity of the search string, is the retrieval overlap between the evaluated search string and a search string that captures only the benchmark publications. If there is little overlap (i.e., low recall or sensitivity), the evaluated search string should be improved to ensure that most of the relevant literature can be captured. The presented benchmarking approach can be applied to one or more online databases or search platforms. It is illustrated by five accessible, hands-on tutorials for commonly used online literature sources. Overall, our work provides an assessment of the current state of search string evaluations in published systematic reviews and protocols. It also paves the way to improve evaluation and reporting practices to make evidence synthesis more transparent and robust.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 1","pages":"1-14"},"PeriodicalIF":6.1,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12621535/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-01 | Epub Date: 2025-03-07 | DOI: 10.1017/rsm.2024.15
Darren Rajit, Lan Du, Helena Teede, Joanne Enticott
Bibliographic aggregators like OpenAlex and Semantic Scholar offer scope for automated citation searching within systematic review production, promising increased efficiency. This study aimed to evaluate the performance of automated citation searching compared to standard search strategies and examine factors that influence performance. Automated citation searching was simulated on 27 systematic reviews across the OpenAlex and Semantic Scholar databases, across three study areas (health, environmental management and social policy). Performance, measured by recall (proportion of relevant articles identified), precision (proportion of relevant articles identified from all articles identified), and F1-F3 scores (weighted average of recall and precision), was compared to the performance of search strategies originally employed by each systematic review. The associations between systematic review study area, number of included articles, number of seed articles, seed article type, study type inclusion criteria, API choice, and performance were analyzed. Automated citation searching outperformed the reference standard in terms of precision (p < 0.05) and F1 score (p < 0.05) but failed to outperform in terms of recall (p < 0.05) and F3 score (p < 0.05). Study area influenced the performance of automated citation searching, with performance being higher within the field of environmental management compared to social policy. Automated citation searching is best used as a supplementary search strategy in systematic review production where recall is more important than precision, due to inferior recall and F3 score. However, observed outperformance in terms of F1 score and precision suggests that automated citation searching could be helpful in contexts where precision is as important as recall.
{"title":"Automated citation searching in systematic review production: A simulation study.","authors":"Darren Rajit, Lan Du, Helena Teede, Joanne Enticott","doi":"10.1017/rsm.2024.15","DOIUrl":"10.1017/rsm.2024.15","url":null,"abstract":"<p><p>Bibliographic aggregators like OpenAlex and Semantic Scholar offer scope for automated citation searching within systematic review production, promising increased efficiency. This study aimed to evaluate the performance of automated citation searching compared to standard search strategies and examine factors that influence performance. Automated citation searching was simulated on 27 systematic reviews across the OpenAlex and Semantic Scholar databases, across three study areas (health, environmental management and social policy). Performance, measured by recall (proportion of relevant articles identified), precision (proportion of relevant articles identified from all articles identified), and F1-F3 scores (weighted average of recall and precision), was compared to the performance of search strategies originally employed by each systematic review. The associations between systematic review study area, number of included articles, number of seed articles, seed article type, study type inclusion criteria, API choice, and performance was analyzed. Automated citation searching outperformed the reference standard in terms of precision (p < 0.05) and F1 score (p < 0.05) but failed to outperform in terms of recall (p < 0.05) and F3 score (p < 0.05). Study area influenced the performance of automated citation searching, with performance being higher within the field of environmental management compared to social policy. Automated citation searching is best used as a supplementary search strategy in systematic review production where recall is more important that precision, due to inferior recall and F3 score. However, observed outperformance in terms of F1 score and precision suggests that automated citation searching could be helpful in contexts where precision is as important as recall.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 1","pages":"211-227"},"PeriodicalIF":6.1,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12621532/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}