Pub Date: 2025-07-01 | Epub Date: 2025-06-09 | DOI: 10.1017/rsm.2025.10016
Alexander Pachanov, Catharina Muente, Julian Hirt, Dawid Pieper
We developed a geographic search filter for retrieving studies about Germany from PubMed. In this study, we aimed to translate and validate it for use in Embase and MEDLINE(R) ALL via Ovid. Adjustments included aligning PubMed field tags with Ovid's syntax, adding a keyword heading field for both databases, and incorporating a correspondence address field for Embase. To validate the filters, we used systematic reviews (SRs) that included studies about Germany without imposing geographic restrictions on their search strategies. Subsequently, we conducted (i) case studies (CSs), applying the filters to the search strategies of the 17 eligible SRs; and (ii) aggregation studies, combining the SRs' search strategies with the 'OR' operator and applying the filters. In the CSs, the filters demonstrated a median sensitivity of 100% in both databases, with interquartile ranges (IQRs) of 100%-100% in Embase and 93.75%-100% in MEDLINE(R) ALL. Median precision improved from 0.11% (IQR: 0.05%-0.30%) to 1.65% (IQR: 0.78%-3.06%) and from 0.19% (IQR: 0.11%-0.60%) to 5.13% (IQR: 1.77%-6.85%), while the number needed to read (NNR) decreased from 893.40 (IQR: 354.81-2,219.58) to 60.44 (IQR: 33.94-128.97) and from 513.29 (IQR: 167.35-930.99) to 19.50 (IQR: 14.66-59.35) for Embase and MEDLINE(R) ALL, respectively. In the aggregation studies, the overall sensitivities were 98.19% and 97.14%, with NNRs of 83.29 and 33.34 in Embase and MEDLINE(R) ALL, respectively. The new Embase and MEDLINE(R) ALL filters for Ovid reliably retrieve studies about Germany, enhancing search precision. The approach described in our study can support search filter developers in translating filters for various topics and contexts.
{"title":"Translation and validation of a geographic search filter to identify studies about Germany in Embase (Ovid) and MEDLINE(R) ALL (Ovid).","authors":"Alexander Pachanov, Catharina Muente, Julian Hirt, Dawid Pieper","doi":"10.1017/rsm.2025.10016","DOIUrl":"10.1017/rsm.2025.10016","url":null,"abstract":"<p><p>We developed a geographic search filter for retrieving studies about Germany from PubMed. In this study, we aimed to translate and validate it for use in Embase and MEDLINE(R) ALL via Ovid. Adjustments included aligning PubMed field tags with Ovid's syntax, adding a keyword heading field for both databases, and incorporating a correspondence address field for Embase. To validate the filters, we used systematic reviews (SRs) that included studies about Germany without imposing geographic restrictions on their search strategies. Subsequently, we conducted (i) case studies (CSs), applying the filters to the search strategies of the 17 eligible SRs; and (ii) aggregation studies, combining the SRs' search strategies with the 'OR' operator and applying the filters. In the CSs, the filters demonstrated a median sensitivity of 100% in both databases, with interquartile ranges (IQRs) of 100%-100% in Embase and 93.75%-100% in MEDLINE(R) ALL. Median precision improved from 0.11% (IQR: 0.05%-0.30%) to 1.65% (IQR: 0.78%-3.06%) and from 0.19% (IQR: 0.11%-0.60%) to 5.13% (IQR: 1.77%-6.85%), while the number needed to read (NNR) decreased from 893.40 (IQR: 354.81-2,219.58) to 60.44 (IQR: 33.94-128.97) and from 513.29 (IQR: 167.35-930.99) to 19.50 (IQR: 14.66-59.35) for Embase and MEDLINE(R) ALL, respectively. In the aggregation studies, the overall sensitivities were 98.19% and 97.14%, with NNRs of 83.29 and 33.34 in Embase and MEDLINE(R) ALL, respectively. The new Embase and MEDLINE(R) ALL filters for Ovid reliably retrieve studies about Germany, enhancing search precision. The approach described in our study can support search filter developers in translating filters for various topics and contexts.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 4","pages":"688-700"},"PeriodicalIF":6.1,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12527497/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-01 | Epub Date: 2025-04-25 | DOI: 10.1017/rsm.2025.20
Zahra Premji, Chris Cooper
Trials registry records present a particular deduplication challenge compared to studies reported in journals and exported from bibliographic databases such as MEDLINE. We demonstrate why this is the case and propose a method to deduplicate registry records from the WHO International Clinical Trials Registry Platform (ICTRP) and ClinicalTrials.gov (CTG), specifically in the reference management tool EndNote (desktop version). We believe that our method is not only more efficient but also minimises the risk of registry records being incorrectly removed as duplicates during automated deduplication. The method has seven steps and is detailed in this tutorial as a step-by-step guide.
{"title":"Same, same, but different: A method to harmonise and deduplicate study records from WHO ICTRP and ClinicalTrials.gov prior to screening.","authors":"Zahra Premji, Chris Cooper","doi":"10.1017/rsm.2025.20","DOIUrl":"10.1017/rsm.2025.20","url":null,"abstract":"<p><p>Trials registry records represent a challenge in deduplication compared to deduplicating studies reported in journals and exported from bibliographic databases such as MEDLINE. We demonstrate why this is the case and propose a method to deduplicate registry records from the WHO International Clinical Trials Registry Platform (ICTRP) and ClinicalTrials.gov (CTG) specifically in the reference management tool EndNote (desktop version). We believe that our method is not only more efficient but that it will minimise the risk of registry records being incorrectly removed as duplicates in automated deduplication. The method has seven steps and is detailed in this tutorial as a step-by-step guide.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 4","pages":"587-600"},"PeriodicalIF":6.1,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12527485/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-01 | Epub Date: 2025-04-24 | DOI: 10.1017/rsm.2025.16
Justin Clark, Belinda Barton, Loai Albarqouni, Oyungerel Byambasuren, Tanisha Jowsey, Justin Keogh, Tian Liang, Christian Moro, Hayley O'Neill, Mark Jones
Introduction: With the increasing accessibility of tools such as ChatGPT, Copilot, DeepSeek, Dall-E, and Gemini, generative artificial intelligence (GenAI) has been positioned as a potential time-saving tool for research, especially for synthesising evidence. Our objective was to determine whether GenAI can assist with evidence synthesis by assessing its accuracy, error rates, and time savings compared to the traditional expert-driven approach.
Methods: To systematically review the evidence, we searched five databases on 17 January 2025, synthesised outcomes reporting on the accuracy, error rates, or time taken, and appraised the risk-of-bias using a modified version of QUADAS-2.
Results: We identified 3,071 unique records, 19 of which were included in our review. Most studies had a high or unclear risk-of-bias in Domain 1A (review selection), Domain 2A (GenAI conduct), and Domain 1B (applicability of results). When used for (1) searching, GenAI missed 68% to 96% of studies (median = 91%); (2) screening, it made incorrect inclusion decisions in 0% to 29% of cases (median = 10%) and incorrect exclusion decisions in 1% to 83% (median = 28%); (3) data extraction, it made incorrect extractions in 4% to 31% (median = 14%); and (4) risk-of-bias assessment, it made incorrect judgements in 10% to 56% (median = 27%).
Conclusion: Our review shows that the current evidence does not support GenAI use in evidence synthesis without human involvement or oversight. However, for most tasks other than searching, GenAI may have a role in assisting humans with evidence synthesis.
{"title":"Generative artificial intelligence use in evidence synthesis: A systematic review.","authors":"Justin Clark, Belinda Barton, Loai Albarqouni, Oyungerel Byambasuren, Tanisha Jowsey, Justin Keogh, Tian Liang, Christian Moro, Hayley O'Neill, Mark Jones","doi":"10.1017/rsm.2025.16","DOIUrl":"10.1017/rsm.2025.16","url":null,"abstract":"<p><strong>Introduction: </strong>With the increasing accessibility of tools such as ChatGPT, Copilot, DeepSeek, Dall-E, and Gemini, generative artificial intelligence (GenAI) has been poised as a potential, research timesaving tool, especially for synthesising evidence. Our objective was to determine whether GenAI can assist with evidence synthesis by assessing its performance using its accuracy, error rates, and time savings compared to the traditional expert-driven approach.</p><p><strong>Methods: </strong>To systematically review the evidence, we searched five databases on 17 January 2025, synthesised outcomes reporting on the accuracy, error rates, or time taken, and appraised the risk-of-bias using a modified version of QUADAS-2.</p><p><strong>Results: </strong>We identified 3,071 unique records, 19 of which were included in our review. Most studies had a high or unclear risk-of-bias in Domain 1A: review selection, Domain 2A: GenAI conduct, and Domain 1B: applicability of results. When used for (1) searching GenAI missed 68% to 96% (median = 91%) of studies, (2) screening made incorrect inclusion decisions ranging from 0% to 29% (median = 10%); and incorrect exclusion decisions ranging from 1% to 83% (median = 28%), (3) incorrect data extractions ranging from 4% to 31% (median = 14%), (4) incorrect risk-of-bias assessments ranging from 10% to 56% (median = 27%).</p><p><strong>Conclusion: </strong>Our review shows that the current evidence does not support GenAI use in evidence synthesis without human involvement or oversight. However, for most tasks other than searching, GenAI may have a role in assisting humans with evidence synthesis.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 4","pages":"601-619"},"PeriodicalIF":6.1,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12527500/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-01 | Epub Date: 2025-04-24 | DOI: 10.1017/rsm.2025.19
Xiangji Ying, Konstantinos I Bougioukas, Dawid Pieper, Evan Mayo-Wilson
When conducting overviews of reviews, investigators must measure and describe the extent to which included systematic reviews (SRs) contain the same primary studies. The corrected covered area (CCA) quantifies overlap by counting primary studies included across a set of SRs. In this article, we introduce a modification to the CCA, the weighted CCA (wCCA), which accounts for differences in the information contributed by primary studies. The wCCA adjusts the original CCA by weighting studies based on the square roots of their sample sizes. By weighting primary studies according to their precision, wCCA provides a useful and complementary representation of overlap in evidence syntheses.
{"title":"Weighted corrected covered area (wCCA): A measure of informational overlap among reviews.","authors":"Xiangji Ying, Konstantinos I Bougioukas, Dawid Pieper, Evan Mayo-Wilson","doi":"10.1017/rsm.2025.19","DOIUrl":"10.1017/rsm.2025.19","url":null,"abstract":"<p><p>When conducting overviews of reviews, investigators must measure and describe the extent to which included systematic reviews (SRs) contain the same primary studies. The corrected covered area (CCA) quantifies overlap by counting primary studies included across a set of SRs. In this article, we introduce a modification to the CCA, the weighted CCA (wCCA), which accounts for differences in information contributed by primary studies. The wCCA adjusts the original CCA by weighting studies based on the square roots of their sample sizes. By weighting primary studies according to their precision, wCCA provides a useful and complementary representation of overlap in evidence syntheses .</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 4","pages":"701-708"},"PeriodicalIF":6.1,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12527530/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-05-01 | DOI: 10.1017/rsm.2025.10
Ziren Jiang, Jialing Liu, Demissie Alemayehu, Joseph C Cappelleri, Devin Abrahami, Yong Chen, Haitao Chu
Matching-adjusted indirect comparison (MAIC) has been increasingly applied in health technology assessments (HTA). By reweighting subjects from a trial with individual participant data (IPD) to match the summary statistics of covariates in another trial with aggregate data (AgD), MAIC enables a comparison of the interventions for the AgD trial population. However, when there are imbalances in effect modifiers with different magnitudes of modification across treatments, contradictory conclusions may arise if MAIC is performed with the IPD and AgD swapped between trials. This can lead to the "MAIC paradox," where different entities reach opposing conclusions about which treatment is more effective, despite analyzing the same data. In this paper, we use synthetic data to illustrate this paradox and emphasize the importance of clearly defining the target population in HTA submissions. Additionally, we recommend making de-identified IPD available to HTA agencies, enabling further indirect comparisons that better reflect the overall population represented by both IPD and AgD trials, as well as other relevant target populations for policy decisions. This would help ensure more accurate and consistent assessments of comparative effectiveness.
{"title":"A critical assessment of matching-adjusted indirect comparisons in relation to target populations.","authors":"Ziren Jiang, Jialing Liu, Demissie Alemayehu, Joseph C Cappelleri, Devin Abrahami, Yong Chen, Haitao Chu","doi":"10.1017/rsm.2025.10","DOIUrl":"10.1017/rsm.2025.10","url":null,"abstract":"<p><p>Matching-adjusted indirect comparison (MAIC) has been increasingly applied in health technology assessments (HTA). By reweighting subjects from a trial with individual participant data (IPD) to match the summary statistics of covariates in another trial with aggregate data (AgD), MAIC enables a comparison of the interventions for the AgD trial population. However, when there are imbalances in effect modifiers with different magnitudes of modification across treatments, contradictory conclusions may arise if MAIC is performed with the IPD and AgD swapped between trials. This can lead to the \"MAIC paradox,\" where different entities reach opposing conclusions about which treatment is more effective, despite analyzing the same data. In this paper, we use synthetic data to illustrate this paradox and emphasize the importance of clearly defining the target population in HTA submissions. Additionally, we recommend making de-identified IPD available to HTA agencies, enabling further indirect comparisons that better reflect the overall population represented by both IPD and AgD trials, as well as other relevant target populations for policy decisions. This would help ensure more accurate and consistent assessments of comparative effectiveness.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 3","pages":"569-574"},"PeriodicalIF":6.1,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12527493/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-05-01 | DOI: 10.1017/rsm.2025.14
Yi Zhou, Ao Huang, Satoshi Hattori
In prognosis studies with time-to-event outcomes, the survival of groups with high/low biomarker expression is often estimated by the Kaplan-Meier method, and the difference between groups is measured by hazard ratios (HRs). Since high/low expression is usually determined by study-specific cutoff values, synthesizing only HRs to summarize the prognostic capacity of a biomarker introduces heterogeneity into the meta-analysis. The time-dependent summary receiver operating characteristics (SROC) curve was proposed as a cutoff-free summary of prognostic capacity, extending the SROC curve from meta-analysis of diagnostic studies. However, estimates of the time-dependent SROC curve may be threatened by reporting bias, in that studies with significant outcomes, such as HRs, are more likely to be published and selected in meta-analyses. Under this conjecture, this paper proposes a sensitivity analysis method for quantifying and adjusting for reporting bias on the time-dependent SROC curve. We model the publication process as determined by the significance of the HRs and introduce a sensitivity analysis method based on the conditional likelihood constrained by expected proportions of published studies. Simulation studies showed that the proposed method could reduce reporting bias given a correctly specified marginal selection probability. The proposed method is illustrated on a real-world meta-analysis of Ki67 for breast cancer.
{"title":"Sensitivity analysis for reporting bias on the time-dependent summary receiver operating characteristics curve in meta-analysis of prognosis studies with time-to-event outcomes.","authors":"Yi Zhou, Ao Huang, Satoshi Hattori","doi":"10.1017/rsm.2025.14","DOIUrl":"10.1017/rsm.2025.14","url":null,"abstract":"<p><p>In prognosis studies with time-to-event outcomes, the survivals of groups with high/low biomarker expression are often estimated by the Kaplan-Meier method, and the difference between groups is measured by the hazard ratios (HRs). Since the high/low expressions are usually determined by study-specific cutoff values, synthesizing only HRs for summarizing the prognostic capacity of a biomarker brings heterogeneity in the meta-analysis. The time-dependent summary receiver operating characteristics (SROC) curve was proposed as a cutoff-free summary of the prognostic capacity, extended from the SROC curve in meta-analysis of diagnostic studies. However, estimates of the time-dependent SROC curve may be threatened by reporting bias in that studies with significant outcomes, such as HRs, are more likely to be published and selected in meta-analyses. Under this conjecture, this paper proposes a sensitivity analysis method for quantifying and adjusting reporting bias on the time-dependent SROC curve. We model the publication process determined by the significance of the HRs and introduce a sensitivity analysis method based on the conditional likelihood constrained by some expected proportions of published studies. Simulation studies showed that the proposed method could reduce reporting bias given the correctly-specified marginal selection probability. The proposed method is illustrated on the real-world meta-analysis of Ki67 for breast cancer.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 3","pages":"528-549"},"PeriodicalIF":6.1,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12527534/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-05-01 | DOI: 10.1017/rsm.2025.11
Corentin J Gosling, Samuele Cortese, Marco Solmi, Belen Haza, Eduard Vieta, Richard Delorme, Paolo Fusar-Poli, Joaquim Radua
A fundamental pillar of science is the estimation of the effect size of associations. However, this task is sometimes difficult and error-prone. To facilitate this process, the R package metaConvert automatically calculates and flexibly converts multiple effect size measures. It applies more than 120 formulas to convert any relevant input data into Cohen's d, Hedges' g, mean difference, odds ratio, risk ratio, incidence rate ratio, correlation coefficient, Fisher's r-to-z transformed correlation coefficient, variability ratio, coefficient of variation ratio, or number needed to treat. Researchers unfamiliar with R can use this software through a browser-based graphical interface (https://metaconvert.org/). We hope this suite will help researchers in the life sciences and other disciplines estimate and convert effect sizes more easily and accurately.
{"title":"metaConvert: an automatic suite for estimation of 11 different effect size measures and flexible conversion across them.","authors":"Corentin J Gosling, Samuele Cortese, Marco Solmi, Belen Haza, Eduard Vieta, Richard Delorme, Paolo Fusar-Poli, Joaquim Radua","doi":"10.1017/rsm.2025.11","DOIUrl":"10.1017/rsm.2025.11","url":null,"abstract":"<p><p>A fundamental pillar of science is the estimation of the effect size of associations. However, this task is sometimes difficult and error-prone. To facilitate this process, the R package metaConvert automatically calculates and flexibly converts multiple effect size measures. It applies more than 120 formulas to convert any relevant input data into Cohen's <i>d</i>, Hedges' <i>g</i>, mean difference, odds ratio, risk ratio, incidence rate ratio, correlation coefficient, Fisher's r-to-z transformed correlation coefficient, variability ratio, coefficient of variation ratio, or number needed to treat. Researchers unfamiliar with R can use this software through a browser-based graphical interface (https://metaconvert.org/). We hope this suite will help researchers in the life sciences and other disciplines estimate and convert effect sizes more easily and accurately.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 3","pages":"575-586"},"PeriodicalIF":6.1,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12527507/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-05-01 | DOI: 10.1017/rsm.2025.17
A E Ades, Annabel L Davies, David M Phillippo, Hugo Pedder, Howard Thom, Beatrice Downing, Deborah M Caldwell, Nicky J Welton
The treatment recommendation based on a network meta-analysis (NMA) is usually the single treatment with the highest expected value (EV) on an evaluative function. We explore approaches that recommend multiple treatments and that penalise uncertainty, making them suitable for risk-averse decision-makers. We introduce loss-adjusted EV (LaEV) and compare it to GRADE and three probability-based rankings. We define properties of a valid ranking under uncertainty and other desirable properties of ranking systems. A two-stage process is proposed: the first stage identifies treatments superior to the reference treatment; the second identifies those that are also within a minimal clinically important difference (MCID) of the best treatment. Decision rules and ranking systems are compared on stylised examples and 10 NMAs used in NICE (National Institute for Health and Care Excellence) guidelines. Only LaEV reliably delivers valid rankings under uncertainty and has all the desirable properties. In 10 NMAs comparing between 5 and 41 treatments, an EV decision-maker would recommend 4-14 treatments, and LaEV 0-3 (median 2) fewer. GRADE rules give rise to anomalies, and, like the probability-based rankings, the number of treatments recommended depends on arbitrary probability cutoffs. Among treatments that are superior to the reference, GRADE privileges the more uncertain ones, and in 3/10 cases, GRADE failed to recommend the treatment with the highest EV and LaEV. A two-stage approach based on the MCID ensures that EV- and LaEV-based rules recommend a clinically appropriate number of treatments. For a risk-averse decision-maker, LaEV is conservative, simple to implement, and has an independent theoretical foundation.
{"title":"Treatment recommendations based on network meta-analysis: Rules for risk-averse decision-makers.","authors":"A E Ades, Annabel L Davies, David M Phillippo, Hugo Pedder, Howard Thom, Beatrice Downing, Deborah M Caldwell, Nicky J Welton","doi":"10.1017/rsm.2025.17","DOIUrl":"10.1017/rsm.2025.17","url":null,"abstract":"<p><p>The treatment recommendation based on a network meta-analysis (NMA) is usually the single treatment with the highest expected value (EV) on an evaluative function. We explore approaches that recommend multiple treatments and that penalise uncertainty, making them suitable for risk-averse decision-makers. We introduce loss-adjusted EV (LaEV) and compare it to GRADE and three probability-based rankings. We define properties of a valid ranking under uncertainty and other desirable properties of ranking systems. A two-stage process is proposed: the first identifies treatments superior to the reference treatment; the second identifies those that are also within a minimal clinically important difference (MCID) of the best treatment. Decision rules and ranking systems are compared on stylised examples and 10 NMAs used in NICE (National Institute of Health and Care Excellence) guidelines. Only LaEV reliably delivers valid rankings under uncertainty and has all the desirable properties. In 10 NMAs comparing between 5 and 41 treatments, an EV decision maker would recommend 4-14 treatments, and LaEV 0-3 (median 2) fewer. GRADE rules give rise to anomalies, and, like the probability-based rankings, the number of treatments recommended depends on arbitrary probability cutoffs. Among treatments that are superior to the reference, GRADE privileges the more uncertain ones, and in 3/10 cases, GRADE failed to recommend the treatment with the highest EV and LaEV. A two-stage approach based on MCID ensures that EV- and LaEV-based rules recommend a clinically appropriate number of treatments. For a risk-averse decision maker, LaEV is conservative, simple to implement, and has an independent theoretical foundation.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 3","pages":"550-568"},"PeriodicalIF":6.1,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12527546/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-05-01 | DOI: 10.1017/rsm.2025.9
Nicholas J Dobbins
Background: Biomedical entity normalization is critical to biomedical research because the richness of free-text clinical data, such as progress notes, can often be fully leveraged only after translating words and phrases into structured and coded representations suitable for analysis. Large language models (LLMs), in turn, have shown great potential and high performance in a variety of natural language processing (NLP) tasks, but their application to normalization remains understudied.
Methods: We applied both proprietary and open-source LLMs in combination with several rule-based normalization systems commonly used in biomedical research. We used a two-step LLM integration approach: (1) using an LLM to generate alternative phrasings of a source utterance, and (2) using an LLM to prune candidate UMLS concepts, with a variety of prompting methods. We measure results by $F_\beta$, favoring recall over precision, and by F1.
Results: We evaluated a total of 5,523 concept terms and text contexts from a publicly available dataset of human-annotated biomedical abstracts. Incorporating GPT-3.5-turbo increased overall $F_\beta$ and F1 in normalization systems by +16.5 and +16.2 (OpenAI embeddings), +9.5 and +7.3 (MetaMapLite), +13.9 and +10.9 (QuickUMLS), and +10.5 and +10.3 (BM25), while the open-source Vicuna model achieved +20.2 and +21.7 (OpenAI embeddings), +10.8 and +12.2 (MetaMapLite), +14.7 and +15.0 (QuickUMLS), and +15.6 and +18.7 (BM25).
Conclusions: Existing general-purpose LLMs, both proprietary and open-source, can be leveraged to greatly improve normalization performance using existing tools, with no fine-tuning.
{"title":"Generalizable and scalable multistage biomedical concept normalization leveraging large language models.","authors":"Nicholas J Dobbins","doi":"10.1017/rsm.2025.9","DOIUrl":"10.1017/rsm.2025.9","url":null,"abstract":"<p><strong>Background: </strong>Biomedical entity normalization is critical to biomedical research because the richness of free-text clinical data, such as progress notes, can often be fully leveraged only after translating words and phrases into structured and coded representations suitable for analysis. Large Language Models (LLMs), in turn, have shown great potential and high performance in a variety of natural language processing (NLP) tasks, but their application for normalization remains understudied.</p><p><strong>Methods: </strong>We applied both proprietary and open-source LLMs in combination with several rule-based normalization systems commonly used in biomedical research. We used a two-step LLM integration approach, (1) using an LLM to generate alternative phrasings of a source utterance, and (2) to prune candidate UMLS concepts, using a variety of prompting methods. We measure results by $F_{beta }$, where we favor recall over precision, and F1.</p><p><strong>Results: </strong>We evaluated a total of 5,523 concept terms and text contexts from a publicly available dataset of human-annotated biomedical abstracts. Incorporating GPT-3.5-turbo increased overall $F_{beta }$ and F1 in normalization systems +16.5 and +16.2 (OpenAI embeddings), +9.5 and +7.3 (MetaMapLite), +13.9 and +10.9 (QuickUMLS), and +10.5 and +10.3 (BM25), while the open-source Vicuna model achieved +20.2 and +21.7 (OpenAI embeddings), +10.8 and +12.2 (MetaMapLite), +14.7 and +15 (QuickUMLS), and +15.6 and +18.7 (BM25).</p><p><strong>Conclusions: </strong>Existing general-purpose LLMs, both propriety and open-source, can be leveraged to greatly improve normalization performance using existing tools, with no fine-tuning.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 3","pages":"479-490"},"PeriodicalIF":6.1,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12527512/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-05-01 | DOI: 10.1017/rsm.2025.12
Angelika Eisele-Metzger, Judith-Lisa Lieberum, Markus Toews, Waldemar Siemens, Felix Heilmeyer, Christian Haverkamp, Daniel Boehringer, Joerg J Meerpohl
Systematic reviews are essential for evidence-based health care, but conducting them is time- and resource-intensive. To date, efforts have been made to accelerate and (semi-)automate various steps of systematic reviews through the use of artificial intelligence (AI), and the emergence of large language models (LLMs) promises further opportunities. One crucial but complex task within systematic review conduct is assessing the risk of bias (RoB) of included studies. Therefore, the aim of this study was to test the LLM Claude 2 for RoB assessment of 100 randomized controlled trials, published in English from 2013 onwards, using the revised Cochrane risk of bias tool ('RoB 2'; involving judgements for five specific domains and an overall judgement). We assessed the agreement of Claude's RoB judgements with human judgements published in Cochrane reviews. The observed agreement between Claude and Cochrane authors ranged from 41% for the overall judgement to 71% for domain 4 ('outcome measurement'). Cohen's κ was lowest for domain 5 ('selective reporting'; 0.10 (95% confidence interval (CI): -0.10 to 0.31)) and highest for domain 3 ('missing data'; 0.31 (95% CI: 0.10 to 0.52)), indicating slight to fair agreement. Fair agreement was found for the overall judgement (Cohen's κ: 0.22 (95% CI: 0.06 to 0.38)). Sensitivity analyses using alternative prompting techniques or the more recent version Claude 3 did not result in substantial changes. Currently, Claude's RoB 2 judgements cannot replace human RoB assessment. However, the potential of LLMs to support RoB assessment should be further explored.
{"title":"Exploring the potential of Claude 2 for risk of bias assessment: Using a large language model to assess randomized controlled trials with RoB 2.","authors":"Angelika Eisele-Metzger, Judith-Lisa Lieberum, Markus Toews, Waldemar Siemens, Felix Heilmeyer, Christian Haverkamp, Daniel Boehringer, Joerg J Meerpohl","doi":"10.1017/rsm.2025.12","DOIUrl":"10.1017/rsm.2025.12","url":null,"abstract":"<p><p>Systematic reviews are essential for evidence-based health care, but conducting them is time- and resource-consuming. To date, efforts have been made to accelerate and (semi-)automate various steps of systematic reviews through the use of artificial intelligence (AI) and the emergence of large language models (LLMs) promises further opportunities. One crucial but complex task within systematic review conduct is assessing the risk of bias (RoB) of included studies. Therefore, the aim of this study was to test the LLM Claude 2 for RoB assessment of 100 randomized controlled trials, published in English language from 2013 onwards, using the revised Cochrane risk of bias tool ('RoB 2'; involving judgements for five specific domains and an overall judgement). We assessed the agreement of RoB judgements by Claude with human judgements published in Cochrane reviews. The observed agreement between Claude and Cochrane authors ranged from 41% for the overall judgement to 71% for domain 4 ('outcome measurement'). Cohen's κ was lowest for domain 5 ('selective reporting'; 0.10 (95% confidence interval (CI): -0.10-0.31)) and highest for domain 3 ('missing data'; 0.31 (95% CI: 0.10-0.52)), indicating slight to fair agreement. Fair agreement was found for the overall judgement (Cohen's κ: 0.22 (95% CI: 0.06-0.38)). Sensitivity analyses using alternative prompting techniques or the more recent version Claude 3 did not result in substantial changes. Currently, Claude's RoB 2 judgements cannot replace human RoB assessment. However, the potential of LLMs to support RoB assessment should be further explored.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 3","pages":"491-508"},"PeriodicalIF":6.1,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12527486/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}