Hans-Peter Piepho, Laurence V. Madden, Emlyn R. Williams
Methods of network meta-analysis (NMA) can be classified into arm-based and contrast-based approaches. There are several arm-based approaches, and some of these have been criticized because they recover inter-study information and hence do not obey the principle of concurrent control. Here, we point out that recovery of inter-study information in arm-based NMA can be prevented by fitting a fixed main effect for studies. Advantages of arm-based NMA are discussed.
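To make the fixed-study-effects point concrete, here is a minimal sketch with invented arm-level data, using ordinary least squares as a stand-in for the mixed models used in real NMA: once each study gets its own fixed main effect, the treatment contrast is identified only by within-study comparisons, so no inter-study information is recovered.

```python
import numpy as np

# Hypothetical arm-level data: three two-arm studies comparing treatments A and B.
studies = np.array([0, 0, 1, 1, 2, 2])   # study index of each arm
treat   = np.array([0, 1, 0, 1, 0, 1])   # 0 = A, 1 = B
y       = np.array([1.0, 1.8, 2.0, 2.9, 0.5, 1.4])  # arm-level mean outcomes

# Arm-based model with a FIXED main effect per study:
#   y = alpha[study] + delta * treat
X = np.column_stack([np.eye(3)[studies], treat])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
delta = beta[-1]

# For balanced two-arm studies, delta equals the mean of the
# within-study differences: only concurrent comparisons contribute.
within = np.mean([1.8 - 1.0, 2.9 - 2.0, 1.4 - 0.5])
print(round(delta, 6), round(within, 6))
```

The equality of the two printed numbers is the point: the study dummies absorb all between-study level differences, which is what blocks recovery of inter-study information.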
The use of fixed study main effects in arm-based network meta-analysis. Research Synthesis Methods, 15(5), 747–750. doi: 10.1002/jrsm.1721.
Population-adjusted indirect comparisons, developed in the 2010s, enable comparisons between two treatments in different studies by balancing patient characteristics in the case where individual patient-level data (IPD) are available for only one study. Health technology assessment (HTA) bodies increasingly rely on these methods to inform funding decisions, typically using unanchored indirect comparisons (i.e., without a common comparator), due to the need to evaluate comparative efficacy and safety for single-arm trials. Unanchored matching-adjusted indirect comparison (MAIC) and unanchored simulated treatment comparison (STC) are currently the only two approaches available for population-adjusted indirect comparisons based on single-arm trials. However, there is a notable underutilisation of unanchored STC in HTA, largely due to a lack of understanding of its implementation. We therefore develop a novel way to implement unanchored STC by incorporating standardisation/marginalisation and the NORmal To Anything (NORTA) algorithm for sampling covariates. This methodology aims to derive a suitable marginal treatment effect without aggregation bias for HTA evaluations. We use a non-parametric bootstrap and propose separately calculating the standard error for the IPD study and the comparator study to ensure the appropriate quantification of the uncertainty associated with the estimated treatment effect. The performance of our proposed unanchored STC approach is evaluated through a comprehensive simulation study focused on binary outcomes. Our findings demonstrate that the proposed approach is asymptotically unbiased. We argue that unanchored STC should be considered when conducting unanchored indirect comparisons with single-arm studies, presenting a robust approach for HTA decision-making.
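The NORTA step the abstract mentions can be illustrated in a few lines. This is a generic NORTA sketch with hypothetical covariates (age and sex), not the authors' actual implementation: draw correlated standard normals, push them through the normal CDF to get correlated uniforms, then apply the inverse CDF of each target marginal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Target correlation on the latent normal scale (assumed, for illustration).
corr = np.array([[1.0, 0.5],
                 [0.5, 1.0]])
z = rng.multivariate_normal(np.zeros(2), corr, size=10_000)
u = stats.norm.cdf(z)                 # correlated uniforms in (0, 1)

# Map uniforms to the desired marginals.
age  = stats.norm.ppf(u[:, 0], loc=60, scale=10)  # continuous covariate
male = (u[:, 1] < 0.6).astype(int)                # binary covariate, Pr = 0.6

print(round(age.mean(), 1), round(male.mean(), 3))
```

In an STC workflow, covariate vectors sampled this way would be fed through the fitted outcome model and averaged to obtain a marginal (population-level) treatment effect.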
Shijie Ren, Sa Ren, Nicky J. Welton, Mark Strong (2024). Advancing unanchored simulated treatment comparisons: A novel implementation and simulation study. Research Synthesis Methods, 15(4), 657–670. doi: 10.1002/jrsm.1718.
Some patients benefit from a treatment while others may benefit less or not at all. We previously developed a two-stage network meta-regression prediction model that synthesizes randomized trials and evaluates how treatment effects vary across patient characteristics. In this article, we extend this model to combine different types of data in different formats: aggregate data (AD) and individual participant data (IPD) from randomized and non-randomized evidence. In the first stage, a prognostic model is developed to predict the baseline risk of the outcome using a large cohort study. In the second stage, we recalibrate this prognostic model to improve predictions for patients enrolled in randomized trials. In the third stage, we use the baseline risk as an effect modifier in a network meta-regression model combining AD and IPD from randomized clinical trials to estimate heterogeneous treatment effects. We illustrate the approach in a re-analysis of a network of studies comparing three drugs for relapsing–remitting multiple sclerosis. Several patient characteristics influence the baseline risk of relapse, which in turn modifies the effect of the drugs. The proposed model makes personalized predictions for health outcomes under several treatment options and encompasses all relevant randomized and non-randomized evidence.
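A stripped-down sketch of the baseline-risk-as-effect-modifier idea, with invented numbers and a single pairwise comparison standing in for the full (typically Bayesian) network meta-regression: study-level log odds ratios are regressed on the centred predicted baseline risk.

```python
import numpy as np

# Hypothetical study-level inputs: predicted baseline risk of relapse and
# the observed log odds ratio of treatment vs. control in each study.
baseline_risk = np.array([0.10, 0.20, 0.30, 0.40, 0.50])
log_or        = np.array([-0.20, -0.35, -0.50, -0.65, -0.80])

# Meta-regression: log OR = b0 + b1 * (risk - mean risk).
X = np.column_stack([np.ones_like(baseline_risk),
                     baseline_risk - baseline_risk.mean()])
b0, b1 = np.linalg.lstsq(X, log_or, rcond=None)[0]

# b0: treatment effect at the average baseline risk.
# b1: change in the (log-scale) effect per unit increase in baseline risk.
print(round(b0, 3), round(b1, 3))
```

A negative slope here would mean the drug helps higher-risk patients more, which is exactly the kind of heterogeneity the three-stage model is built to capture and then feed into personalized predictions.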
Konstantina Chalkou, Tasnim Hamza, Pascal Benkert, Jens Kuhle, Chiara Zecca, Gabrielle Simoneau, Fabio Pellegrini, Andrea Manca, Matthias Egger, Georgia Salanti (2024). Combining randomized and non-randomized data to predict heterogeneous effects of competing treatments. Research Synthesis Methods, 15(4), 641–656. doi: 10.1002/jrsm.1717.
Phi-Yen Nguyen, Joanne E. McKenzie, Simon L. Turner, Matthew J. Page, Steve McDonald
Background
Interrupted time series (ITS) studies contribute importantly to systematic reviews of population-level interventions. We aimed to develop and validate search filters to retrieve ITS studies in MEDLINE and PubMed.
Methods
A total of 1017 known ITS studies (published 2013–2017) were analysed using text mining to generate candidate terms. A control set of 1398 time-series studies was used to select differentiating terms. Various combinations of candidate terms were iteratively tested to generate three search filters. An independent set of 700 ITS studies was used to validate the filters' sensitivities. The filters were test-run in Ovid MEDLINE and the records randomly screened for ITS studies to determine their precision. Finally, all MEDLINE filters were translated to PubMed format and their sensitivities in PubMed were estimated.
Results
Three search filters were created in MEDLINE: a precision-maximising filter with high precision (78%; 95% CI 74%–82%) but moderate sensitivity (63%; 59%–66%), most appropriate when there are limited resources to screen studies; a sensitivity-and-precision-maximising filter with higher sensitivity (81%; 77%–83%) but lower precision (32%; 28%–36%), providing a balance between expediency and comprehensiveness; and a sensitivity-maximising filter with high sensitivity (88%; 85%–90%) but likely very low precision, useful when combined with specific content terms. Similar sensitivity estimates were found for PubMed versions.
Conclusion
Our filters strike different balances between comprehensiveness and screening workload and suit different research needs. Retrieval of ITS studies would be improved if authors identified the ITS design in the titles.
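The sensitivity and precision figures above are simple proportions with confidence intervals. A small helper shows the arithmetic, using a normal-approximation interval and hypothetical counts (the paper's exact denominators and interval method are not given here):

```python
import math

def prop_ci(successes, n, z=1.96):
    """Point estimate and approximate 95% CI for a proportion,
    e.g. the sensitivity or precision of a search filter."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# Hypothetical: a filter retrieving 616 of the 700 validation ITS studies.
sens, lo, hi = prop_ci(616, 700)
print(f"sensitivity {sens:.0%} (95% CI {lo:.0%}-{hi:.0%})")
```

Precision would be computed the same way, with retrieved-and-relevant records over total records screened.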
Development of a search filter to retrieve reports of interrupted time series studies from MEDLINE and PubMed. Research Synthesis Methods, 15(4), 627–640. doi: 10.1002/jrsm.1716.
Qusai Khraisha, Sophie Put, Johanna Kappenberg, Azza Warraitch, Kristin Hadfield
Systematic reviews are vital for guiding practice, research and policy, although they are often slow and labour-intensive. Large language models (LLMs) could speed up and automate systematic reviews, but their performance in such tasks has yet to be comprehensively evaluated against humans, and no study has tested Generative Pre-Trained Transformer (GPT)-4, the biggest LLM so far. This pre-registered study uses a “human-out-of-the-loop” approach to evaluate GPT-4's capability in title/abstract screening, full-text review and data extraction across various literature types and languages. Although GPT-4 had accuracy on par with human performance in some tasks, results were skewed by chance agreement and dataset imbalance. Adjusting for these caused performance scores to drop across all stages: for data extraction, performance was moderate, and for screening, it ranged from none in highly balanced literature datasets (~1:1) to moderate in those datasets where the ratio of inclusion to exclusion in studies was imbalanced (~1:3). When screening full-text literature using highly reliable prompts, GPT-4's performance was more robust, reaching “human-like” levels. Although our findings indicate that, currently, substantial caution should be exercised if LLMs are being used to conduct systematic reviews, they also offer preliminary evidence that, for certain review tasks delivered under specific conditions, LLMs can rival human performance.
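The correction for chance agreement and dataset imbalance that drives the abstract's headline result is standard: raw accuracy can look high on an imbalanced screening set even when chance-adjusted agreement is modest. A sketch with invented include/exclude labels (Cohen's kappa is one common chance-adjusted metric; the paper's exact adjustment may differ):

```python
import numpy as np

def cohens_kappa(a, b):
    """Chance-corrected agreement between two binary raters,
    e.g. model vs. human include/exclude decisions."""
    a, b = np.asarray(a), np.asarray(b)
    po = np.mean(a == b)                       # observed agreement
    pe = (np.mean(a) * np.mean(b)              # agreement expected by chance
          + np.mean(1 - a) * np.mean(1 - b))
    return (po - pe) / (1 - pe)

# Hypothetical imbalanced screening set (~1:3 include:exclude).
human = np.array([1] * 25 + [0] * 75)
model = np.array([1] * 15 + [0] * 10 + [0] * 70 + [1] * 5)

raw = np.mean(human == model)
print(round(raw, 2), round(cohens_kappa(human, model), 2))
```

Here raw agreement is 0.85 while kappa is only about 0.57, illustrating why the authors' adjusted scores dropped across all stages.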
Can large language models replace humans in systematic reviews? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages. Research Synthesis Methods, 15(4), 616–626. doi: 10.1002/jrsm.1715.
Meta-analysis is a useful tool in clinical research, as it combines the results of multiple clinical studies to improve precision when answering a particular scientific question. While there has been a substantial increase in publications using meta-analysis across clinical research topics, the number of published meta-analyses in metabolomics is significantly lower than in other omics disciplines. Metabolomics is the study of small chemical compounds in living organisms, which provides important insights into an organism's phenotype. However, the wide variety of compounds and the different experimental methods used in metabolomics make it challenging to perform a thorough meta-analysis. Additionally, there is a lack of consensus on reporting statistical estimates, and the high number of compound-name synonyms further complicates the process. Easy-Amanida is a new tool that combines two R packages, “amanida” and “webchem”, to enable meta-analysis of aggregate statistical data, such as p-values and fold-changes, while ensuring harmonization of compound names. The Easy-Amanida app is implemented in Shiny, an R package for building interactive web apps, and provides a workflow to optimize the name matching. This article describes all the steps to perform the meta-analysis using Easy-Amanida, including an illustrative example for interpreting the results. The use of aggregate statistical metrics extends the use of Easy-Amanida beyond the metabolomics field.
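Combining per-study p-values for the same compound into a single aggregate p-value is the core of this kind of aggregate-data meta-analysis. Fisher's method is one standard rule for this (shown here generically; it is not necessarily the exact aggregation rule Amanida implements):

```python
import math
from scipy import stats

def fisher_combine(pvals):
    """Fisher's method: -2 * sum(log p) follows a chi-squared distribution
    with 2k degrees of freedom under the null, for k independent p-values."""
    stat = -2.0 * sum(math.log(p) for p in pvals)
    return stats.chi2.sf(stat, df=2 * len(pvals))

# Hypothetical: one metabolite reported in three studies.
p_combined = fisher_combine([0.04, 0.10, 0.03])
print(p_combined)
```

Three individually borderline results combine to a clearly small aggregate p-value, which is the behaviour that makes such methods useful when only summary statistics are reported.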
Maria Llambrich, Pau Satorra, Eudald Correig, Josep Gumà, Jesús Brezmes, Cristian Tebé, Raquel Cumeras (2024). Easy-Amanida: An R Shiny application for the meta-analysis of aggregate results in clinical metabolomics using Amanida and Webchem. Research Synthesis Methods, 15(4), 687–699. doi: 10.1002/jrsm.1713.
The LFK index has been promoted as an improved method to detect bias in meta-analysis. Putatively, its performance does not depend on the number of studies in the meta-analysis. We conducted a simulation study, comparing the LFK index test to three standard tests for funnel plot asymmetry in settings with smaller or larger group sample sizes. In general, false positive rates of the LFK index test markedly depended on the number and size of studies as well as the between-study heterogeneity with values between 0% and almost 30%. Egger's test adhered well to the pre-specified significance level of 5% under homogeneity, but was too liberal (smaller groups) or conservative (larger groups) under heterogeneity. The rank test was too conservative for most simulation scenarios. The Thompson–Sharp test was too conservative under homogeneity, but adhered well to the significance level in case of heterogeneity. The true positive rate of the LFK index test was only larger compared with classic tests if the false positive rate was inflated. The power of classic tests was similar or larger than the LFK index test if the false positive rate of the LFK index test was used as significance level for the classic tests. Under ideal conditions, the false positive rate of the LFK index test markedly and unpredictably depends on the number and sample size of studies as well as the extent of between-study heterogeneity. The LFK index test in its current implementation should not be used to assess funnel plot asymmetry in meta-analysis.
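For reference, Egger's test, the best-behaved comparator under homogeneity in this simulation, is a weighted regression of the standardized effect on precision, with asymmetry indicated by a non-zero intercept. A self-contained sketch on simulated symmetric data (illustrative only, not the authors' simulation design):

```python
import numpy as np
from scipy import stats

def egger_test(effects, ses):
    """Egger's regression test for funnel-plot asymmetry: regress
    effect/SE on 1/SE; a non-zero intercept suggests small-study effects."""
    effects, ses = np.asarray(effects, float), np.asarray(ses, float)
    res = stats.linregress(1.0 / ses, effects / ses)
    t = res.intercept / res.intercept_stderr
    p = 2.0 * stats.t.sf(abs(t), df=len(effects) - 2)
    return res.intercept, p

# Simulated meta-analysis with a common true effect and no bias,
# so no asymmetry is expected on average.
rng = np.random.default_rng(1)
ses = rng.uniform(0.1, 0.5, size=20)
effects = rng.normal(0.3, ses)
intercept, p = egger_test(effects, ses)
print(round(intercept, 3), round(p, 3))
```

The simulation study's point is that, unlike this test's roughly nominal behaviour under homogeneity, the LFK index test's false positive rate drifts unpredictably with the number and size of studies.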
Guido Schwarzer, Gerta Rücker, Cristina Semaca (2024). LFK index does not reliably detect small-study effects in meta-analysis: A simulation study. Research Synthesis Methods, 15(4), 603–615. doi: 10.1002/jrsm.1714.
Ofir Harari, Mohsen Soltanifar, Joseph C. Cappelleri, Andre Verhoek, Mario Ouwens, Caitlin Daly, and Bart Heeg (2023) Network Meta-Interpolation: Effect modification adjustment in network meta-analysis using subgroup analyses. Research Synthesis Methods, 14: 211–233.
This is also the version that appears in the R code provided to the readers as part of the Supporting Information.
We apologize for these errors.
Correction to “Network Meta-Interpolation: Effect modification adjustment in network meta-analysis using subgroup analyses”. Research Synthesis Methods, 15(2), 369. doi: 10.1002/jrsm.1712.
Gerald Gartlehner, Leila Kahwati, Rainer Hilscher, Ian Thomas, Shannon Kugley, Karen Crotty, Meera Viswanathan, Barbara Nussbaumer-Streit, Graham Booth, Nathaniel Erskine, Amanda Konet, Robert Chew
Data extraction is a crucial, yet labor-intensive and error-prone part of evidence synthesis. To date, efforts to harness machine learning for enhancing efficiency of the data extraction process have fallen short of achieving sufficient accuracy and usability. With the release of large language models (LLMs), new possibilities have emerged to increase efficiency and accuracy of data extraction for evidence synthesis. The objective of this proof-of-concept study was to assess the performance of an LLM (Claude 2) in extracting data elements from published studies, compared with human data extraction as employed in systematic reviews. Our analysis utilized a convenience sample of 10 English-language, open-access publications of randomized controlled trials included in a single systematic review. We selected 16 distinct types of data, posing varying degrees of difficulty (160 data elements across 10 studies). We used the browser version of Claude 2 to upload the portable document format of each publication and then prompted the model for each data element. Across 160 data elements, Claude 2 demonstrated an overall accuracy of 96.3% with a high test–retest reliability (replication 1: 96.9%; replication 2: 95.0% accuracy). Overall, Claude 2 made 6 errors on 160 data items. The most common errors (n = 4) were missed data items. Importantly, Claude 2's ease of use was high; it required no technical expertise or labeled training data for effective operation (i.e., zero-shot learning). Based on findings of our proof-of-concept study, leveraging LLMs has the potential to substantially enhance the efficiency and accuracy of data extraction for evidence syntheses.
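The accuracy bookkeeping behind the reported figures is straightforward: each extracted value is scored against the human reference, and missed items count as errors. A sketch with placeholder labels chosen to reproduce the 6-errors-in-160 tally (the actual data elements are not shown here):

```python
# Hypothetical scoring: 160 items, of which 154 match the human reference,
# 4 were missed (None) and 2 were extracted incorrectly.
reference = ["ref"] * 160
extracted = ["ref"] * 154 + [None] * 4 + ["wrong"] * 2

accuracy = sum(r == e for r, e in zip(reference, extracted)) / len(reference)
errors = len(reference) - sum(r == e for r, e in zip(reference, extracted))
print(round(accuracy, 4), errors)
```

Test-retest reliability is then just the same calculation repeated on independent replications of the extraction run.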
{"title":"Data extraction for evidence synthesis using a large language model: A proof-of-concept study","authors":"Gerald Gartlehner, Leila Kahwati, Rainer Hilscher, Ian Thomas, Shannon Kugley, Karen Crotty, Meera Viswanathan, Barbara Nussbaumer-Streit, Graham Booth, Nathaniel Erskine, Amanda Konet, Robert Chew","doi":"10.1002/jrsm.1710","DOIUrl":"10.1002/jrsm.1710","url":null,"abstract":"<p>Data extraction is a crucial, yet labor-intensive and error-prone part of evidence synthesis. To date, efforts to harness machine learning for enhancing efficiency of the data extraction process have fallen short of achieving sufficient accuracy and usability. With the release of large language models (LLMs), new possibilities have emerged to increase efficiency and accuracy of data extraction for evidence synthesis. The objective of this proof-of-concept study was to assess the performance of an LLM (Claude 2) in extracting data elements from published studies, compared with human data extraction as employed in systematic reviews. Our analysis utilized a convenience sample of 10 English-language, open-access publications of randomized controlled trials included in a single systematic review. We selected 16 distinct types of data, posing varying degrees of difficulty (160 data elements across 10 studies). We used the browser version of Claude 2 to upload the portable document format of each publication and then prompted the model for each data element. Across 160 data elements, Claude 2 demonstrated an overall accuracy of 96.3% with a high test–retest reliability (replication 1: 96.9%; replication 2: 95.0% accuracy). Overall, Claude 2 made 6 errors on 160 data items. The most common errors (<i>n</i> = 4) were missed data items. Importantly, Claude 2's ease of use was high; it required no technical expertise or labeled training data for effective operation (i.e., zero-shot learning). 
Based on findings of our proof-of-concept study, leveraging LLMs has the potential to substantially enhance the efficiency and accuracy of data extraction for evidence syntheses.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"15 4","pages":"576-589"},"PeriodicalIF":5.0,"publicationDate":"2024-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/jrsm.1710","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140020433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
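The evaluation described in this abstract — scoring LLM extractions against human extraction and breaking errors down into missed versus incorrect items — can be sketched as follows. This is a minimal illustration, not the study's actual scoring code; the `score_extractions` helper and the tally of 160 items with 4 missed and 2 incorrect values are hypothetical data constructed to match the reported figures (96.3% accuracy, 6 errors):

```python
def score_extractions(llm_values, human_values):
    """Compare LLM-extracted data elements against a human gold standard.

    Returns overall accuracy, the count of missed items (model returned
    nothing, represented here as None), and the count of incorrect
    extractions (model returned a wrong value).
    """
    assert len(llm_values) == len(human_values)
    correct = missed = wrong = 0
    for llm, human in zip(llm_values, human_values):
        if llm == human:
            correct += 1
        elif llm is None:
            missed += 1
        else:
            wrong += 1
    return correct / len(llm_values), missed, wrong

# Hypothetical tally mirroring the reported results: 160 data elements,
# 6 errors, of which 4 were missed data items.
human = list(range(160))
llm = list(range(160))
for i in range(4):
    llm[i] = None   # missed data items
for i in range(4, 6):
    llm[i] = -1     # incorrect extractions
acc, missed, wrong = score_extractions(llm, human)
# acc = 0.9625, i.e. the 96.3% overall accuracy reported
```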
Stephan B. Bruns, Teshome K. Deressa, T. D. Stanley, Chris Doucouliagos, John P. A. Ioannidis
Using a sample of 70,399 published p-values from 192 meta-analyses, we empirically estimate the counterfactual distribution of p-values in the absence of any biases. Comparing observed p-values with counterfactually expected p-values allows us to estimate how many p-values are published as being statistically significant when they should have been published as non-significant. We estimate the extent of selectively reported p-values to range between 57.7% and 71.9% of the significant p-values. The counterfactual p-value distribution also allows us to assess shifts of p-values along the entire distribution of published p-values, revealing that particularly very small p-values (p < 0.001) are unexpectedly abundant in the published literature. Subsample analysis suggests that the extent of selective reporting is reduced in research fields that use experimental designs, analyze microeconomics research questions, and have at least some adequately powered studies.
{"title":"Estimating the extent of selective reporting: An application to economics","authors":"Stephan B. Bruns, Teshome K. Deressa, T. D. Stanley, Chris Doucouliagos, John P. A. Ioannidis","doi":"10.1002/jrsm.1711","DOIUrl":"10.1002/jrsm.1711","url":null,"abstract":"<p>Using a sample of 70,399 published <i>p</i>-values from 192 meta-analyses, we empirically estimate the counterfactual distribution of <i>p</i>-values in the absence of any biases. Comparing observed <i>p</i>-values with counterfactually expected <i>p</i>-values allows us to estimate how many <i>p</i>-values are published as being statistically significant when they should have been published as non-significant. We estimate the extent of selectively reported <i>p</i>-values to range between 57.7% and 71.9% of the significant <i>p</i>-values. The counterfactual <i>p</i>-value distribution also allows us to assess shifts of <i>p</i>-values along the entire distribution of published <i>p</i>-values, revealing that particularly very small <i>p</i>-values (<i>p</i> < 0.001) are unexpectedly abundant in the published literature. Subsample analysis suggests that the extent of selective reporting is reduced in research fields that use experimental designs, analyze microeconomics research questions, and have at least some adequately powered studies.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"15 4","pages":"590-602"},"PeriodicalIF":5.0,"publicationDate":"2024-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/jrsm.1711","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139911665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}