Pub Date: 2025-11-01 | Epub Date: 2025-07-22 | DOI: 10.1017/rsm.2025.10025
Isa Spiero, Artuur M Leeuwenberg, Karel G M Moons, Lotty Hooft, Johanna A A Damen
Systematic reviews (SRs) synthesize evidence through a rigorous, labor-intensive, and costly process. To accelerate the title-abstract screening phase of SRs, several artificial intelligence (AI)-based semi-automated screening tools have been developed to reduce workload by prioritizing relevant records. However, their performance is primarily evaluated for SRs of intervention studies, which generally have well-structured abstracts. Here, we evaluate whether screening tools are equally effective for SRs of prognosis studies, which show larger heterogeneity between abstracts. We conducted retrospective simulations on prognosis and intervention reviews using a screening tool (ASReview). We also evaluated the effects of review scope (i.e., breadth of the research question), number of (relevant) records, and modeling methods within the tool. Performance was assessed in terms of recall (i.e., sensitivity), precision at 95% recall (i.e., positive predictive value at 95% recall), and workload reduction (work saved over sampling at 95% recall [WSS@95%]). The WSS@95% was slightly worse for prognosis reviews (range: 0.324-0.597) than for intervention reviews (range: 0.613-0.895). Precision was higher for prognosis reviews (range: 0.115-0.400) than for intervention reviews (range: 0.024-0.057). These differences were primarily due to the larger number of relevant records in the prognosis reviews. The modeling methods and the scope of the prognosis review did not significantly impact tool performance. We conclude that the larger abstract heterogeneity of prognosis studies does not substantially affect the effectiveness of screening tools for SRs of prognosis. Further evaluation studies, including a standardized evaluation framework, are needed to enable prospective decisions on the reliable use of screening tools.
{"title":"Evaluation of semi-automated record screening methods for systematic reviews of prognosis studies and intervention studies.","authors":"Isa Spiero, Artuur M Leeuwenberg, Karel G M Moons, Lotty Hooft, Johanna A A Damen","doi":"10.1017/rsm.2025.10025","DOIUrl":"10.1017/rsm.2025.10025","url":null,"abstract":"<p><p>Systematic reviews (SRs) synthesize evidence through a rigorous, labor-intensive, and costly process. To accelerate the title-abstract screening phase of SRs, several artificial intelligence (AI)-based semi-automated screening tools have been developed to reduce workload by prioritizing relevant records. However, their performance is primarily evaluated for SRs of intervention studies, which generally have well-structured abstracts. Here, we evaluate whether screening tool performance is equally effective for SRs of prognosis studies that have larger heterogeneity between abstracts. We conducted retrospective simulations on prognosis and intervention reviews using a screening tool (ASReview). We also evaluated the effects of review scope (i.e., breadth of the research question), number of (relevant) records, and modeling methods within the tool. Performance was assessed in terms of recall (i.e., sensitivity), precision at 95% recall (i.e., positive predictive value at 95% recall), and workload reduction (work saved over sampling at 95% recall [WSS@95%]). The WSS@95% was slightly worse for prognosis reviews (range: 0.324-0.597) than for intervention reviews (range: 0.613-0.895). The precision was higher for prognosis (range: 0.115-0.400) compared to intervention reviews (range: 0.024-0.057). These differences were primarily due to the larger number of relevant records in the prognosis reviews. The modeling methods and the scope of the prognosis review did not significantly impact tool performance. We conclude that the larger abstract heterogeneity of prognosis studies does not substantially affect the effectiveness of screening tools for SRs of prognosis. Further evaluation studies including a standardized evaluation framework are needed to enable prospective decisions on the reliable use of screening tools.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 6","pages":"975-989"},"PeriodicalIF":6.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657655/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-01 | Epub Date: 2025-08-28 | DOI: 10.1017/rsm.2025.10023
Klas Moberg, Carl Gornitzki
Our objective was to evaluate the recall and number needed to read (NNR) for the Cochrane RCT Classifier compared to, and in combination with, established search filters developed for Ovid MEDLINE and Embase.com. A gold standard set of 1,103 randomized controlled trials (RCTs) was created to calculate recall for the Cochrane RCT Classifier in Covidence, the Cochrane sensitivity-maximizing RCT filter in Ovid MEDLINE, and the Cochrane Embase RCT filter for Embase.com. In addition, the classifier and the filters were validated in three case studies using reports from the Swedish Agency for Health Technology Assessment and Assessment of Social Services to assess the impact on search results and NNR. The Cochrane RCT Classifier had the highest recall at 99.64%, followed by the Cochrane sensitivity-maximizing RCT filter in Ovid MEDLINE at 98.73% and the Cochrane Embase RCT filter at 98.46%. However, the Cochrane RCT Classifier had a higher NNR than the RCT filters in all case studies. Combining the RCT filters with the Cochrane RCT Classifier reduced the NNR compared to using the RCT filters alone, while achieving a recall of 98.46% for the Ovid MEDLINE/RCT Classifier combination and 98.28% for the Embase/RCT Classifier combination. In conclusion, we found that the Cochrane RCT Classifier in Covidence has a higher recall than established search filters, but also a higher NNR. Thus, using the Cochrane RCT Classifier instead of current state-of-the-art RCT filters would increase the workload in the screening process. A viable option with a lower NNR than RCT filters, at the cost of a slight decrease in recall, is to combine the Cochrane RCT Classifier with RCT filters in database searches.
{"title":"Combining search filters for randomized controlled trials with the Cochrane RCT Classifier in Covidence: a methodological validation study.","authors":"Klas Moberg, Carl Gornitzki","doi":"10.1017/rsm.2025.10023","DOIUrl":"10.1017/rsm.2025.10023","url":null,"abstract":"<p><p>Our objective was to evaluate the recall and number needed to read (NNR) for the Cochrane RCT Classifier compared to and in combination with established search filters developed for Ovid MEDLINE and Embase.com. A gold standard set of 1,103 randomized controlled trials (RCTs) was created to calculate recall for the Cochrane RCT Classifier in Covidence, the Cochrane sensitivity-maximizing RCT filter in Ovid MEDLINE and the Cochrane Embase RCT filter for Embase.com. In addition, the classifier and the filters were validated in three case studies using reports from the Swedish Agency for Health Technology Assessment and Assessment of Social Services to assess impact on search results and NNR. The Cochrane RCT Classifier had the highest recall with 99.64% followed by the Cochrane sensitivity-maximizing RCT filter in Ovid MEDLINE with 98.73% and the Cochrane Embase RCT filter with 98.46%. However, the Cochrane RCT Classifier had a higher NNR than the RCT filters in all case studies. Combining the RCT filters with the Cochrane RCT Classifier reduced NNR compared to using the RCT filters alone while achieving a recall of 98.46% for the Ovid MEDLINE/RCT Classifier combination and 98.28% for the Embase/RCT Classifier combination. In conclusion, we found that the Cochrane RCT Classifier in Covidence has a higher recall than established search filters but also a higher NNR. Thus, using the Cochrane RCT Classifier instead of current state-of-the-art RCT filters would lead to an increased workload in the screening process. A viable option with a lower NNR than RCT filters, at the cost of a slight decrease in recall, is to combine the Cochrane RCT Classifier with RCT filters in database searches.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 6","pages":"953-960"},"PeriodicalIF":6.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657657/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-01 | DOI: 10.1017/rsm.2025.10024
Kazushi Maruo, Yusuke Yamaguchi, Ryota Ishii, Hisashi Noma, Masahiko Gosho
In meta-analyses of survival rates, precision information (i.e., standard errors (SEs) or confidence intervals) is often missing from clinical studies. In current practice, such studies are often excluded from the synthesis analyses. However, the naïve deletion of these incomplete data can produce serious biases and loss of precision in pooled estimators. To address these issues, we developed a simple but effective method to impute precision information using commonly available statistics from individual studies, such as sample size, number of events, and risk set size at a time point of interest. By applying this new method, we can effectively circumvent the deletion of incomplete data and the resultant biases and losses of precision. In extensive simulation studies, the developed method markedly improved the accuracy and precision of the pooled estimators compared to naïve analyses that delete studies with missing precision. Furthermore, the performance of the proposed method was not significantly inferior to the ideal case in which no precision information was missing. However, for studies for which the risk set size at the time of interest was not available, the proposed method runs the risk of overestimating the SE. Although the proposed method is a single-imputation method that does not account for the uncertainty of the imputed values, the simulations show no underestimation bias in the SE. To demonstrate its robustness, the proposed method was applied to a systematic review of radiotherapy data. An R package was developed to implement the proposed procedure.
{"title":"Simple imputation method for meta-analysis of survival rates when precision information is missing.","authors":"Kazushi Maruo, Yusuke Yamaguchi, Ryota Ishii, Hisashi Noma, Masahiko Gosho","doi":"10.1017/rsm.2025.10024","DOIUrl":"10.1017/rsm.2025.10024","url":null,"abstract":"<p><p>In meta-analyses of survival rates, precision information (i.e., standard errors (SEs) or confidence intervals) are often missing in clinical studies. In current practice, such studies are often excluded from the synthesis analyses. However, the naïve deletion of these incomplete data can produce serious biases and loss of precision in pooled estimators. To address these issues, we developed a simple but effective method to impute precision information using commonly available statistics from individual studies, such as sample size, number of events, and risk set size at a time point of interest. By applying this new method, we can effectively circumvent the deletion of incomplete data, resultant biases, and losses of precision. Based on extensive simulation studies, the developed method markedly improves the accuracy and precision of the pooled estimators compared to those of naïve analyses that delete studies with missing precision. Furthermore, the performance of the proposed method was not significantly inferior to the ideal case, where there was no missing precision information. However, for studies for which the risk set size at the time of interest was not available, the proposed method runs the risk of overestimating the SE. Although the proposed method is a single-imputation method, the simulations show that there is no underestimation bias of the SE, even though the proposed method does not consider the uncertainty of missing values. To demonstrate the robustness of our proposed methods, they were applied in a systematic review of radiotherapy data. An R package was developed to implement the proposed procedure.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 6","pages":"937-952"},"PeriodicalIF":6.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657670/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-01 | Epub Date: 2025-10-10 | DOI: 10.1017/rsm.2025.10034
Viet-Thi Tran, Carolina Grana Possamai, Isabelle Boutron, Philippe Ravaud
A critical step in systematic reviews involves the definition of a search strategy, with keywords and Boolean logic, to filter electronic databases. We hypothesize that it is possible to screen articles in electronic databases using large language models (LLMs) as an alternative to search equations. To investigate this, we compared two methods to identify randomized controlled trials (RCTs) in electronic databases: filtering databases using the Cochrane highly sensitive search, and an assessment by an LLM. We retrieved studies indexed in PubMed with a publication date between September 1 and September 30, 2024, using the sole keyword "diabetes." We compared the performance of the Cochrane highly sensitive search and the assessment of all titles and abstracts extracted directly from the database by GPT-4o-mini to identify RCTs. The reference standard was manual screening of the retrieved articles by two independent reviewers. The search retrieved 6377 records, of which 210 (3.5%) were primary reports of RCTs. The Cochrane highly sensitive search filtered 2197 records and missed one RCT (sensitivity 99.5%, 95% CI 97.4% to 100%; specificity 67.8%, 95% CI 66.6% to 68.9%). Assessment of all titles and abstracts from the electronic database by GPT filtered 1080 records and included all 210 primary reports of RCTs (sensitivity 100%, 95% CI 98.3% to 100%; specificity 85.9%, 95% CI 85.0% to 86.8%). LLMs can screen all articles in electronic databases to identify RCTs as an alternative to the Cochrane highly sensitive search. This calls for the evaluation of LLMs as an alternative to rigid search strategies.
{"title":"Using large language models to directly screen electronic databases as an alternative to traditional search strategies such as the Cochrane highly sensitive search for filtering randomized controlled trials in systematic reviews.","authors":"Viet-Thi Tran, Carolina Grana Possamai, Isabelle Boutron, Philippe Ravaud","doi":"10.1017/rsm.2025.10034","DOIUrl":"10.1017/rsm.2025.10034","url":null,"abstract":"<p><p>A critical step in systematic reviews involves the definition of a search strategy, with keywords and Boolean logic, to filter electronic databases. We hypothesize that it is possible to screen articles in electronic databases using large language models (LLMs) as an alternative to search equations. To investigate this matter, we compared two methods to identify randomized controlled trials (RCTs) in electronic databases: filtering databases using the Cochrane highly sensitive search and an assessment by an LLM.We retrieved studies indexed in PubMed with a publication date between September 1 and September 30, 2024 using the sole keyword \"diabetes.\" We compared the performance of the Cochrane highly sensitive search and the assessment of all titles and abstracts extracted directly from the database by GPT-4o-mini to identify RCTs. Reference standard was the manual screening of retrieved articles by two independent reviewers.The search retrieved 6377 records, of which 210 (3.5%) were primary reports of RCTs. The Cochrane highly sensitive search filtered 2197 records and missed one RCT (sensitivity 99.5%, 95% CI 97.4% to100%; specificity 67.8%, 95% CI 66.6% to 68.9%). Assessment of all titles and abstracts from the electronic database by GPT filtered 1080 records and included all 210 primary reports of RCTs (sensitivity 100%, 95% CI 98.3% to100%; specificity 85.9%, 95% CI 85.0% to 86.8%).LLMs can screen all articles in electronic databases to identify RCTs as an alternative to the Cochrane highly sensitive search. This calls for the evaluation of LLMs as an alternative to rigid search strategies.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 6","pages":"1035-1041"},"PeriodicalIF":6.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657644/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-01 | Epub Date: 2025-08-07 | DOI: 10.1017/rsm.2025.10027
Ronny Scherer, Diego G Campos
To synthesize evidence on the relations among multiple constructs, measures, or concepts, meta-analyzing correlation matrices across primary studies has become a crucial analytic approach. Common meta-analytic approaches employ univariate or multivariate models to estimate a pooled correlation matrix, which is subjected to further analyses, such as structural equation modeling. In practice, meta-analysts often extract multiple correlation matrices per study from various samples, study sites, labs, or countries, thus introducing hierarchical effect size multiplicity into the meta-analytic data. However, this feature has largely been ignored when pooling correlation matrices for meta-analysis. To contribute to the methodological development in this area, we describe a multilevel, multivariate, and random-effects modeling approach, which pools correlation matrices meta-analytically and, at the same time, addresses hierarchical effect size multiplicity. Specifically, it allows meta-analysts to test various assumptions on the dependencies among random effects, aiding the selection of a meta-analytic baseline model. We describe this approach, present four working models within it, and illustrate them with an example and the corresponding R code.
{"title":"Meta-analyzing correlation matrices in the presence of hierarchical effect size multiplicity.","authors":"Ronny Scherer, Diego G Campos","doi":"10.1017/rsm.2025.10027","DOIUrl":"10.1017/rsm.2025.10027","url":null,"abstract":"<p><p>To synthesize evidence on the relations among multiple constructs, measures, or concepts, meta-analyzing correlation matrices across primary studies has become a crucial analytic approach. Common meta-analytic approaches employ univariate or multivariate models to estimate a pooled correlation matrix, which is subjected to further analyses, such as structural equation modeling. In practice, meta-analysts often extract multiple correlation matrices per study from various samples, study sites, labs, or countries, thus introducing hierarchical effect size multiplicity into the meta-analytic data. However, this feature has largely been ignored when pooling correlation matrices for meta-analysis. To contribute to the methodological development in this area, we describe a multilevel, multivariate, and random-effects modeling approach, which pools correlation matrices meta-analytically and, at the same time, addresses hierarchical effect size multiplicity. Specifically, it allows meta-analysts to test various assumptions on the dependencies among random effects, aiding the selection of a meta-analytic baseline model. We describe this approach, present four working models within it, and illustrate them with an example and the corresponding R code.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 6","pages":"828-858"},"PeriodicalIF":6.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657669/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-01 | Epub Date: 2025-07-10 | DOI: 10.1017/rsm.2025.10021
Chengyang Gao, Anna Heath, Gianluca Baio
Background: Understanding the relative costs and effectiveness of all competing interventions is crucial to informing health resource allocations. However, to receive regulatory approval for efficacy, novel pharmaceuticals are typically only compared against placebo or standard of care. The relative efficacy against the best alternative intervention relies on indirect comparisons of different interventions. When treatment effect modifiers are distributed differently across trials, population adjustment is necessary to ensure a fair comparison. Matching-Adjusted Indirect Comparisons (MAIC) is the most widely adopted weighting-based method for this purpose. Nevertheless, MAIC can exhibit instability under poor population overlap. Regression-based approaches to overcome this issue are heavily dependent on parametric assumptions.
Methods: We introduce a novel method, 'G-MAIC,' which combines outcome regression and weighting-adjustment to address these limitations. Inspired by Bayesian survey inference, G-MAIC employs Bayesian bootstrap to propagate the uncertainty of population-adjusted estimates. We evaluate the performance of G-MAIC against standard non-adjusted methods, MAIC and Parametric G-computation, in a simulation study encompassing 18 scenarios with varying trial sample sizes, population overlaps, and covariate structures.
Results: Under poor overlap and small sample sizes, MAIC can produce non-sensible variance estimations or increased bias compared to non-adjusted methods, depending on covariate structures in the two trials compared. G-MAIC mitigates this issue, achieving comparable performance to parametric G-computation with reduced reliance on parametric assumptions.
Conclusion: G-MAIC presents a robust alternative to the widely adopted MAIC for population-adjusted indirect comparisons. The underlying framework is flexible such that it can accommodate advanced nonparametric outcome models and alternative weighting schemes.
{"title":"Regression augmented weighting adjustment for indirect comparisons in health decision modelling.","authors":"Chengyang Gao, Anna Heath, Gianluca Baio","doi":"10.1017/rsm.2025.10021","DOIUrl":"10.1017/rsm.2025.10021","url":null,"abstract":"<p><strong>Background: </strong>Understanding the relative costs and effectiveness of all competing interventions is crucial to informing health resource allocations. However, to receive regulatory approval for efficacy, novel pharmaceuticals are typically only compared against placebo or standard of care. The relative efficacy against the best alternative intervention relies on indirect comparisons of different interventions. When treatment effect modifiers are distributed differently across trials, population adjustment is necessary to ensure a fair comparison. Matching-Adjusted Indirect Comparisons (MAIC) is the most widely adopted weighting-based method for this purpose. Nevertheless, MAIC can exhibit instability under poor population overlap. Regression-based approaches to overcome this issue are heavily dependent on parametric assumptions.</p><p><strong>Methods: </strong>We introduce a novel method, 'G-MAIC,' which combines outcome regression and weighting-adjustment to address these limitations. Inspired by Bayesian survey inference, G-MAIC employs Bayesian bootstrap to propagate the uncertainty of population-adjusted estimates. We evaluate the performance of G-MAIC against standard non-adjusted methods, MAIC and Parametric G-computation, in a simulation study encompassing 18 scenarios with varying trial sample sizes, population overlaps, and covariate structures.</p><p><strong>Results: </strong>Under poor overlap and small sample sizes, MAIC can produce non-sensible variance estimations or increased bias compared to non-adjusted methods, depending on covariate structures in the two trials compared. G-MAIC mitigates this issue, achieving comparable performance to parametric G-computation with reduced reliance on parametric assumptions.</p><p><strong>Conclusion: </strong>G-MAIC presents a robust alternative to the widely adopted MAIC for population-adjusted indirect comparisons. The underlying framework is flexible such that it can accommodate advanced nonparametric outcome models and alternative weighting schemes.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 6","pages":"900-921"},"PeriodicalIF":6.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657667/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-01 | Epub Date: 2025-08-07 | DOI: 10.1017/rsm.2025.10028
Danni Xia, Honghao Lai, Weilong Zhao, Jiajie Huang, Jiayi Liu, Ziying Ye, Jianing Liu, Mingyao Sun, Liangying Hou, Bei Pan, Long Ge
This study aims to explore the feasibility and accuracy of utilizing large language models (LLMs) to assess the risk of bias (ROB) in cohort studies. We conducted a pilot and feasibility study in 30 cohort studies randomly selected from the reference lists of published Cochrane reviews. We developed a structured prompt to guide ChatGPT-4o, Moonshot-v1-128k, and DeepSeek-V3 in assessing the ROB of each cohort twice. We used the ROB results assessed by three evidence-based medicine experts as the gold standard, and then evaluated the accuracy of the LLMs by calculating the correct assessment rate, sensitivity, specificity, and F1 scores at the overall and item-specific levels. The consistency of the overall and item-specific assessment results was evaluated using Cohen's kappa (κ) and the prevalence-adjusted bias-adjusted kappa. Efficiency was estimated by the mean assessment time required. The three LLMs (ChatGPT-4o, Moonshot-v1-128k, and DeepSeek-V3) showed distinct performance across the eight assessment items. Overall accuracy was comparable (80.8%-83.3%). Moonshot-v1-128k showed superior sensitivity in population selection (0.92 versus ChatGPT-4o's 0.55, P < 0.001). In terms of F1 scores, Moonshot-v1-128k led in population selection (F = 0.80 versus ChatGPT-4o's 0.67, P = 0.004). ChatGPT-4o demonstrated the highest consistency (mean κ = 96.5%), with perfect agreement (100%) in outcome confidence. ChatGPT-4o was 97.3% faster per article (32.8 seconds versus 20 minutes manually) and outperformed Moonshot-v1-128k and DeepSeek-V3 by 47-50% in processing speed. The efficient and accurate assessment of ROB in cohort studies by ChatGPT-4o, Moonshot-v1-128k, and DeepSeek-V3 highlights the potential of LLMs to enhance the systematic review process.
{"title":"Assessing risk of bias of cohort studies with large language models.","authors":"Danni Xia, Honghao Lai, Weilong Zhao, Jiajie Huang, Jiayi Liu, Ziying Ye, Jianing Liu, Mingyao Sun, Liangying Hou, Bei Pan, Long Ge","doi":"10.1017/rsm.2025.10028","DOIUrl":"10.1017/rsm.2025.10028","url":null,"abstract":"<p><p>This study aims to explore the feasibility and accuracy of utilizing large language models (LLMs) to assess the risk of bias (ROB) in cohort studies. We conducted a pilot and feasibility study in 30 cohort studies randomly selected from reference lists of published Cochrane reviews. We developed a structured prompt to guide the ChatGPT-4o, Moonshot-v1-128k, and DeepSeek-V3 to assess the ROB of each cohort twice. We used the ROB results assessed by three evidence-based medicine experts as the gold standard, and then we evaluated the accuracy of LLMs by calculating the correct assessment rate, sensitivity, specificity, and <i>F</i>1 scores for overall and item-specific levels. The consistency of the overall and item-specific assessment results was evaluated using Cohen's kappa (κ) and prevalence-adjusted bias-adjusted kappa. Efficiency was estimated by the mean assessment time required. This study assessed three LLMs (ChatGPT-4o, Moonshot-v1-128k, and DeepSeek-V3) and revealed distinct performance across eight assessment items. Overall accuracy was comparable (80.8%-83.3%). Moonshot-v1-128k showed superior sensitivity in population selection (0.92 versus ChatGPT-4o's 0.55, <i>P</i> < 0.001). In terms of <i>F</i>1 scores, Moonshot-v1-128k led in population selection (<i>F</i> = 0.80 versus ChatGPT-4o's 0.67, <i>P</i> = 0.004). ChatGPT-4o demonstrated the highest consistency (mean κ = 96.5%), with perfect agreement (100%) in outcome confidence. ChatGPT-4o was 97.3% faster per article (32.8 seconds versus 20 minutes manually) and outperformed Moonshot-v1-128k and DeepSeek-V3 by 47-50% in processing speed. The efficient and accurate assessment of ROB in cohort studies by ChatGPT-4o, Moonshot-v1-128k, and DeepSeek-V3 highlights the potential of LLMs to enhance the systematic review process.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 6","pages":"990-1004"},"PeriodicalIF":6.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657654/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-01 | Epub Date: 2025-07-10 | DOI: 10.1017/rsm.2025.10022
Yajie Duan, Thomas Mathew, Demissie Alemayehu, Ge Cheng
Random-effects meta-analyses with only a few studies often face challenges in accurately estimating between-study heterogeneity, leading to biased effect estimates and confidence intervals with poor coverage. This issue is especially pronounced when dealing with rare diseases. To address this problem for normally distributed outcomes, two new approaches have been proposed to provide confidence limits for the global mean: one based on fiducial inference, and the other involving two modifications of the signed log-likelihood ratio test statistic designed to improve performance with small numbers of studies. The performance of the proposed methods was evaluated numerically and compared with the Hartung-Knapp-Sidik-Jonkman approach and its modification for handling small numbers of studies. The simulation results indicated that the proposed methods achieved coverage probabilities closer to the nominal level and produced shorter confidence intervals compared to those based on existing methods. Two real examples are used to illustrate the proposed methods.
{"title":"Novel approaches for random-effects meta-analysis of a small number of studies under normality.","authors":"Yajie Duan, Thomas Mathew, Demissie Alemayehu, Ge Cheng","doi":"10.1017/rsm.2025.10022","DOIUrl":"10.1017/rsm.2025.10022","url":null,"abstract":"<p><p>Random-effects meta-analyses with only a few studies often face challenges in accurately estimating between-study heterogeneity, leading to biased effect estimates and confidence intervals with poor coverage. This issue is especially the case when dealing with rare diseases. To address this problem for normally distributed outcomes, two new approaches have been proposed to provide confidence limits of the global mean: one based on fiducial inference, and the other involving two modifications of the signed log-likelihood ratio test statistic in order to have improved performance with small numbers of studies. The performance of the proposed methods was evaluated numerically and compared with the Hartung-Knapp-Sidik-Jonkman approach and its modification to handle small numbers of studies. The simulation results indicated that the proposed methods achieved coverage probabilities closer to the nominal level and produced shorter confidence intervals compared to those based on existing methods. Two real examples are used to illustrate the proposed methods.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 6","pages":"922-936"},"PeriodicalIF":6.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657671/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-01 | Epub Date: 2025-06-23 | DOI: 10.1017/rsm.2025.10019
Gena Nelson, Sarah Quinn, Sean Grant, Shaina D Trevino, Elizabeth Day, Maria Schweer-Collins, Hannah Carter, Peter Boedeker, Emily Tanner-Smith
Study coding is an essential component of the research synthesis process. Data extracted during study coding serve as a direct link between the included studies and the synthesis results, allowing reviewers to justify claims about the findings from a set of related studies. The purpose of this tutorial is to provide authors, particularly those new to research synthesis, with recommendations to develop study coding manuals and forms that result in efficient, high-quality data extraction. Each of the 10 easy-to-follow practices is supported with additional resources, examples, or non-examples to help authors develop high-quality study coding materials. With the increase in publication of meta-analyses in recent years across many disciplines, a primary goal of this article is to enhance the quality of study coding materials that authors develop.
{"title":"Ten practices for successful study coding in research syntheses: Developing coding manuals and coding forms.","authors":"Gena Nelson, Sarah Quinn, Sean Grant, Shaina D Trevino, Elizabeth Day, Maria Schweer-Collins, Hannah Carter, Peter Boedeker, Emily Tanner-Smith","doi":"10.1017/rsm.2025.10019","DOIUrl":"10.1017/rsm.2025.10019","url":null,"abstract":"<p><p>Study coding is an essential component of the research synthesis process. Data extracted during study coding serve as a direct link between the included studies and the synthesis results, allowing reviewers to justify claims about the findings from a set of related studies. The purpose of this tutorial is to provide authors, particularly those new to research synthesis, with recommendations to develop study coding manuals and forms that result in efficient, high-quality data extraction. Each of the 10 easy-to-follow practices is supported with additional resources, examples, or non-examples to help authors develop high-quality study coding materials. With the increase in publication of meta-analyses in recent years across many disciplines, a primary goal of this article is to enhance the quality of study coding materials that authors develop.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"16 5","pages":"709-728"},"PeriodicalIF":6.1,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12527492/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}