Evaluating the Performance of ChatGPT-4o in Risk of Bias Assessments

Ilari Kuitunen, Ville T. Ponkilainen, Rasmus Liukkonen, Lauri Nyrhi, Oskari Pakarinen, Matias Vaajala, Mikko M. Uimonen

Journal of Evidence-Based Medicine, 17(4), 700–702. Published December 15, 2024. DOI: 10.1111/jebm.12662
Abstract
Systematic reviews and meta-analyses are a key part of evidence synthesis and are considered to provide the best possible information on intervention effectiveness [1]. A central step in evidence synthesis is the critical appraisal of the included studies [2]. The risk of bias is typically assessed using Cochrane's original risk of bias tool or the revised risk of bias 2.0 tool, both of which are outcome-specific tools for randomized controlled trials (RCTs) [3, 4]. Risk of bias assessments are time-consuming in evidence synthesis projects [5]. Additionally, they have been shown to be susceptible to biases, even in top-tier medical journals and Cochrane reviews [6-8]. Interrater agreement has also been shown to vary between reviewers [9]. Therefore, there is a clear need for improvements in both the quality and efficiency of these evaluations.
The rise of large language models, such as OpenAI's ChatGPT, has led to an increase in their use in research. While challenges such as authorship disputes and data fabrication have arisen, these tools show great promise when used appropriately [10]. Two previous studies have evaluated the performance of ChatGPT in risk of bias assessments [11, 12]. One focused on the ROBINS-I tool and found rather low agreement with human reviewers [11]. Another small study focused on the risk of bias (RoB) 2.0 tool and concluded that ChatGPT should not currently be used for this purpose, although further studies would be needed [12]. The aim of our study was to evaluate the performance of the most recent version of OpenAI's large language model, ChatGPT-4o, in risk of bias assessment.
We conducted a systematic assessment of the performance of ChatGPT-4o in Cochrane's RoB 2.0 tool analyses. First, we searched PubMed on July 31, 2024 for the 50 most recent meta-analyses published in top-level medical journals (Lancet, JAMA, or BMJ). The results were uploaded to Covidence software for screening. Two authors (IK and OP) then screened the reviews and included meta-analyses of interventions that included only RCTs and had used the Cochrane RoB 2.0 tool for risk of bias assessment. A total of six reviews were included (Figure S1). A third author (MV) then extracted a total of 100 risk of bias assessments from these included reviews. A fourth author (LN) uploaded these 100 studies in PDF format to ChatGPT-4o with a standardized short prompt entered into the text field. The prompt was: “Perform a risk of bias analysis according to the Cochrane group RoB2 guidelines for the following article and perform the assessment for the main outcome of the trial. Report results only as high, some concerns, low, no information for domains 1–5 and an overall assessment.” The complete list of the included RCTs, the extracted risk of bias assessments, and the ChatGPT-4o assessments is provided in the Supplementary Material.
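Because the prompt constrains the model to a fixed rating vocabulary, replies of this kind can be tabulated automatically. The helper below is a hypothetical sketch, not part of the published study (which used the ChatGPT web interface); it assumes the model answers with lines such as "Domain 1: Low" and "Overall: Some concerns".

```python
import re

# Hypothetical helper (not from the study): parse a constrained-format
# ChatGPT-4o reply into a dict of item -> rating so that 100 replies
# can be tabulated for the agreement analysis.
RATINGS = ("high", "some concerns", "low", "no information")

def parse_rob2_reply(reply: str) -> dict:
    """Extract RoB 2.0 domain and overall ratings from a model reply."""
    pattern = re.compile(
        r"(?P<item>Domain\s*[1-5]|Overall)\s*(?:assessment)?\s*[:\-]\s*"
        r"(?P<rating>high|some concerns|low|no information)",
        re.IGNORECASE,
    )
    results = {}
    for m in pattern.finditer(reply):
        results[m.group("item").title()] = m.group("rating").lower()
    return results

reply = """Domain 1: Low
Domain 2: Some concerns
Domain 3: Low
Domain 4: Low
Domain 5: Some concerns
Overall: Some concerns"""
print(parse_rob2_reply(reply))
```

A parser like this also makes deviations from the requested format visible: any reply yielding fewer than six ratings would need manual review.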
Finally, a fifth author (MU) performed the interrater agreement analyses, comparing the ChatGPT-4o assessments to those extracted from the published reviews. We calculated weighted Fleiss’ kappa estimates with 95% confidence intervals (CIs). Statistical analyses were performed in R version 4.2.1. A protocol was pre-registered in the Open Science Framework (10.17605/OSF.IO/J67W4).
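As an illustration of the agreement statistic (the study itself used R), the sketch below computes a linearly weighted kappa for two raters over ordered rating categories; for two raters this reduces to weighted Cohen's kappa. The three-category scale and the example ratings are invented for illustration and omit the "no information" category.

```python
import numpy as np

# Illustrative weighted kappa for two raters over ordered categories.
# Not the study's exact procedure; categories and ratings are invented.
CATEGORIES = ["low", "some concerns", "high"]

def weighted_kappa(rater1, rater2, categories=CATEGORIES):
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(rater1)
    # Observed joint proportion matrix.
    obs = np.zeros((k, k))
    for a, b in zip(rater1, rater2):
        obs[idx[a], idx[b]] += 1 / n
    # Chance-expected proportions from the marginals.
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    # Linear disagreement weights |i - j| between category ranks.
    w = np.abs(np.subtract.outer(np.arange(k), np.arange(k)))
    return 1 - (w * obs).sum() / (w * exp).sum()

human = ["low", "low", "some concerns", "high", "low", "some concerns"]
model = ["low", "low", "low", "some concerns", "low", "low"]
print(round(weighted_kappa(human, model), 3))  # prints 0.25
```

Linear weights penalize a low-vs-high disagreement twice as heavily as a low-vs-some-concerns disagreement, which matches the ordered nature of the RoB 2.0 judgements.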
A total of 100 RCTs were included in the analysis. The weighted kappa for the overall risk of bias assessment of the primary outcome was 0.24 (95% CI 0.10 to 0.37). The domain-specific agreement was highest for bias arising from the randomization process (kappa = 0.31, 95% CI 0.10 to 0.50) and lowest for bias in selection of the reported results (kappa = –0.11, 95% CI –0.16 to –0.04). For bias due to missing outcome data the kappa was 0.12 (95% CI 0.01 to 0.22); for bias due to deviations from the intended interventions, 0.06 (95% CI –0.12 to 0.23); and for bias in measurement of the outcome, –0.03 (95% CI –0.06 to 0.00). When comparing the given ratings, ChatGPT-4o labeled only three studies as high risk of bias in a single domain and classified none of the studies as having an overall high risk of bias (Figure 1).
Our study found that ChatGPT-4o showed slight agreement in the overall assessment and slight agreement in the bias due to randomization domain, whereas the other domains ranged from no agreement to poor agreement. Our results are in line with a previously published smaller report on ChatGPT-4o performance in RoB 2.0 assessment [12]. Unlike that study, we did not perform the reference risk of bias assessments ourselves but extracted them from published reviews. Furthermore, we included eight times more studies, which increased the validity of our results. Another notable difference was that we applied a standardized prompt systematically across all evaluations, which ensured consistency between assessments.
The primary limitation of our study was the use of a relatively simple prompt, as the performance of large language models is influenced by prompt quality. Nevertheless, this reflects a realistic scenario in which users of ChatGPT-4o might apply it to critical appraisal without extensive knowledge of prompt engineering or awareness of its potential weaknesses. Another limitation concerns our choice of comparator: we initially intended to compare our findings with assessments from Cochrane reviews. However, most Cochrane reviews still employed the original RoB tool [13], whereas the RoB 2.0 tool was more commonly used in high-impact medical journals, so we changed the original plan and extracted the assessments from top journals [5]. Finally, interrater agreement was not reported in the included reviews, so direct comparisons between agreement levels could not be made.
Future studies should examine whether the performance of ChatGPT-4o could be enhanced by better prompting or by providing guidance through worked examples of risk of bias scenarios. Another focus could be how alternative large language models perform compared with human assessment, and how different large language models compare against each other. Furthermore, an interesting idea to test would be whether a combination of humans and several large language models reaching a majority-vote consensus on the risk of bias assessments would improve quality and agreement.
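The majority-vote idea can be sketched in a few lines. This is a toy illustration only, with invented rater names and a conservative tie-breaking rule that is our own assumption, not something proposed in the literature cited here.

```python
from collections import Counter

# Toy sketch of combining independent RoB 2.0 ratings from humans and
# several language models into a consensus label. Rater names and the
# tie-break rule are invented for illustration.
def majority_consensus(ratings, tie_break="some concerns"):
    """Return the most common rating; ties fall back to a conservative default."""
    counts = Counter(ratings).most_common()
    top, top_n = counts[0]
    if len(counts) > 1 and counts[1][1] == top_n:
        return tie_break  # no clear majority
    return top

panel = {"human": "low", "model_a": "low", "model_b": "some concerns"}
print(majority_consensus(panel.values()))  # prints low
```

Whether such an ensemble actually improves agreement with expert consensus is exactly the empirical question future studies would need to answer.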
We found that the performance of ChatGPT-4o in risk of bias analyses was poor, mainly because its assessments were too lenient. The findings of this study indicate that ChatGPT-4o is not suitable for risk of bias assessments with simple prompts.