Evaluating the Performance of ChatGPT-4o in Risk of Bias Assessments

Journal of Evidence‐Based Medicine, Vol. 17, No. 4, pp. 700–702 · IF 3.6 · Q1, Medicine, General & Internal · Pub Date: 2024-12-15 · DOI: 10.1111/jebm.12662
Ilari Kuitunen, Ville T. Ponkilainen, Rasmus Liukkonen, Lauri Nyrhi, Oskari Pakarinen, Matias Vaajala, Mikko M. Uimonen

Abstract

Systematic reviews and meta-analyses are a key part of evidence synthesis and are considered to provide the best possible information on intervention effectiveness [1]. A key part of evidence synthesis is the critical appraisal of the included studies [2]. Risk of bias is typically assessed with Cochrane's original risk of bias tool or the revised risk of bias (RoB) 2.0 tool, both of which are outcome-specific tools for randomized controlled trials (RCTs) [3, 4]. Risk of bias assessments are time-consuming in evidence synthesis projects [5]. Additionally, they have been shown to be susceptible to biases, even in top-tier medical journals and Cochrane reviews [6-8]. Interrater agreement has also been shown to vary between reviewers [9]. Therefore, there is a clear need to improve both the quality and the efficiency of these evaluations.

The rise of large language models, such as OpenAI's ChatGPT, has led to their increasing use in research. While challenges such as authorship disputes and data fabrication have arisen, these tools show great promise when used appropriately [10]. Two previous studies have evaluated the performance of ChatGPT in risk of bias assessments [11, 12]. One focused on the ROBINS-I tool and found rather low agreement [11]. Another, smaller study focused on the risk of bias (RoB) 2.0 tool and concluded that ChatGPT should not currently be used, although further studies are needed [12]. The aim of our study was to evaluate the performance of the most recent version of OpenAI's large language model, ChatGPT-4o, in risk of bias assessment.

We conducted a systematic assessment of the performance of ChatGPT-4o in Cochrane RoB 2.0 analyses. First, we searched PubMed on July 31, 2024 for the 50 most recent meta-analyses published in top-level medical journals (Lancet, JAMA, or BMJ). The results were uploaded to Covidence software for screening. Two authors (IK and OP) then screened the reviews and included meta-analyses of interventions that included only RCTs and had used the Cochrane RoB 2.0 tool for risk of bias assessment. A total of six reviews were included (Figure S1). A third author (MV) extracted a total of 100 risk of bias assessments from these reviews. A fourth author (LN) uploaded the 100 studies in PDF format to ChatGPT-4o with a standardized short prompt entered in the text field. The prompt was: “Perform a risk of bias analysis according to the Cochrane group RoB2 guidelines for the following article and perform the assessment for the main outcome of the trial. Report results only as high, some concerns, low, no information for domains 1–5 and an overall assessment.” The complete list of included RCTs, the extracted risk of bias assessments, and the ChatGPT-4o assessments is provided in the Supplementary Material.
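The prompt constrains ChatGPT-4o to four possible ratings per domain plus an overall judgment, which makes the replies easy to tabulate. The sketch below assumes a hypothetical "Domain n: rating" reply layout and a `parse_rob2_reply` helper; neither is part of the published workflow, and the example only illustrates how such constrained replies could be normalized for analysis.

```python
import re

# The four judgments the prompt permits for each RoB 2.0 domain.
VALID = {"high", "some concerns", "low", "no information"}

def parse_rob2_reply(reply: str) -> dict:
    """Extract domain 1-5 and overall ratings from a constrained reply.

    Assumes a hypothetical "Domain n: rating" line format; lines that do
    not match the format, or carry an invalid rating, are ignored.
    """
    ratings = {}
    for line in reply.splitlines():
        m = re.match(r"\s*(Domain\s*[1-5]|Overall)\s*:\s*(.+)", line, re.I)
        if not m:
            continue
        key = m.group(1).title()
        value = m.group(2).strip().lower()
        if value in VALID:
            ratings[key] = value
    return ratings

# Made-up reply in the assumed format (not actual model output).
reply = """Domain 1: low
Domain 2: some concerns
Domain 3: low
Domain 4: low
Domain 5: some concerns
Overall: some concerns"""
print(parse_rob2_reply(reply))
```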

Finally, a fifth author (MU) performed the interrater agreement analyses, in which we compared ChatGPT-4o's performance with the assessments extracted from the published reviews. We calculated weighted Fleiss’ kappa estimates with 95% confidence intervals (CIs). Statistical analyses were performed in R version 4.2.1. The protocol was preregistered in the Open Science Framework (10.17605/OSF.IO/J67W4).
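The chance-corrected agreement statistic can be illustrated with a self-contained sketch: a linearly weighted kappa for two raters over the ordered RoB 2.0 judgment scale. This is an illustrative Python implementation only (the study computed weighted Fleiss' kappa in R), and the example ratings are made up, not study data.

```python
def weighted_kappa(rater_a, rater_b, categories):
    """Linearly weighted kappa for two raters over ordered categories.

    Illustrative sketch of chance-corrected agreement; the study itself
    used weighted Fleiss' kappa in R.
    """
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    # Disagreement weights: 0 on the diagonal, growing with category distance.
    w = [[abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    # Observed joint rating proportions.
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        obs[idx[a]][idx[b]] += 1 / n
    # Expected proportions if the two raters' marginals were independent.
    pa = [sum(row) for row in obs]
    pb = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    observed = sum(w[i][j] * obs[i][j] for i in range(k) for j in range(k))
    expected = sum(w[i][j] * pa[i] * pb[j] for i in range(k) for j in range(k))
    return 1 - observed / expected

# Made-up example ratings over the RoB 2.0 judgment scale.
cats = ["low", "some concerns", "high"]
reviewers = ["low", "low", "some concerns", "high"]
model = ["low", "some concerns", "some concerns", "high"]
print(round(weighted_kappa(reviewers, model, cats), 2))
```

Perfect agreement yields kappa = 1, and values near or below 0 indicate agreement no better than chance, which is how the domain-level estimates reported below should be read.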

A total of 100 RCTs were included in the analysis. The weighted kappa for the overall risk of bias assessment for the primary outcome was 0.24 (95% CI 0.10 to 0.37). Domain-specific agreement was highest for bias arising from the randomization process (kappa = 0.31, 95% CI 0.10 to 0.50) and lowest for bias in selection of the reported results (kappa = –0.11, 95% CI –0.16 to –0.04). For bias due to missing outcome data the kappa was 0.12 (95% CI 0.01 to 0.22), for bias due to deviations from the intended interventions 0.06 (95% CI –0.12 to 0.23), and for bias in measurement of the outcome –0.03 (95% CI –0.06 to 0.00). When comparing the given ratings, ChatGPT-4o labeled only three studies as high risk of bias in a single domain and classified none of the studies as having an overall high risk of bias (Figure 1).

Our study found that ChatGPT-4o showed slight agreement for the overall assessment and for the bias arising from the randomization process domain, whereas the other domains ranged from no agreement to poor agreement. Our results were in line with the previously published, smaller report of ChatGPT-4o performance in RoB 2.0 assessment [12]. Unlike that study, we did not perform the risk of bias assessments ourselves. Furthermore, we included eight times as many studies, which increased the validity of our results. Another notable difference was that we used a standardized prompt systematically for all evaluations, which improved the consistency of the assessments.

The primary limitation of our study was the use of relatively simple prompts, as the performance of large language models is influenced by prompt quality. Nevertheless, this reflects a realistic scenario in which users of ChatGPT-4o might apply it to critical appraisal without extensive knowledge of prompt engineering or awareness of its potential weaknesses. Another limitation was that, although we initially intended to compare our findings with assessments from Cochrane reviews, most Cochrane reviews still employed the original RoB tool [13], whereas the RoB 2.0 tool was more commonly used in high-impact medical journals; we therefore changed the original plan and extracted the assessments from top journals [5]. Furthermore, interrater agreements were not reported in the included reviews, so direct comparisons between agreements could not be made.

Future studies should examine whether the performance of ChatGPT-4o could be enhanced by better prompting or by providing guidance through examples of risk of bias scenarios. Another focus could be how alternative large language models perform compared with human assessment, and how different large language models compare against each other. Furthermore, an interesting idea to test would be whether a combination of humans and different large language models, aiming to reach a majority-voting consensus on the risk of bias assessments, would improve the quality and the agreement.

We found that the performance of ChatGPT-4o in the risk of bias analyses was poor, mainly because its assessments were too optimistic. The findings of this study highlight that ChatGPT-4o is not suitable for risk of bias assessments with simple prompts.

The authors declare no conflicts of interest.
