ChatGPT-4o in Risk-of-Bias Assessments in Neonatology: A Validity Analysis.

Neonatology (IF 3). Pub Date: 2025-01-01. Epub Date: 2025-02-25. DOI: 10.1159/000544857
Ilari Kuitunen, Lauri Nyrhi, Daniele De Luca
Neonatology, pages 360-365. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12129414/pdf/

Abstract

Introduction: Only a few studies have addressed the potential of large language models (LLMs) in risk-of-bias assessments, and their results have varied. This study aimed to analyze how well ChatGPT performs in risk-of-bias assessments of neonatal studies.

Methods: We searched for all Cochrane neonatal intervention reviews published in 2024 and extracted all of their risk-of-bias assessments. The full reports were then retrieved and uploaded to ChatGPT-4o together with the guidance for performing an original Cochrane risk-of-bias analysis. Concordance between the original assessment and the assessment provided by ChatGPT-4o was evaluated with intraclass correlation coefficients and Cohen's kappa statistics (with 95% confidence intervals) for each risk-of-bias domain and for the overall assessment.
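As an illustration of the agreement statistic named in the Methods, the following is a minimal sketch of unweighted Cohen's kappa with an approximate 95% confidence interval. The function name `cohen_kappa`, the rating labels, and the example ratings are hypothetical and are not taken from the study; the abstract does not state which software or variance estimator the authors used, so the simple large-sample standard error below is an assumption.

```python
from collections import Counter
from math import sqrt


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa with an approximate 95% CI.

    Uses the simple large-sample standard error
    sqrt(p_o * (1 - p_o) / n) / (1 - p_e); this is an assumption,
    not necessarily the estimator used in the study.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)

    # Observed agreement: share of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected chance agreement from each rater's marginal frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    categories = set(count_a) | set(count_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")

    kappa = (p_o - p_e) / (1 - p_e)
    se = sqrt(p_o * (1 - p_o) / n) / (1 - p_e)
    return kappa, (kappa - 1.96 * se, kappa + 1.96 * se)


# Hypothetical judgments for one domain across eight studies.
reviewers = ["low", "low", "high", "unclear", "low", "high", "unclear", "low"]
model = ["low", "high", "high", "unclear", "low", "low", "unclear", "low"]

kappa, (ci_low, ci_high) = cohen_kappa(reviewers, model)
print(f"kappa = {kappa:.2f}, 95% CI: {ci_low:.2f} to {ci_high:.2f}")
```

With so few items the interval is very wide and can extend past 1; samples of the size reported in the study give much narrower intervals.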

Results: A total of 61 randomized studies from 9 reviews were analyzed, and 427 judgments were compared. The overall κ was 0.43 (95% CI: 0.35 to 0.51) and the overall intraclass correlation coefficient was 0.65 (95% CI: 0.59 to 0.70). Cohen's κ was also assessed for each domain: the best agreement was observed in allocation concealment (κ = 0.73, 95% CI: 0.55 to 0.90), whereas the poorest agreement was found in incomplete outcome data (κ = -0.03, 95% CI: -0.07 to 0.02).
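To put the reported values in context, the sketch below checks the judgment count (427 judgments over 61 studies works out to 7 per study) and maps the reported κ values onto the Landis and Koch descriptive bands. The abstract does not say which interpretation scale the authors applied, so using Landis and Koch here is an assumption, and the function name `landis_koch_label` is hypothetical.

```python
def landis_koch_label(kappa):
    """Map a kappa value to the Landis and Koch (1977) descriptive band.

    Assumption: the abstract does not name an interpretation scale;
    this commonly cited convention is used only for illustration.
    """
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Values as reported in the Results paragraph.
reported = {
    "overall": 0.43,
    "allocation concealment": 0.73,
    "incomplete outcome data": -0.03,
}

# 427 judgments across 61 studies: quotient and remainder per study.
per_study, remainder = divmod(427, 61)
print(f"judgments per study: {per_study} (remainder {remainder})")

for name, kappa in reported.items():
    print(f"{name}: kappa = {kappa:+.2f} -> {landis_koch_label(kappa)}")
```

Descriptive bands like these are conventions rather than thresholds for acceptability; whether a given level of agreement is sufficient for risk-of-bias work is the judgment the authors make in their conclusion.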

Conclusion: ChatGPT-4o failed to achieve sufficient agreement in the risk-of-bias assessments. Future studies should examine whether other LLMs perform better and whether the agreement achieved by ChatGPT-4o could be improved by better prompting. Currently, the use of ChatGPT-4o in risk-of-bias assessments should not be promoted.
