Solving Complex Pediatric Surgical Case Studies: A Comparative Analysis of Copilot, ChatGPT-4, and Experienced Pediatric Surgeons' Performance.

IF 1.4 | Zone 3 (Medicine) | JCR Q2 (Pediatrics) | European Journal of Pediatric Surgery | Pub Date: 2025-10-01 | Epub Date: 2025-03-05 | DOI: 10.1055/a-2551-2131
Richard Gnatzy, Martin Lacher, Michael Berger, Michael Boettcher, Oliver J Deffaa, Joachim Kübler, Omid Madadi-Sanjani, Illya Martynov, Steffi Mayer, Mikko P Pakarinen, Richard Wagner, Tomas Wester, Augusto Zani, Ophelia Aubert
Pages: 382-389. Publication type: Journal Article.

Abstract

The emergence of large language models (LLMs) has led to notable advancements across multiple sectors, including medicine. Yet their impact on pediatric surgery remains largely unexplored. This study aims to assess the ability of the artificial intelligence (AI) models ChatGPT-4 and Microsoft Copilot to propose diagnostic procedures and primary and differential diagnoses, and to answer clinical questions, using complex clinical case vignettes of classic pediatric surgical diseases.

We conducted the study in April 2024. We evaluated the performance of the LLMs using 13 complex clinical case vignettes of pediatric surgical diseases and compared their responses with those of a human cohort of experienced pediatric surgeons. Additionally, pediatric surgeons rated the diagnostic recommendations of the LLMs for completeness and accuracy. To determine differences in performance, we performed statistical analyses.

ChatGPT-4 achieved a higher test score (52.1%) than Copilot (47.9%) but a lower score than the pediatric surgeons (68.8%). Overall differences in performance between ChatGPT-4, Copilot, and pediatric surgeons were statistically significant (p < 0.01). ChatGPT-4 outperformed Copilot in generating differential diagnoses (p < 0.05). No statistically significant differences were found between the AI models regarding suggestions for diagnostics and primary diagnosis. Overall, the recommendations of the LLMs were rated as average by pediatric surgeons.

This study reveals significant limitations in the performance of AI models in pediatric surgery. Although LLMs exhibit potential across various areas, their reliability and accuracy in handling clinical decision-making tasks are limited. Further research is needed to improve AI capabilities and establish its usefulness in the clinical setting.

Source journal: European Journal of Pediatric Surgery
CiteScore: 3.90
Self-citation rate: 5.60%
Articles published: 66
Review time: 6-12 weeks
About the journal: This broad-based international journal updates you on vital developments in pediatric surgery through original articles, abstracts of the literature, and meeting announcements. You will find state-of-the-art information on abdominal and thoracic surgery, neurosurgery, urology, gynecology, oncology, orthopaedics, traumatology, anesthesiology, child pathology, embryology, and morphology. Written by surgeons, physicians, anesthesiologists, radiologists, and others involved in the surgical care of neonates, infants, and children, the EJPS is an indispensable resource for all specialists.
Latest articles from this journal:
Anorectal Malformation with Rectoperineal Fistula in Females Treated with a Posterior Rectal Advancement Anoplasty: Report of Early Outcomes.
Acquired Diaphragmatic Hernia Following Pediatric Liver Transplantation: Incidence, Risk Factors, and Surgical Outcomes.
Learning Curve and Early Outcomes of Thoracoscopic Anatomical Lesion Resection for Congenital Pulmonary Airway Malformation in Children: A Single-surgeon Experience.
In-Office Pit Excision for Pilonidal Disease Using Needle-Free Local Anesthesia: A Minimally Invasive, Non-Operative Treatment Approach.
Pediatric Empyema in the Post-Pandemic Period: Evaluating Changing Trends in Microbiology, Investigations, Fibrinolysis, and Surgical Outcomes.