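The performance of ChatGPT versus neurosurgery residents in neurosurgical board examination-like questions: a systematic review and meta-analysis.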

IF 2.8 · Zone 3 (Medicine) · Q2 CLINICAL NEUROLOGY · Neurosurgical Review · Pub Date: 2024-12-07 · DOI: 10.1007/s10143-024-03144-y

Edgar Dominic A Bongco, Sean Kendrich N Cua, Mary Angeline Luz U Hernandez, Juan Silvestre G Pascual, Kathleen Joy O Khu

Neurosurgical Review 47(1): 892

Citations: 0

Objective: Large language models (LLMs) such as ChatGPT have been used in various fields of medical education. This study aimed to review the literature comparing the performance of ChatGPT with that of neurosurgery residents on neurosurgery board examination-like questions.

Methods: A literature search was performed following PRISMA guidelines, covering the period from ChatGPT's release (November 2022) to October 25, 2024. Two reviewers screened for eligible studies, selecting those that used ChatGPT to answer neurosurgery board examination-like questions and compared the results with neurosurgery residents' scores. Risk of bias was assessed using the JBI critical appraisal tool. Overall effect sizes and 95% confidence intervals were determined using a fixed-effects model with alpha set at 0.05.
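The Methods name a fixed-effects model with 95% confidence intervals but do not give the formula or the per-study data. A minimal sketch of the standard inverse-variance fixed-effect approach follows, assuming each study contributes one effect estimate (for example, a difference in accuracy between residents and ChatGPT) and its standard error. The abstract does not state which effect measure the authors used; the function name `fixed_effect_pool` is hypothetical, and this is not the authors' analysis code.

```python
import math
from statistics import NormalDist


def fixed_effect_pool(effects, std_errors, alpha=0.05):
    """Inverse-variance fixed-effect pooling (hypothetical helper, not the authors' code).

    effects: per-study effect estimates; std_errors: their standard errors (> 0).
    Returns (pooled effect, CI lower bound, CI upper bound, two-sided p-value).
    """
    if not effects or len(effects) != len(std_errors):
        raise ValueError("need at least one study and one standard error per effect")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")

    # Each study is weighted by the reciprocal of its variance: w_i = 1 / SE_i^2.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total_w
    pooled_se = math.sqrt(1.0 / total_w)

    # Two-sided (1 - alpha) confidence interval; z_crit is about 1.96 for alpha = 0.05.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    lower, upper = pooled - z_crit * pooled_se, pooled + z_crit * pooled_se

    # z-test of the null hypothesis that the pooled effect is zero.
    z = pooled / pooled_se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return pooled, lower, upper, p_value


# Hypothetical inputs for illustration only (not data from the reviewed studies).
pooled, lower, upper, p = fixed_effect_pool([0.05, 0.10, -0.02], [0.02, 0.03, 0.04])
```

Because weights scale with 1/SE², studies with many questions (small standard errors) dominate the pooled estimate, which is why removing a single heavily weighted study can shift the overall result.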

Results: After screening, six studies were selected for qualitative and quantitative analysis. ChatGPT's accuracy ranged from 50.4% to 78.8%, compared with 58.3% to 73.7% for residents. Risk of bias was low in 4 of the 6 studies reviewed; the remaining 2 had moderate risk. There was an overall trend favoring neurosurgery residents over ChatGPT (p < 0.00001), with high heterogeneity (I² = 96%). Subgroup analysis of studies that used Self-assessment in Neurosurgery (SANS) examination questions yielded similar findings. However, in the sensitivity analysis, removing the highest-weighted study shifted the results toward better performance by ChatGPT.
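The heterogeneity statistic and the sensitivity analysis in the Results can be illustrated with a short sketch. It assumes a generic inverse-variance setup (one effect estimate and standard error per study), computes Cochran's Q and I² = max(0, (Q − df) / Q) × 100% with df = k − 1, and re-pools after dropping the study with the largest fixed-effect weight. The function names are hypothetical, and since the abstract does not report per-study values, none are reproduced here.

```python
def heterogeneity_i2(effects, std_errors):
    """Cochran's Q and I^2 (as a percentage) under a fixed-effect model.

    Hypothetical helper for illustration; not the authors' analysis code.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Q: weighted sum of squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2 is truncated at 0 when Q does not exceed its degrees of freedom.
    i2 = 0.0 if q <= 0 else max(0.0, (q - df) / q) * 100.0
    return q, i2


def drop_highest_weight(effects, std_errors):
    """Leave-one-out check: remove the study with the largest weight (smallest SE)
    and re-pool the rest. Returns (index of removed study, new pooled effect).

    Hypothetical helper for illustration; not the authors' analysis code.
    """
    if len(effects) < 2:
        raise ValueError("need at least two studies to drop one")
    idx = min(range(len(std_errors)), key=lambda i: std_errors[i])
    kept_y = [y for i, y in enumerate(effects) if i != idx]
    kept_se = [se for i, se in enumerate(std_errors) if i != idx]
    weights = [1.0 / se ** 2 for se in kept_se]
    pooled = sum(w * y for w, y in zip(weights, kept_y)) / sum(weights)
    return idx, pooled
```

An I² near 96% means almost all of the observed variation between studies reflects true differences rather than sampling error, which is consistent with a single dominant study being able to reverse the direction of the pooled result.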

Conclusion: Our meta-analysis showed that neurosurgery residents outperformed ChatGPT on neurosurgery board examination-like questions, although the reviewed studies were highly heterogeneous. Further improvement is needed before ChatGPT can become a useful and reliable supplementary tool in neurosurgical education.

Source journal

Neurosurgical Review (Medicine: Clinical Neurology)

CiteScore: 5.60

Self-citation rate: 7.10%

Articles published: 191

Review time: 6–12 weeks

Journal description: The goal of Neurosurgical Review is to provide a forum for comprehensive reviews on current issues in neurosurgery. Each issue contains up to three reviews, reflecting all important aspects of one topic (a disease or a surgical approach). Comments by a panel of experts within the same issue complete the topic. By providing comprehensive coverage of one topic per issue, Neurosurgical Review combines the topicality of professional journals with the in-depth treatment of a monograph. High-quality original papers are also welcome.