Employing Large Language Models for Surgical Education: An In-depth Analysis of ChatGPT-4

Adrian Hang Yue Siu, Damien Gibson, Xin Mu, Ishith Seth, Alexander Chi Wang Siu, Dilshad Dooreemeah, Angus Lee
{"title":"Emplying Large Language Models for Surgical Education: An In-depth Analysis of ChatGPT-4","authors":"Adrian Hang Yue Siu, Damien Gibson, Xin Mu, Ishith Seth, Alexander Chi Wang Siu, Dilshad Dooreemeah, Angus Lee","doi":"10.5812/jme-137753","DOIUrl":null,"url":null,"abstract":"Background: The growing interest in artificial intelligence (AI) has spurred an increase in the availability of Large Language Models (LLMs) in surgical education. These LLMs hold the potential to augment medical curricula for future healthcare professionals, facilitating engagement in remote learning experiences, and assisting in personalised student feedback. Objectives: To evaluate the ability of LLMs to assist junior doctors in providing advice for common ward-based surgical scenarios with increasing complexity. Methods: Utilising an instrumental case study approach, this study explored the potential of LLMs by comparing the responses of the ChatGPT-4, BingAI and BARD. LLMs were prompted by 3 common ward-based surgical scenarios and tasked with assisting junior doctors in clinical decision-making. The outputs were assessed by a panel of two senior surgeons with extensive experience in AI and education, qualitatively utilising a Likert scale on their accuracy, safety, and effectiveness to determine their viability as a synergistic tool in surgical education. A quantitative assessment of their reliability and readability was conducted using the DISCERN score and a set of reading scores, including the Flesch Reading Ease Score, Flesch-Kincaid Grade Level, and Coleman-Liau index. Results: BARD proved superior in readability, with Flesch Reading Ease Score 50.13 (± 5.00), Flesch-Kincaid Grade Level 9.33 (± 0.76), and Coleman-Liau index 11.67 (± 0.58). ChatGPT-4 outperformed BARD and BingAI, with the highest DISCERN score of 71.7 (± 2.52). Using a Likert scale-based framework, the surgical expert panel further affirmed that the advice provided by the ChatGPT-4 was suitable and safe for first-year interns and residents. A t-test showed statistical significance in reliability among all three AIs (P < 0.05) and readability only between the ChatGPT-4 and BARD. This study underscores the potential for LLM integration in surgical education, particularly ChatGPT, in the provision of reliable and accurate information. Conclusions: This study highlighted the potential of LLM, specifically ChatGPT-4, as a valuable educational resource for junior doctors. The findings are limited by the potential of non-generalizability of the use of junior doctors' simulated scenarios. Future work should aim to optimise learning experiences and better support surgical trainees. Particular attention should be paid to addressing the longitudinal impact of LLMs, refining AI models, validating AI content, and exploring technological amalgamations for improved outcomes.","PeriodicalId":31052,"journal":{"name":"Journal of Medical Education","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Medical Education","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5812/jme-137753","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Background: The growing interest in artificial intelligence (AI) has spurred an increase in the availability of large language models (LLMs) in surgical education. These LLMs hold the potential to augment medical curricula for future healthcare professionals, facilitate engagement in remote learning, and assist in personalised student feedback.

Objectives: To evaluate the ability of LLMs to assist junior doctors by providing advice for common ward-based surgical scenarios of increasing complexity.

Methods: Using an instrumental case study approach, this study explored the potential of LLMs by comparing the responses of ChatGPT-4, BingAI, and BARD. The LLMs were prompted with three common ward-based surgical scenarios and tasked with assisting junior doctors in clinical decision-making. The outputs were assessed qualitatively by a panel of two senior surgeons with extensive experience in AI and education, who rated their accuracy, safety, and effectiveness on a Likert scale to determine each model's viability as a synergistic tool in surgical education. Reliability and readability were assessed quantitatively using the DISCERN score and a set of readability metrics: the Flesch Reading Ease Score, the Flesch-Kincaid Grade Level, and the Coleman-Liau Index.

Results: BARD proved superior in readability, with a Flesch Reading Ease Score of 50.13 (± 5.00), a Flesch-Kincaid Grade Level of 9.33 (± 0.76), and a Coleman-Liau Index of 11.67 (± 0.58). ChatGPT-4 outperformed BARD and BingAI on reliability, with the highest DISCERN score of 71.7 (± 2.52). Using a Likert scale-based framework, the surgical expert panel further affirmed that the advice provided by ChatGPT-4 was suitable and safe for first-year interns and residents. T-tests showed statistically significant differences in reliability among all three models (P < 0.05), and in readability only between ChatGPT-4 and BARD.

Conclusions: This study underscores the potential of LLMs, specifically ChatGPT-4, to provide reliable and accurate information and to serve as a valuable educational resource for junior doctors. The findings are limited by the possible non-generalisability of the simulated junior-doctor scenarios. Future work should aim to optimise learning experiences and better support surgical trainees, with particular attention to the longitudinal impact of LLMs, refining AI models, validating AI-generated content, and exploring technological combinations for improved outcomes.
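For context, the three readability indices named in the Methods are standard formulas computed from surface statistics of a text, while the DISCERN score is a 16-item instrument scored by human raters (each item 1-5, for a maximum of 80), which is why the panel's judgments rather than a formula appear in the Results. The Python sketch below illustrates how the three readability indices are computed; the syllable counter is a simple vowel-group heuristic rather than the dictionary-based counting a production tool would use, and the sample sentence is invented, not drawn from the study's outputs, so exact values will differ from the paper's.

```python
import re

def count_syllables(word: str) -> int:
    """Approximate syllables as runs of vowels, with a silent-'e' adjustment."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1  # drop a trailing silent 'e'
    return max(n, 1)

def readability(text: str) -> dict:
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)
    syllables = sum(count_syllables(w) for w in words)
    letters = sum(ch.isalpha() for ch in text)

    wps = n_words / sentences       # words per sentence
    spw = syllables / n_words       # syllables per word
    L = 100 * letters / n_words     # letters per 100 words
    S = 100 * sentences / n_words   # sentences per 100 words

    return {
        # Flesch Reading Ease: higher = easier to read (BARD averaged ~50)
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        # Flesch-Kincaid Grade Level: approximate US school grade
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
        # Coleman-Liau Index: grade level from letter/sentence densities
        "coleman_liau_index": 0.0588 * L - 0.296 * S - 15.8,
    }

# Hypothetical ward-advice snippet, not taken from the study's outputs.
sample = ("Review the patient at the bedside. Check vital signs, examine the "
          "abdomen, and escalate to the registrar if pain or fever worsens.")
print(readability(sample))
```

All three indices reward shorter sentences and shorter words, which is why a model can score well on readability (as BARD did) while another scores higher on the rater-driven DISCERN reliability measure (as ChatGPT-4 did).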