How Soon Will Surgeons Become Mere Technicians? Chatbot Performance in Managing Clinical Scenarios.

IF 4.9 · CAS Zone 1 (Medicine) · Q1 Cardiac & Cardiovascular Systems · Journal of Thoracic and Cardiovascular Surgery · Pub Date: 2024-11-11 · DOI: 10.1016/j.jtcvs.2024.11.006
Darren S Bryan, Joseph J Platz, Keith S Naunheim, Mark K Ferguson
Citations: 0

Abstract


Objective: Chatbot use has developed a presence in medicine and surgery and has been proposed to help guide clinical decision making. However, the accuracy of information provided by artificial intelligence (AI) platforms has been questioned. We evaluated the performance of 4 popular chatbots on a board-style examination and compared results with a group of board-certified thoracic surgeons.

Methods: Clinical scenarios were developed within domains based on the ABTS Qualifying Exam. Each scenario included three stems written with the Key Feature methodology related to diagnosis, evaluation, and treatment. Ten scenarios were presented to ChatGPT-4, Bard (now Gemini), Perplexity, and Claude 2, as well as randomly selected ABTS-certified surgeons. The maximum possible score was 3 points per scenario. Critical failures were identified during exam development; if they occurred in any of the 3 stems the entire question received a score of 0. The Mann-Whitney U test was used to compare surgeon and chatbot scores.
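The scoring rule described above (three Key Feature stems per scenario, a maximum of 3 points, and any critical failure zeroing the entire scenario) can be sketched as follows. This is an illustrative reconstruction: the function name and the partial-credit values are assumptions, not details from the study.

```python
def score_scenario(stem_scores, critical_failure):
    """Score one clinical scenario.

    stem_scores      -- per-stem credit for the 3 Key Feature stems
                        (partial credit shown here is hypothetical)
    critical_failure -- True if any stem contained a critical failure,
                        which zeroes the whole scenario per the exam rules
    """
    if critical_failure:
        return 0.0
    # Maximum possible score is 3 points (1 point per stem)
    return sum(stem_scores)
```

Under this rule, a response with solid answers on two stems still scores 0 if the third stem contains a critical failure, which is consistent with the higher critical-failure rates driving down chatbot scores in the Results.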

Results: Examinations were completed by 21 surgeons, the majority of whom (14; 66%) practiced in academic or university settings. The median score per scenario for chatbots was 1.06 compared to 1.88 for surgeons (difference 0.66, p=0.019). Surgeon median scores were better than chatbot median scores for all except two scenarios. Chatbot answers were significantly more likely to be deemed critical failures compared to those provided by surgeons (median 0.50 per chatbot/scenario vs. 0.19 per surgeon/scenario; p=0.016).
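The group comparison above relies on the Mann-Whitney U test, a rank-based nonparametric test. A minimal pure-Python sketch of the underlying U statistic is shown below; it is illustrative only (the study's actual data and statistical software are not given here), and in practice one would use a standard implementation such as `scipy.stats.mannwhitneyu`, which also supplies the p-value.

```python
def mann_whitney_u(xs, ys):
    """Return the Mann-Whitney U statistic for the first sample,
    computed from joint ranks with average ranks assigned to ties."""
    combined = sorted((v, idx) for idx, v in enumerate(xs + ys))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        # Extend j over a run of tied values
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg_rank
        i = j + 1
    r1 = sum(ranks[: len(xs)])  # rank sum of the first sample
    # U1 = R1 - n1(n1 + 1)/2; ranges from 0 to n1 * n2
    return r1 - len(xs) * (len(xs) + 1) / 2
```

When every value in the first sample is smaller than every value in the second, U is 0; when every value is larger, U reaches its maximum of n1 × n2.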

Conclusions: Four popular chatbots performed at a significantly lower level than board-certified surgeons. Implementation of AI should be undertaken with caution in clinical decision making.

Journal metrics: CiteScore 11.20 · Self-citation rate 10.00% · Articles per year 1,079 · Review time 68 days
Journal overview: The Journal of Thoracic and Cardiovascular Surgery presents original, peer-reviewed articles on diseases of the heart, great vessels, lungs and thorax with emphasis on surgical interventions. An official publication of The American Association for Thoracic Surgery and The Western Thoracic Surgical Association, the Journal focuses on techniques and developments in acquired cardiac surgery, congenital cardiac repair, thoracic procedures, heart and lung transplantation, mechanical circulatory support and other procedures.