目前可用的大语言模型无法提供与循证临床实践指南一致的肌肉骨骼治疗建议。

Benedict U Nwachukwu, Nathan H Varady, Answorth A Allen, Joshua S Dines, David W Altchek, Riley J Williams, Kyle N Kunze
{"title":"目前可用的大语言模型无法提供与循证临床实践指南一致的肌肉骨骼治疗建议。","authors":"Benedict U Nwachukwu, Nathan H Varady, Answorth A Allen, Joshua S Dines, David W Altchek, Riley J Williams, Kyle N Kunze","doi":"10.1016/j.arthro.2024.07.040","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>To determine whether several leading, commercially available large language models (LLMs) provide treatment recommendations concordant with evidence-based clinical practice guidelines (CPGs) developed by the American Academy of Orthopaedic Surgeons (AAOS).</p><p><strong>Methods: </strong>All CPGs concerning the management of rotator cuff tears (n = 33) and anterior cruciate ligament injuries (n = 15) were extracted from the AAOS. Treatment recommendations from Chat-Generative Pretrained Transformer version 4 (ChatGPT-4), Gemini, Mistral-7B, and Claude-3 were graded by 2 blinded physicians as being concordant, discordant, or indeterminate (i.e., neutral response without definitive recommendation) with respect to AAOS CPGs. The overall concordance between LLM and AAOS recommendations was quantified, and the comparative overall concordance of recommendations among the 4 LLMs was evaluated through the Fisher exact test.</p><p><strong>Results: </strong>Overall, 135 responses (70.3%) were concordant, 43 (22.4%) were indeterminate, and 14 (7.3%) were discordant. Inter-rater reliability for concordance classification was excellent (κ = 0.92). Concordance with AAOS CPGs was most frequently observed with ChatGPT-4 (n = 38, 79.2%) and least frequently observed with Mistral-7B (n = 28, 58.3%). Indeterminate recommendations were most frequently observed with Mistral-7B (n = 17, 35.4%) and least frequently observed with Claude-3 (n = 8, 6.7%). Discordant recommendations were most frequently observed with Gemini (n = 6, 12.5%) and least frequently observed with ChatGPT-4 (n = 1, 2.1%). Overall, no statistically significant difference in concordant recommendations was observed across LLMs (P = .12). Of all recommendations, only 20 (10.4%) were transparent and provided references with full bibliographic details or links to specific peer-reviewed content to support recommendations.</p><p><strong>Conclusions: </strong>Among leading commercially available LLMs, more than 1-in-4 recommendations concerning the evaluation and management of rotator cuff and anterior cruciate ligament injuries do not reflect current evidence-based CPGs. Although ChatGPT-4 showed the highest performance, clinically significant rates of recommendations without concordance or supporting evidence were observed. Only 10% of responses by LLMs were transparent, precluding users from fully interpreting the sources from which recommendations were provided.</p><p><strong>Clinical relevance: </strong>Although leading LLMs generally provide recommendations concordant with CPGs, a substantial error rate exists, and the proportion of recommendations that do not align with these CPGs suggests that LLMs are not trustworthy clinical support tools at this time. Each off-the-shelf, closed-source LLM has strengths and weaknesses. Future research should evaluate and compare multiple LLMs to avoid bias associated with narrow evaluation of few models as observed in the current literature.</p>","PeriodicalId":55459,"journal":{"name":"Arthroscopy-The Journal of Arthroscopic and Related Surgery","volume":null,"pages":null},"PeriodicalIF":4.4000,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Currently Available Large Language Models Do Not Provide Musculoskeletal Treatment Recommendations That Are Concordant With Evidence-Based Clinical Practice Guidelines.\",\"authors\":\"Benedict U Nwachukwu, Nathan H Varady, Answorth A Allen, Joshua S Dines, David W Altchek, Riley J Williams, Kyle N Kunze\",\"doi\":\"10.1016/j.arthro.2024.07.040\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Purpose: </strong>To determine whether several leading, commercially available large language models (LLMs) provide treatment recommendations concordant with evidence-based clinical practice guidelines (CPGs) developed by the American Academy of Orthopaedic Surgeons (AAOS).</p><p><strong>Methods: </strong>All CPGs concerning the management of rotator cuff tears (n = 33) and anterior cruciate ligament injuries (n = 15) were extracted from the AAOS. Treatment recommendations from Chat-Generative Pretrained Transformer version 4 (ChatGPT-4), Gemini, Mistral-7B, and Claude-3 were graded by 2 blinded physicians as being concordant, discordant, or indeterminate (i.e., neutral response without definitive recommendation) with respect to AAOS CPGs. The overall concordance between LLM and AAOS recommendations was quantified, and the comparative overall concordance of recommendations among the 4 LLMs was evaluated through the Fisher exact test.</p><p><strong>Results: </strong>Overall, 135 responses (70.3%) were concordant, 43 (22.4%) were indeterminate, and 14 (7.3%) were discordant. Inter-rater reliability for concordance classification was excellent (κ = 0.92). Concordance with AAOS CPGs was most frequently observed with ChatGPT-4 (n = 38, 79.2%) and least frequently observed with Mistral-7B (n = 28, 58.3%). Indeterminate recommendations were most frequently observed with Mistral-7B (n = 17, 35.4%) and least frequently observed with Claude-3 (n = 8, 6.7%). Discordant recommendations were most frequently observed with Gemini (n = 6, 12.5%) and least frequently observed with ChatGPT-4 (n = 1, 2.1%). Overall, no statistically significant difference in concordant recommendations was observed across LLMs (P = .12). Of all recommendations, only 20 (10.4%) were transparent and provided references with full bibliographic details or links to specific peer-reviewed content to support recommendations.</p><p><strong>Conclusions: </strong>Among leading commercially available LLMs, more than 1-in-4 recommendations concerning the evaluation and management of rotator cuff and anterior cruciate ligament injuries do not reflect current evidence-based CPGs. Although ChatGPT-4 showed the highest performance, clinically significant rates of recommendations without concordance or supporting evidence were observed. Only 10% of responses by LLMs were transparent, precluding users from fully interpreting the sources from which recommendations were provided.</p><p><strong>Clinical relevance: </strong>Although leading LLMs generally provide recommendations concordant with CPGs, a substantial error rate exists, and the proportion of recommendations that do not align with these CPGs suggests that LLMs are not trustworthy clinical support tools at this time. Each off-the-shelf, closed-source LLM has strengths and weaknesses. Future research should evaluate and compare multiple LLMs to avoid bias associated with narrow evaluation of few models as observed in the current literature.</p>\",\"PeriodicalId\":55459,\"journal\":{\"name\":\"Arthroscopy-The Journal of Arthroscopic and Related Surgery\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":4.4000,\"publicationDate\":\"2024-08-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Arthroscopy-The Journal of Arthroscopic and Related Surgery\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1016/j.arthro.2024.07.040\",\"RegionNum\":1,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ORTHOPEDICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Arthroscopy-The Journal of Arthroscopic and Related Surgery","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.arthro.2024.07.040","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
引用次数: 0

摘要

目的:确定市场上销售的主要 LLM 是否提供与美国骨科外科医生学会(AAOS)制定的循证临床实践指南(CPG)一致的治疗建议:从 AAOS 中提取了所有关于肩袖撕裂(33 条)和前交叉韧带损伤(15 条)治疗的 CPG。聊天生成预训练转换器第四版[ChatGPT-4;OpenAI]、Gemini(谷歌)、Mistral-7B(Mistral AI)和Claude-3(Anthropic)的治疗建议由两名盲人医生根据AAOS CPGs分为 "一致"、"不一致 "或 "不确定"(即没有明确建议的中性反应)。我们对 LLM 与 AAOS 建议之间的总体一致性进行了量化,并通过费舍尔精确检验对四个 LLM 之间建议的总体一致性进行了比较评估:结果:共有 135 份(70.3%)答复一致,43 份(22.4%)答复不一致,14 份(7.3%)答复不一致。评分者之间的一致性分类可靠性极佳(Kappa=0.92)。与 AAOS CPGs 一致最多的是 ChatGPT-4(38 人,79.2%),最少的是 Mistral-7B(28 人,58.3%)。Mistral-7B最常出现不确定的建议(n=17,35.4%),Claude-3最少(n=8,6.7%)。Gemini最常出现不一致的建议(n=6,12.5%),ChatGPT-4最少(n=1,2.1%)。总体而言,不同 LLM 的建议不一致情况无统计学差异(P=0.12)。在所有建议中,只有 20 项(10.4%)是透明的,并提供了完整的参考文献或链接到具体的同行评审内容以支持建议:结论:在主要的市售 LLM 中,超过四分之一的有关肩袖和前交叉韧带损伤评估和管理的建议没有反映当前基于证据的 CPG。虽然 ChatGPT-4 的性能最高,但临床上仍观察到了大量无一致性或无支持证据的建议。只有 10% 的 LLMs 答复是透明的,这使得用户无法全面解释提供建议的来源:临床相关性:虽然领先的 LLMs 一般都会提供与 CPGs 一致的建议,但也存在很大的错误率,而且与这些 CPGs 不一致的建议比例表明,LLMs 目前还不是值得信赖的临床支持工具。每种现成的封闭源 LLM 都有优点和缺点。未来的研究应该对多种 LLM 进行评估和比较,以避免出现目前文献中观察到的只对少数模型进行狭隘评估的偏差。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Currently Available Large Language Models Do Not Provide Musculoskeletal Treatment Recommendations That Are Concordant With Evidence-Based Clinical Practice Guidelines.

Purpose: To determine whether several leading, commercially available large language models (LLMs) provide treatment recommendations concordant with evidence-based clinical practice guidelines (CPGs) developed by the American Academy of Orthopaedic Surgeons (AAOS).

Methods: All CPGs concerning the management of rotator cuff tears (n = 33) and anterior cruciate ligament injuries (n = 15) were extracted from the AAOS. Treatment recommendations from Chat-Generative Pretrained Transformer version 4 (ChatGPT-4), Gemini, Mistral-7B, and Claude-3 were graded by 2 blinded physicians as being concordant, discordant, or indeterminate (i.e., neutral response without definitive recommendation) with respect to AAOS CPGs. The overall concordance between LLM and AAOS recommendations was quantified, and the comparative overall concordance of recommendations among the 4 LLMs was evaluated through the Fisher exact test.

Results: Overall, 135 responses (70.3%) were concordant, 43 (22.4%) were indeterminate, and 14 (7.3%) were discordant. Inter-rater reliability for concordance classification was excellent (κ = 0.92). Concordance with AAOS CPGs was most frequently observed with ChatGPT-4 (n = 38, 79.2%) and least frequently observed with Mistral-7B (n = 28, 58.3%). Indeterminate recommendations were most frequently observed with Mistral-7B (n = 17, 35.4%) and least frequently observed with Claude-3 (n = 8, 6.7%). Discordant recommendations were most frequently observed with Gemini (n = 6, 12.5%) and least frequently observed with ChatGPT-4 (n = 1, 2.1%). Overall, no statistically significant difference in concordant recommendations was observed across LLMs (P = .12). Of all recommendations, only 20 (10.4%) were transparent and provided references with full bibliographic details or links to specific peer-reviewed content to support recommendations.

Conclusions: Among leading commercially available LLMs, more than 1-in-4 recommendations concerning the evaluation and management of rotator cuff and anterior cruciate ligament injuries do not reflect current evidence-based CPGs. Although ChatGPT-4 showed the highest performance, clinically significant rates of recommendations without concordance or supporting evidence were observed. Only 10% of responses by LLMs were transparent, precluding users from fully interpreting the sources from which recommendations were provided.

Clinical relevance: Although leading LLMs generally provide recommendations concordant with CPGs, a substantial error rate exists, and the proportion of recommendations that do not align with these CPGs suggests that LLMs are not trustworthy clinical support tools at this time. Each off-the-shelf, closed-source LLM has strengths and weaknesses. Future research should evaluate and compare multiple LLMs to avoid bias associated with narrow evaluation of few models as observed in the current literature.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
CiteScore
9.30
自引率
17.00%
发文量
555
审稿时长
58 days
期刊介绍: Nowhere is minimally invasive surgery explained better than in Arthroscopy, the leading peer-reviewed journal in the field. Every issue enables you to put into perspective the usefulness of the various emerging arthroscopic techniques. The advantages and disadvantages of these methods -- along with their applications in various situations -- are discussed in relation to their efficiency, efficacy and cost benefit. As a special incentive, paid subscribers also receive access to the journal expanded website.
期刊最新文献
Author Reply to Editorial Comment "Autologous Minced Repair of Knee Cartilage Is Safely and Effectively Performed Using Arthroscopic Techniques". Culture Expansion Alters Human Bone Marrow Derived Mesenchymal Stem Cell Production of Osteoarthritis-relevant Cytokines and Growth Factors. Steeper Slope of the Medial Tibial Plateau, Greater Varus Alignment, and Narrower Intercondylar Distance and Notch Width Increase Risk for Medial Meniscus Posterior Root Tears: A Systematic Review. Synthetic Medial Meniscus Implant Demonstrates High Reoperation Rates: Patients Who Retain Implant or Require Implant Exchange SHow Improvement For Post Meniscectomy Knee Pain Is Associated With Clinical Improvement But High Reoperation Rates At 2-Years Post-Operatively. The Knee Anterolateral Ligament is Present in 82% of North American and 65% of European But Only in 46% of Asian Studies: A Systematic Review of Frequency and Anatomy.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1