Performance of Artificial Intelligence in Addressing Questions Regarding Management of Osteochondritis Dissecans.

IF 2.6 · CAS Tier 2 (Medicine) · Q1 (Sport Sciences) · Sports Health: A Multidisciplinary Approach · Pub Date: 2025-11-01 · Epub Date: 2025-04-01 · DOI: 10.1177/19417381251326549
John D Milner, Matthew S Quinn, Phillip Schmitt, Rigel P Hall, Steven Bokshan, Logan Petit, Ryan O'Donnell, Stephen E Marcaccio, Steven F DeFroda, Ramin R Tabaddor, Brett D Owens
Journal: Sports Health: A Multidisciplinary Approach, pages 1340-1346. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11966633/pdf/
Citations: 0

Abstract

Background: Large language model (LLM)-based artificial intelligence (AI) chatbots, such as ChatGPT and Gemini, have become widespread sources of information. Few studies have evaluated LLM responses to questions about orthopaedic conditions, especially osteochondritis dissecans (OCD).

Hypothesis: ChatGPT and Gemini will generate accurate responses that align with American Academy of Orthopaedic Surgeons (AAOS) clinical practice guidelines.

Study design: Cohort study.

Level of evidence: Level 2.

Methods: LLM prompts were created based on AAOS clinical guidelines on OCD diagnosis and treatment, and responses from ChatGPT and Gemini were collected. Seven fellowship-trained orthopaedic surgeons evaluated LLM responses on a 5-point Likert scale, based on 6 categories: relevance, accuracy, clarity, completeness, evidence-based, and consistency.

Results: ChatGPT and Gemini exhibited strong performance across all criteria. ChatGPT mean scores were highest for clarity (4.771 ± 0.141 [mean ± SD]). Gemini scored highest for relevance and accuracy (4.286 ± 0.296, 4.286 ± 0.273). For both LLMs, the lowest scores were for evidence-based responses (ChatGPT, 3.857 ± 0.352; Gemini, 3.743 ± 0.353). For all other categories, ChatGPT mean scores were higher than Gemini scores. The consistency of responses between the 2 LLMs was rated at an overall mean of 3.486 ± 0.371. Inter-rater reliability ranged from 0.4 to 0.67 (mean, 0.59) and was highest (0.67) in the accuracy category and lowest (0.4) in the consistency category.
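The per-category scores reported above are means ± standard deviations across the seven raters' 5-point Likert ratings. A minimal sketch of that aggregation step (the rating values below are invented for illustration only, not the study's raw data):

```python
import statistics

# Hypothetical Likert ratings (1-5) from 7 raters for one LLM, per category.
# These numbers are illustrative placeholders, not the study's actual data.
ratings = {
    "relevance": [5, 4, 5, 4, 5, 4, 5],
    "accuracy":  [4, 5, 4, 4, 5, 4, 4],
    "clarity":   [5, 5, 5, 4, 5, 5, 5],
}

def summarize(scores):
    """Return (mean, sample standard deviation) for a list of Likert scores."""
    return statistics.mean(scores), statistics.stdev(scores)

for category, scores in ratings.items():
    mean, sd = summarize(scores)
    print(f"{category}: {mean:.3f} \u00b1 {sd:.3f}")
```

The abstract does not state which inter-rater reliability statistic was used (e.g., an intraclass correlation coefficient or a kappa variant), so that computation is not reproduced here.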

Conclusion: LLM performance emphasizes the potential for gathering clinically relevant and accurate answers to questions regarding the diagnosis and treatment of OCD and suggests that ChatGPT may be a better model for this purpose than the Gemini model. Further evaluation of LLM information regarding other orthopaedic procedures and conditions may be necessary before LLMs can be recommended as an accurate source of orthopaedic information.

Clinical relevance: Little is known about the ability of AI to provide answers regarding OCD.

Source journal: Sports Health: A Multidisciplinary Approach (Medicine - Orthopedics and Sports Medicine)
CiteScore: 6.90 · Self-citation rate: 9.10% · Articles per year: 101
About the journal: Sports Health: A Multidisciplinary Approach is an indispensable resource for all medical professionals involved in the training and care of the competitive or recreational athlete, including primary care physicians, orthopaedic surgeons, physical therapists, athletic trainers, and other medical and health care professionals. Published bimonthly, Sports Health is a collaborative publication from the American Orthopaedic Society for Sports Medicine (AOSSM), the American Medical Society for Sports Medicine (AMSSM), the National Athletic Trainers' Association (NATA), and the Sports Physical Therapy Section (SPTS). The journal publishes review articles, original research articles, case studies, images, short updates, legal briefs, editorials, and letters to the editor. Topics include:
- Sports Injury and Treatment
- Care of the Athlete
- Athlete Rehabilitation
- Medical Issues in the Athlete
- Surgical Techniques in Sports Medicine
- Case Studies in Sports Medicine
- Images in Sports Medicine
- Legal Issues
- Pediatric Athletes
- General Sports Trauma
- Sports Psychology