Performance of Artificial Intelligence in Addressing Questions Regarding Management of Osteochondritis Dissecans.

IF 2.6 · CAS Tier 2 (Medicine) · Q1 (Sport Sciences) · Sports Health: A Multidisciplinary Approach · Pub Date: 2025-11-01 · Epub Date: 2025-04-01 · DOI: 10.1177/19417381251326549
John D Milner, Matthew S Quinn, Phillip Schmitt, Rigel P Hall, Steven Bokshan, Logan Petit, Ryan O'Donnell, Stephen E Marcaccio, Steven F DeFroda, Ramin R Tabaddor, Brett D Owens
Journal: Sports Health: A Multidisciplinary Approach, pages 1340-1346. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11966633/pdf/
Citations: 0

Abstract

Background: Large language model (LLM)-based artificial intelligence (AI) chatbots, such as ChatGPT and Gemini, have become widespread sources of information. Few studies have evaluated LLM responses to questions about orthopaedic conditions, especially osteochondritis dissecans (OCD).

Hypothesis: ChatGPT and Gemini will generate accurate responses that align with American Academy of Orthopaedic Surgeons (AAOS) clinical practice guidelines.

Study design: Cohort study.

Level of evidence: Level 2.

Methods: LLM prompts were created based on AAOS clinical guidelines on OCD diagnosis and treatment, and responses from ChatGPT and Gemini were collected. Seven fellowship-trained orthopaedic surgeons evaluated LLM responses on a 5-point Likert scale, based on 6 categories: relevance, accuracy, clarity, completeness, evidence-based, and consistency.

Results: ChatGPT and Gemini exhibited strong performance across all criteria. ChatGPT mean scores were highest for clarity (4.771 ± 0.141 [mean ± SD]). Gemini scored highest for relevance and accuracy (4.286 ± 0.296, 4.286 ± 0.273). For both LLMs, the lowest scores were for evidence-based responses (ChatGPT, 3.857 ± 0.352; Gemini, 3.743 ± 0.353). For all other categories, ChatGPT mean scores were higher than Gemini scores. The consistency of responses between the 2 LLMs was rated at an overall mean of 3.486 ± 0.371. Inter-rater reliability ranged from 0.4 to 0.67 (mean, 0.59) and was highest (0.67) in the accuracy category and lowest (0.4) in the consistency category.
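The per-category scores reported above are means ± standard deviations across the seven raters' 5-point Likert ratings. A minimal sketch of that aggregation step (the rating values below are invented for illustration only, not the study's raw data):

```python
import statistics

# Hypothetical Likert ratings (1-5) from 7 raters for one LLM, per category.
# These numbers are illustrative placeholders, not the study's actual data.
ratings = {
    "relevance": [5, 4, 5, 4, 5, 4, 5],
    "accuracy":  [4, 5, 4, 4, 5, 4, 4],
    "clarity":   [5, 5, 5, 4, 5, 5, 5],
}

def summarize(scores):
    """Return (mean, sample standard deviation) for a list of Likert scores."""
    return statistics.mean(scores), statistics.stdev(scores)

for category, scores in ratings.items():
    mean, sd = summarize(scores)
    print(f"{category}: {mean:.3f} \u00b1 {sd:.3f}")
```

The abstract does not state which inter-rater reliability statistic was used (e.g., an intraclass correlation coefficient or a kappa variant), so that computation is not reproduced here.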

Conclusion: LLM performance emphasizes the potential for gathering clinically relevant and accurate answers to questions regarding the diagnosis and treatment of OCD and suggests that ChatGPT may be a better model for this purpose than the Gemini model. Further evaluation of LLM information regarding other orthopaedic procedures and conditions may be necessary before LLMs can be recommended as an accurate source of orthopaedic information.

Clinical relevance: Little is known about the ability of AI to provide answers regarding OCD.

Source journal: Sports Health: A Multidisciplinary Approach (Medicine - Orthopedics and Sports Medicine)
CiteScore: 6.90 · Self-citation rate: 9.10% · Articles per year: 101
About the journal: Sports Health: A Multidisciplinary Approach is an indispensable resource for all medical professionals involved in the training and care of the competitive or recreational athlete, including primary care physicians, orthopaedic surgeons, physical therapists, athletic trainers, and other medical and health care professionals. Published bimonthly, Sports Health is a collaborative publication from the American Orthopaedic Society for Sports Medicine (AOSSM), the American Medical Society for Sports Medicine (AMSSM), the National Athletic Trainers' Association (NATA), and the Sports Physical Therapy Section (SPTS). The journal publishes review articles, original research articles, case studies, images, short updates, legal briefs, editorials, and letters to the editor. Topics include:
- Sports Injury and Treatment
- Care of the Athlete
- Athlete Rehabilitation
- Medical Issues in the Athlete
- Surgical Techniques in Sports Medicine
- Case Studies in Sports Medicine
- Images in Sports Medicine
- Legal Issues
- Pediatric Athletes
- General Sports Trauma
- Sports Psychology