Responses From ChatGPT-4 Show Limited Correlation With Expert Consensus Statement on Anterior Shoulder Instability

Arthroscopy Sports Medicine and Rehabilitation (JCR Q3, Medicine). Pub Date: 2024-06-01. Epub Date: 2024-03-05. DOI: 10.1016/j.asmr.2024.100923

Alexander Artamonov, M.D., Ira Bachar-Avnieli, M.D., Eyal Klang, M.D., Omri Lubovsky, M.D., Ehud Atoun, M.D., Alexander Bermant, M.D., Philip J. Rosinsky, M.D.

Arthroscopy Sports Medicine and Rehabilitation, Volume 6, Issue 3, Article 100923. Full text: https://www.sciencedirect.com/science/article/pii/S2666061X24000415


Abstract

Purpose

To compare the similarity of answers provided by Generative Pretrained Transformer-4 (GPT-4) with those of a consensus statement on diagnosis, nonoperative management, and Bankart repair in anterior shoulder instability (ASI).

Methods

An expert consensus statement on ASI published by Hurley et al. in 2022 was reviewed, and the questions posed to the expert panel were extracted. GPT-4, the subscription version of ChatGPT, was queried using the same set of questions. Answers provided by GPT-4 were compared with those of the expert panel and subjectively rated for similarity by 2 experienced shoulder surgeons. GPT-4 was then used to rate the similarity of its own responses to the consensus statement, classifying them as low, medium, or high. The similarity ratings assigned by the shoulder surgeons and by GPT-4 were then compared, and interobserver reliability was calculated using weighted κ scores.
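
The weighted κ step can be illustrated with a minimal Python sketch. The abstract does not state which weighting scheme was used, so linear weights are an assumption here, and the two short rating lists are hypothetical placeholders rather than study data. The function name weighted_kappa is likewise illustrative.

```python
from collections import Counter

LEVELS = ["low", "medium", "high"]  # ordinal categories named in the abstract


def weighted_kappa(rater_a, rater_b, levels=LEVELS):
    """Cohen's kappa with linear weights for two raters on an ordinal scale.

    Assumption: linear weights; the abstract does not specify the scheme.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    index = {level: i for i, level in enumerate(levels)}
    n = len(rater_a)
    k = len(levels)

    observed = Counter(zip(rater_a, rater_b))  # joint counts per (a, b) pair
    marg_a = Counter(rater_a)                  # marginal counts, rater A
    marg_b = Counter(rater_b)                  # marginal counts, rater B

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for a in levels:
        for b in levels:
            # Disagreement weight grows with the distance between categories.
            weight = abs(index[a] - index[b]) / (k - 1)
            observed_disagreement += weight * observed[(a, b)] / n
            expected_disagreement += weight * marg_a[a] * marg_b[b] / (n * n)

    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical ratings for four questions (NOT taken from the study).
surgeons = ["high", "medium", "low", "medium"]
gpt4 = ["high", "high", "medium", "medium"]
print(round(weighted_kappa(surgeons, gpt4), 3))
```

With weighted κ, a high-versus-low disagreement is penalized more heavily than a high-versus-medium one, which suits an ordinal low/medium/high scale better than unweighted κ.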

Results

The degree of similarity between responses of GPT-4 and the ASI consensus statement, as rated by the shoulder surgeons, was high in 25.8%, medium in 45.2%, and low in 29% of questions. GPT-4 assessed similarity as high in 48.3%, medium in 41.9%, and low in 9.7% of questions. Surgeons and GPT-4 agreed on the classification of 18 questions (58.1%) and disagreed on 13 questions (41.9%).
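
The reported percentages can be cross-checked against the question total implied by the agreement figures (18 + 13 = 31). The short Python sketch below is only an arithmetic check: the per-category counts it prints are inferred from the percentages and are not reported in the abstract, and the helper name implied_count is hypothetical.

```python
# Total implied by the abstract: 18 agreements + 13 disagreements.
TOTAL_QUESTIONS = 18 + 13

# Percentages as reported in the abstract.
surgeon_pct = {"high": 25.8, "medium": 45.2, "low": 29.0}
gpt4_pct = {"high": 48.3, "medium": 41.9, "low": 9.7}


def implied_count(percent, total=TOTAL_QUESTIONS):
    """Nearest whole number of questions for a reported percentage (inferred)."""
    return round(percent / 100 * total)


surgeon_counts = {level: implied_count(p) for level, p in surgeon_pct.items()}
gpt4_counts = {level: implied_count(p) for level, p in gpt4_pct.items()}

print(surgeon_counts, sum(surgeon_counts.values()))
print(gpt4_counts, sum(gpt4_counts.values()))
print(round(100 * 18 / TOTAL_QUESTIONS, 1), round(100 * 13 / TOTAL_QUESTIONS, 1))
```

Both sets of inferred counts sum to 31, consistent with the agreement total. The GPT-4 percentages as printed sum to 99.9%, and 15 of 31 rounds to 48.4%, so the reported 48.3% appears to be truncated rather than rounded.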

Conclusions

The responses generated by artificial intelligence exhibit limited correlation with an expert statement on the diagnosis and treatment of ASI.

Clinical Relevance

As the use of artificial intelligence becomes more prevalent, it is important to understand how closely the information it generates resembles content produced by human authors.


