ChatGPT Achieves Only Fair Agreement with ACFAS Expert Panelist Clinical Consensus Statements.

Dominick J Casciato, Joshua Calhoun
{"title":"ChatGPT Achieves Only Fair Agreement with ACFAS Expert Panelist Clinical Consensus Statements.","authors":"Dominick J Casciato, Joshua Calhoun","doi":"10.1177/19386400251319567","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>As artificial intelligence (AI) becomes increasingly integrated into medicine and surgery, its applications are expanding rapidly-from aiding clinical documentation to providing patient information. However, its role in medical decision-making remains uncertain. This study evaluates an AI language model's alignment with clinical consensus statements in foot and ankle surgery.</p><p><strong>Methods: </strong>Clinical consensus statements from the American College of Foot and Ankle Surgeons (ACFAS; 2015-2022) were collected and rated by ChatGPT-o1 as being inappropriate, neither appropriate nor inappropriate, and appropriate. Ten repetitions of the statements were entered into ChatGPT-o1 in a random order, and the model was prompted to assign a corresponding rating. The AI-generated scores were compared to the expert panel's ratings, and intra-rater analysis was performed.</p><p><strong>Results: </strong>The analysis of 9 clinical consensus documents and 129 statements revealed an overall Cohen's kappa of 0.29 (95% CI: 0.12, 0.46), indicating fair alignment between expert panelists and ChatGPT. Overall, ankle arthritis and heel pain showed the highest concordance at 100%, while flatfoot exhibited the lowest agreement at 25%, reflecting variability between ChatGPT and expert panelists. Among the ChatGPT ratings, Cohen's kappa values ranged from 0.41 to 0.92, highlighting variability in internal reliability across topics.</p><p><strong>Conclusion: </strong>ChatGPT achieved overall fair agreement and demonstrated variable consistency when repetitively rating ACFAS expert panel clinical practice guidelines representing a variety of topics. These data reflect the need for further study of the causes, impacts, and solutions for this disparity between intelligence and human intelligence.</p><p><strong>Level of evidence: </strong>Level IV: Retrospective cohort study.</p>","PeriodicalId":73046,"journal":{"name":"Foot & ankle specialist","volume":" ","pages":"19386400251319567"},"PeriodicalIF":0.0000,"publicationDate":"2025-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Foot & ankle specialist","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1177/19386400251319567","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Introduction: As artificial intelligence (AI) becomes increasingly integrated into medicine and surgery, its applications are expanding rapidly, from aiding clinical documentation to providing patient information. However, its role in medical decision-making remains uncertain. This study evaluates an AI language model's alignment with clinical consensus statements in foot and ankle surgery.

Methods: Clinical consensus statements from the American College of Foot and Ankle Surgeons (ACFAS; 2015-2022) were collected and rated by ChatGPT-o1 as inappropriate, neither appropriate nor inappropriate, or appropriate. The statements were entered into ChatGPT-o1 ten times, each time in a random order, and the model was prompted to assign a corresponding rating. The AI-generated ratings were compared with the expert panel's ratings, and an intra-rater analysis was performed.
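The abstract does not describe how the repeated prompting was implemented. The Python sketch below shows one way such a protocol could be structured: shuffling the statements before each of the ten passes and recording a three-level rating per statement. The `rate_statement` function is a hypothetical placeholder for the actual prompt sent to ChatGPT-o1, which is not specified here.

```python
import random

RATINGS = ("inappropriate", "neither appropriate nor inappropriate", "appropriate")


def rate_statement(statement: str) -> str:
    """Hypothetical stand-in for a call to ChatGPT-o1 that returns one of the
    three consensus ratings; the actual prompt and interface are not specified
    in the abstract."""
    raise NotImplementedError


def collect_ratings(statements: list[str], repetitions: int = 10, seed: int = 0) -> dict[str, list[str]]:
    """Present every statement to the model `repetitions` times, shuffling the
    presentation order on each pass, and collect the ratings per statement."""
    rng = random.Random(seed)
    ratings: dict[str, list[str]] = {s: [] for s in statements}
    for _ in range(repetitions):
        order = statements[:]      # copy so the original list is left untouched
        rng.shuffle(order)         # new random presentation order for each pass
        for statement in order:
            ratings[statement].append(rate_statement(statement))
    return ratings
```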

Results: The analysis of 9 clinical consensus documents and 129 statements revealed an overall Cohen's kappa of 0.29 (95% CI: 0.12, 0.46), indicating fair agreement between the expert panelists and ChatGPT. By topic, ankle arthritis and heel pain showed the highest concordance at 100%, while flatfoot showed the lowest agreement at 25%, reflecting variability between ChatGPT and the expert panelists. Within the repeated ChatGPT ratings, Cohen's kappa values ranged from 0.41 to 0.92, highlighting variable internal reliability across topics.
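To illustrate the agreement statistic reported above, the sketch below computes Cohen's kappa between two sets of ratings with scikit-learn's `cohen_kappa_score` and attaches a percentile-bootstrap 95% confidence interval. The ratings shown are invented placeholders, not study data, and the authors' actual analysis pipeline is not described in the abstract.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score


def kappa_with_ci(panel, model, n_boot: int = 2000, seed: int = 0):
    """Cohen's kappa between two raters plus a percentile-bootstrap 95% CI."""
    panel, model = np.asarray(panel), np.asarray(model)
    kappa = cohen_kappa_score(panel, model)
    rng = np.random.default_rng(seed)
    n = len(panel)
    boots = [
        cohen_kappa_score(panel[idx], model[idx])          # kappa on resampled statements
        for idx in (rng.integers(0, n, n) for _ in range(n_boot))
    ]
    low, high = np.percentile(boots, [2.5, 97.5])
    return kappa, (low, high)


# Placeholder ratings (0 = inappropriate, 1 = neither, 2 = appropriate); not study data.
panel_ratings = [2, 2, 1, 0, 2, 1, 2, 0, 2, 1, 2, 2, 0, 1, 2, 2, 1, 0, 2, 1]
model_ratings = [2, 1, 1, 0, 2, 2, 2, 1, 2, 1, 2, 0, 0, 1, 2, 1, 1, 0, 2, 2]
print(kappa_with_ci(panel_ratings, model_ratings))
```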

Conclusion: ChatGPT achieved only fair overall agreement and demonstrated variable consistency when repeatedly rating ACFAS expert panel clinical consensus statements spanning a variety of topics. These data reflect the need for further study of the causes, impacts, and solutions for this disparity between artificial intelligence and human intelligence.

Level of evidence: Level IV: Retrospective cohort study.
