{"title":"ChatGPT Achieves Only Fair Agreement with ACFAS Expert Panelist Clinical Consensus Statements.","authors":"Dominick J Casciato, Joshua Calhoun","doi":"10.1177/19386400251319567","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>As artificial intelligence (AI) becomes increasingly integrated into medicine and surgery, its applications are expanding rapidly-from aiding clinical documentation to providing patient information. However, its role in medical decision-making remains uncertain. This study evaluates an AI language model's alignment with clinical consensus statements in foot and ankle surgery.</p><p><strong>Methods: </strong>Clinical consensus statements from the American College of Foot and Ankle Surgeons (ACFAS; 2015-2022) were collected and rated by ChatGPT-o1 as being inappropriate, neither appropriate nor inappropriate, and appropriate. Ten repetitions of the statements were entered into ChatGPT-o1 in a random order, and the model was prompted to assign a corresponding rating. The AI-generated scores were compared to the expert panel's ratings, and intra-rater analysis was performed.</p><p><strong>Results: </strong>The analysis of 9 clinical consensus documents and 129 statements revealed an overall Cohen's kappa of 0.29 (95% CI: 0.12, 0.46), indicating fair alignment between expert panelists and ChatGPT. Overall, ankle arthritis and heel pain showed the highest concordance at 100%, while flatfoot exhibited the lowest agreement at 25%, reflecting variability between ChatGPT and expert panelists. Among the ChatGPT ratings, Cohen's kappa values ranged from 0.41 to 0.92, highlighting variability in internal reliability across topics.</p><p><strong>Conclusion: </strong>ChatGPT achieved overall fair agreement and demonstrated variable consistency when repetitively rating ACFAS expert panel clinical practice guidelines representing a variety of topics. These data reflect the need for further study of the causes, impacts, and solutions for this disparity between intelligence and human intelligence.</p><p><strong>Level of evidence: </strong>Level IV: Retrospective cohort study.</p>","PeriodicalId":73046,"journal":{"name":"Foot & ankle specialist","volume":" ","pages":"19386400251319567"},"PeriodicalIF":0.0000,"publicationDate":"2025-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Foot & ankle specialist","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1177/19386400251319567","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Introduction: As artificial intelligence (AI) becomes increasingly integrated into medicine and surgery, its applications are expanding rapidly, from aiding clinical documentation to providing patient information. However, its role in medical decision-making remains uncertain. This study evaluates an AI language model's alignment with clinical consensus statements in foot and ankle surgery.
Methods: Clinical consensus statements published by the American College of Foot and Ankle Surgeons (ACFAS; 2015-2022) were collected and rated by ChatGPT-o1 as inappropriate, neither appropriate nor inappropriate, or appropriate. The full set of statements was entered into ChatGPT-o1 ten times, each time in random order, and the model was prompted to assign a rating to each statement. The AI-generated ratings were compared with the expert panel's ratings, and intra-rater analysis of the repeated ChatGPT responses was performed.
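The abstract does not name the statistical software used. As a minimal sketch of the agreement analysis described here (ChatGPT versus the expert panel, and ChatGPT run versus run), the kappa calculations could be reproduced along the following lines; the library choice, variable names, and example ratings are assumptions for illustration, not the authors' actual pipeline.

```python
# Illustrative sketch only; scikit-learn, the variable names, and the
# example ratings below are assumptions, not the study's analysis code.
from sklearn.metrics import cohen_kappa_score

# Hypothetical three-level ratings for five consensus statements
panel = ["appropriate", "appropriate", "inappropriate", "neither", "appropriate"]
chatgpt_run1 = ["appropriate", "neither", "inappropriate", "neither", "appropriate"]
chatgpt_run2 = ["appropriate", "appropriate", "inappropriate", "neither", "appropriate"]

# Inter-rater agreement: ChatGPT vs. the ACFAS expert panel
kappa_vs_panel = cohen_kappa_score(panel, chatgpt_run1)

# Intra-rater consistency: one ChatGPT repetition vs. another
kappa_run_to_run = cohen_kappa_score(chatgpt_run1, chatgpt_run2)

print(f"ChatGPT vs. panel kappa:  {kappa_vs_panel:.2f}")
print(f"ChatGPT run-to-run kappa: {kappa_run_to_run:.2f}")
```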
Results: The analysis of 9 clinical consensus documents and 129 statements revealed an overall Cohen's kappa of 0.29 (95% CI: 0.12, 0.46), indicating fair agreement between expert panelists and ChatGPT. By topic, ankle arthritis and heel pain showed the highest concordance (100%), while flatfoot showed the lowest (25%), reflecting variability between ChatGPT and expert panelists. Across ChatGPT's repeated ratings, intra-rater Cohen's kappa values ranged from 0.41 to 0.92, indicating variable internal reliability across topics.
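For readers unfamiliar with the statistic, Cohen's kappa corrects raw percent agreement for agreement expected by chance; the "fair" label for kappa = 0.29 is consistent with the commonly used Landis and Koch benchmarks (0.21-0.40 = fair), which the abstract itself does not cite.

```latex
\[
  \kappa = \frac{p_o - p_e}{1 - p_e}
\]
% p_o: observed proportion of agreement between raters
% p_e: agreement expected by chance from the marginal rating frequencies
```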
Conclusion: ChatGPT achieved overall fair agreement and demonstrated variable consistency when repetitively rating ACFAS expert panel clinical consensus statements spanning a variety of topics. These data highlight the need for further study of the causes, impacts, and solutions for this disparity between artificial intelligence and human intelligence.
Level of evidence: Level IV: Retrospective cohort study.