Artificial intelligence versus clinical judgement: how accurately do generative models reflect CNS guidelines for Chiari malformation?
David Shin, Hyunah Park, Isabel Shaffrey, Vahe Yacoubian, Taha M. Taka, Justin Dye, Olumide Danisa
Clinical Neurology and Neurosurgery, Volume 248, Article 108662 (published online 2024-11-26). DOI: 10.1016/j.clineuro.2024.108662
Abstract
Objective
This study evaluated how accurately generative artificial intelligence (AI) models responded to questions based on the recommendations of the 2023 Congress of Neurological Surgeons (CNS) guidelines for Chiari 1 malformation, and assessed the readability of their responses.
Methods
Thirteen questions were generated from the CNS guidelines and posed to Perplexity, ChatGPT 4o, Microsoft Copilot, and Google Gemini. AI answers were divided into two categories, “concordant” and “non-concordant,” according to their alignment with current CNS guidelines. Non-concordant answers were sub-categorized as “insufficient” or “over-conclusive.” Responses were evaluated for readability via the Flesch-Kincaid Grade Level, Gunning Fog Index, SMOG (Simple Measure of Gobbledygook) Index, and Flesch Reading Ease test.
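The readability indices named above are computed from simple text statistics (average sentence length, syllables per word, and the share of polysyllabic words). The minimal Python sketch below uses the standard published formulas; the regex-based syllable heuristic is an illustrative assumption, and the scoring tools actually used in the study may differ.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of vowels. Dedicated readability tools
    # typically use dictionary-based syllable counts instead.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)

    wps = n_words / sentences   # average words per sentence
    spw = syllables / n_words   # average syllables per word

    return {
        # Standard published formulas for each index:
        "Flesch-Kincaid Grade Level": 0.39 * wps + 11.8 * spw - 15.59,
        "Flesch Reading Ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "Gunning Fog Index": 0.4 * (wps + 100 * complex_words / n_words),
        # SMOG is normally applied to samples of 30+ sentences.
        "SMOG Index": 1.043 * (30 * complex_words / sentences) ** 0.5 + 3.1291,
    }

# Arbitrary example sentence, not taken from the study's question set.
sample = ("Chiari malformation type 1 is characterized by descent of the "
          "cerebellar tonsils below the foramen magnum.")
print(readability(sample))
```

Higher Grade Level, Fog, and SMOG scores and lower Reading Ease scores indicate harder text, which is the direction of the findings reported in the Results below.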
Results
Perplexity displayed the highest concordance rate of 69.2 %, with non-concordant responses classified as 0 % insufficient and 30.8 % over-conclusive. ChatGPT 4o had the lowest concordance rate at 23.1 %, with 0 % insufficient and 76.9 % over-conclusive classifications. Copilot showed a 61.5 % concordance rate, with 7.7 % insufficient and 30.8 % over-conclusive. Gemini demonstrated a 30.8 % concordance rate, with 7.7 % insufficient and 61.5 % over-conclusive. Flesch-Kincaid Grade Level scores ranged from 14.48 (Gemini) to 16.48 (Copilot); Gunning Fog Index scores varied between 16.18 (Gemini) and 18.8 (Copilot); SMOG Index scores ranged from 16 (Gemini) to 17.54 (Copilot); and Flesch Reading Ease scores were low across all models, with Gemini showing the highest mean score of 21.3.
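For scale, the concordance percentages correspond to whole-question counts out of the thirteen questions; the counts in the short sketch below are inferred from the reported percentages rather than stated in the abstract.

```python
# Implied concordant-answer counts out of 13 questions (inferred from the
# reported percentages, not reported directly in the abstract).
implied_counts = {"Perplexity": 9, "Copilot": 8, "Gemini": 4, "ChatGPT 4o": 3}
for model, concordant in implied_counts.items():
    print(f"{model}: {concordant}/13 = {100 * concordant / 13:.1f} %")
# Perplexity: 9/13 = 69.2 %, Copilot: 8/13 = 61.5 %,
# Gemini: 4/13 = 30.8 %, ChatGPT 4o: 3/13 = 23.1 %
```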
Conclusion
Perplexity and Copilot emerged as the best-performing models for concordance, while ChatGPT and Gemini displayed the highest over-conclusive rates. All responses exhibited high linguistic complexity and poor readability. While AI can be valuable in certain aspects of clinical practice, the low concordance rates indicate that AI should not replace clinician judgement.
Journal introduction:
Clinical Neurology and Neurosurgery is devoted to publishing papers and reports on the clinical aspects of neurology and neurosurgery. It is an international forum for papers of high scientific standard that are of interest to Neurologists and Neurosurgeons world-wide.