Artificial intelligence versus clinical judgement: how accurately do generative models reflect CNS guidelines for Chiari malformation?

IF 1.8 · CAS Zone 4 (Medicine) · JCR Q3 (Clinical Neurology) · Clinical Neurology and Neurosurgery · Pub Date: 2024-11-26 · DOI: 10.1016/j.clineuro.2024.108662

David Shin, Hyunah Park, Isabel Shaffrey, Vahe Yacoubian, Taha M. Taka, Justin Dye, Olumide Danisa

Clinical Neurology and Neurosurgery, volume 248, Article 108662. Publication type: Journal Article. Open access: no. Source: https://www.sciencedirect.com/science/article/pii/S0303846724005493

Citations: 0

Abstract

Objective

This study investigated the responses of generative artificial intelligence (AI) models to the questions and recommendations posed by the 2023 Congress of Neurological Surgeons (CNS) guidelines for Chiari 1 malformation, and the readability of those responses.

Methods

Thirteen questions were generated from the CNS guidelines and posed to Perplexity, ChatGPT 4o, Microsoft Copilot, and Google Gemini. AI answers were divided into two categories, "concordant" and "non-concordant," according to their alignment with current CNS guidelines. Non-concordant answers were sub-categorized as "insufficient" or "over-conclusive." Responses were evaluated for readability using the Flesch-Kincaid Grade Level, Gunning Fog Index, SMOG (Simple Measure of Gobbledygook) Index, and Flesch Reading Ease test.
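The four readability measures named above have widely published standard formulas. The sketch below applies them to pre-computed text counts; it is illustrative only, since the abstract does not say which tool or syllable-counting method the authors used. The function names and the sample counts are hypothetical.

```python
import math

# Standard published readability formulas, applied to pre-computed counts.
# Assumption: the abstract does not describe the authors' tooling, so this is
# only a sketch of how such scores are conventionally derived.

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Higher scores mean easier text; scores below 30 read as very difficult."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Approximate US school grade level needed to understand the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """complex_words: words with three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))

def smog_index(polysyllables: int, sentences: int) -> float:
    """polysyllables: words with three or more syllables, normalized to 30 sentences."""
    return 1.0430 * math.sqrt(polysyllables * 30 / sentences) + 3.1291

if __name__ == "__main__":
    # Hypothetical counts for a short passage, not data from the study.
    words, sentences, syllables, complex_words = 100, 5, 150, 10
    print(round(flesch_reading_ease(words, sentences, syllables), 2))
    print(round(flesch_kincaid_grade(words, sentences, syllables), 2))
    print(round(gunning_fog(words, sentences, complex_words), 2))
    print(round(smog_index(complex_words, sentences), 2))
```

The three grade-level indices rise with longer sentences and more polysyllabic words, while Flesch Reading Ease falls, which is why a text can score high on the former and low on the latter at the same time.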

Results

Perplexity displayed the highest concordance rate at 69.2%, with 0% of responses classified as insufficient and 30.8% as over-conclusive. ChatGPT 4o had the lowest concordance rate at 23.1%, with 0% insufficient and 76.9% over-conclusive. Copilot showed a 61.5% concordance rate, with 7.7% insufficient and 30.8% over-conclusive. Gemini demonstrated a 30.8% concordance rate, with 7.7% insufficient and 61.5% over-conclusive. Flesch-Kincaid Grade Level scores ranged from 14.48 (Gemini) to 16.48 (Copilot), Gunning Fog Index scores from 16.18 (Gemini) to 18.8 (Copilot), and SMOG Index scores from 16 (Gemini) to 17.54 (Copilot). Flesch Reading Ease scores were low across all models, with Gemini showing the highest mean score of 21.3.
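Because each model answered the same thirteen questions, the reported percentages correspond to whole-number response counts. The sketch below shows that arithmetic. The counts are inferred here from the percentages and are not stated in the abstract; the names `inferred_counts` and `as_percentages` are illustrative.

```python
TOTAL_QUESTIONS = 13  # stated in the Methods

# Counts per model as (concordant, insufficient, over-conclusive).
# Assumption: these integers are back-calculated from the reported
# percentages; the abstract itself reports only percentages.
inferred_counts = {
    "Perplexity": (9, 0, 4),
    "ChatGPT 4o": (3, 0, 10),
    "Copilot": (8, 1, 4),
    "Gemini": (4, 1, 8),
}

def as_percentages(counts, total=TOTAL_QUESTIONS):
    """Convert raw counts to percentages rounded to one decimal place."""
    return tuple(round(100 * c / total, 1) for c in counts)

if __name__ == "__main__":
    for model, counts in inferred_counts.items():
        # Every question falls into exactly one of the three categories.
        assert sum(counts) == TOTAL_QUESTIONS
        print(model, as_percentages(counts))
```

Read this way, a single response corresponds to about 7.7 percentage points, so differences between models of one or two questions produce large swings in the reported rates.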

Conclusion

Perplexity and Copilot emerged as the best-performing models for concordance, while ChatGPT and Gemini displayed the highest over-conclusive rates. All responses were highly complex and difficult to read. While AI can be valuable in certain aspects of clinical practice, the low concordance rates show that AI should not replace clinician judgement.


Source journal

Clinical Neurology and Neurosurgery (Medicine: Clinical Neurology)

CiteScore: 3.70

Self-citation rate: 5.30%

Articles published: 358

Review time: 46 days

About the journal: Clinical Neurology and Neurosurgery is devoted to publishing papers and reports on the clinical aspects of neurology and neurosurgery. It is an international forum for papers of high scientific standard that are of interest to neurologists and neurosurgeons worldwide.