Age against the machine—susceptibility of large language models to cognitive impairment: cross sectional analysis

The BMJ. Published: 2024-12-20. DOI: 10.1136/bmj-2024-081948
Roy Dayan, Benjamin Uliel, Gal Koplewitz

Abstract

Objective: To evaluate the cognitive abilities of the leading large language models and identify their susceptibility to cognitive impairment, using the Montreal Cognitive Assessment (MoCA) and additional tests.
Design: Cross sectional analysis.
Setting: Online interaction with large language models via text based prompts.
Participants: Publicly available large language models, or “chatbots”: ChatGPT versions 4 and 4o (developed by OpenAI), Claude 3.5 “Sonnet” (developed by Anthropic), and Gemini versions 1 and 1.5 (developed by Alphabet).
Assessments: The MoCA test (version 8.1) was administered to the leading large language models with instructions identical to those given to human patients. Scoring followed official guidelines and was evaluated by a practising neurologist. Additional assessments included the Navon figure, cookie theft picture, Poppelreuter figure, and Stroop test.
Main outcome measures: MoCA scores, performance in visuospatial/executive tasks, and Stroop test results.
Results: ChatGPT 4o achieved the highest score on the MoCA test (26/30), followed by ChatGPT 4 and Claude (25/30), with Gemini 1.0 scoring lowest (16/30). All large language models showed poor performance in visuospatial/executive tasks. Gemini models failed at the delayed recall task. Only ChatGPT 4o succeeded in the incongruent stage of the Stroop test.
Conclusions: With the exception of ChatGPT 4o, almost all large language models subjected to the MoCA test showed signs of mild cognitive impairment. Moreover, as in humans, age is a key determinant of cognitive decline: “older” chatbots, like older patients, tend to perform worse on the MoCA test. These findings challenge the assumption that artificial intelligence will soon replace human doctors, as the cognitive impairment evident in leading chatbots may affect their reliability in medical diagnostics and undermine patients’ confidence.
No additional data available.
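As a rough illustration of the reported results, the sketch below lists which models fall under a MoCA cutoff. It is not part of the study. The cutoff of 26 is an assumption taken from the conventional MoCA threshold (26 or above counted as normal), which the abstract does not quote, and the names `MOCA_SCORES`, `NORMAL_CUTOFF`, and `below_cutoff` are hypothetical. Gemini 1.5 is left out because the abstract does not state its score.

```python
# Scores reported in the abstract (out of 30). Gemini 1.5 is omitted because
# the abstract does not state its score.
MOCA_SCORES = {
    "ChatGPT 4o": 26,
    "ChatGPT 4": 25,
    "Claude 3.5 Sonnet": 25,
    "Gemini 1.0": 16,
}

# Assumption: the conventional MoCA cutoff, where 26 or above counts as normal.
# The abstract itself does not quote this threshold.
NORMAL_CUTOFF = 26


def below_cutoff(scores, cutoff=NORMAL_CUTOFF):
    """Return the names whose score falls below the cutoff, lowest score first."""
    flagged = [name for name, score in scores.items() if score < cutoff]
    return sorted(flagged, key=lambda name: scores[name])


if __name__ == "__main__":
    for name in below_cutoff(MOCA_SCORES):
        print(f"{name}: {MOCA_SCORES[name]}/30 (below cutoff of {NORMAL_CUTOFF})")
```

Under this assumed cutoff, only ChatGPT 4o is not flagged, which matches the abstract's statement that it was the exception.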