Evaluating cognitive performance: Traditional methods vs. ChatGPT.

IF 2.9 · JCR Q2 (HEALTH CARE SCIENCES & SERVICES) · CAS Tier 3 (Medicine) · DIGITAL HEALTH · Pub Date: 2024-08-16 · eCollection Date: 2024-01-01 · DOI: 10.1177/20552076241264639
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11329975/pdf/
Xiao Fei, Ying Tang, Jianan Zhang, Zhongkai Zhou, Ikuo Yamamoto, Yi Zhang
{"title":"Evaluating cognitive performance: Traditional methods vs. ChatGPT.","authors":"Xiao Fei, Ying Tang, Jianan Zhang, Zhongkai Zhou, Ikuo Yamamoto, Yi Zhang","doi":"10.1177/20552076241264639","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>NLP models like ChatGPT promise to revolutionize text-based content delivery, particularly in medicine. Yet, doubts remain about ChatGPT's ability to reliably support evaluations of cognitive performance, warranting further investigation into its accuracy and comprehensiveness in this area.</p><p><strong>Method: </strong>A cohort of 60 cognitively normal individuals and 30 stroke survivors underwent a comprehensive evaluation, covering memory, numerical processing, verbal fluency, and abstract thinking. Healthcare professionals and NLP models GPT-3.5 and GPT-4 conducted evaluations following established standards. Scores were compared, and efforts were made to refine scoring protocols and interaction methods to enhance ChatGPT's potential in these evaluations.</p><p><strong>Result: </strong>Within the cohort of healthy participants, the utilization of GPT-3.5 revealed significant disparities in memory evaluation compared to both physician-led assessments and those conducted utilizing GPT-4 (<i>P</i> < 0.001). Furthermore, within the domain of memory evaluation, GPT-3.5 exhibited discrepancies in 8 out of 21 specific measures when compared to assessments conducted by physicians (<i>P</i> < 0.05). Additionally, GPT-3.5 demonstrated statistically significant deviations from physician assessments in speech evaluation (<i>P</i> = 0.009). Among participants with a history of stroke, GPT-3.5 exhibited differences solely in verbal assessment compared to physician-led evaluations (<i>P</i> = 0.002). Notably, through the implementation of optimized scoring methodologies and refinement of interaction protocols, partial mitigation of these disparities was achieved.</p><p><strong>Conclusion: </strong>ChatGPT can produce evaluation outcomes comparable to traditional methods. Despite differences from physician evaluations, refinement of scoring algorithms and interaction protocols has improved alignment. ChatGPT performs well even in populations with specific conditions like stroke, suggesting its versatility. GPT-4 yields results closer to physician ratings, indicating potential for further enhancement. These findings highlight ChatGPT's importance as a supplementary tool, offering new avenues for information gathering in medical fields and guiding its ongoing development and application.</p>","PeriodicalId":51333,"journal":{"name":"DIGITAL HEALTH","volume":null,"pages":null},"PeriodicalIF":2.9000,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11329975/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"DIGITAL HEALTH","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1177/20552076241264639","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Citations: 0

Abstract

Background: NLP models like ChatGPT promise to revolutionize text-based content delivery, particularly in medicine. Yet, doubts remain about ChatGPT's ability to reliably support evaluations of cognitive performance, warranting further investigation into its accuracy and comprehensiveness in this area.

Method: A cohort of 60 cognitively normal individuals and 30 stroke survivors underwent a comprehensive evaluation covering memory, numerical processing, verbal fluency, and abstract thinking. Healthcare professionals and the NLP models GPT-3.5 and GPT-4 conducted evaluations following established standards. Scores were compared, and scoring protocols and interaction methods were refined to enhance ChatGPT's potential in these evaluations.
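
The abstract does not describe the exact interaction protocol, so the following is a rough illustration only: a Python sketch of one way a chat model could be asked to score a single verbal-fluency item against a fixed rubric, using the OpenAI SDK. The model name, rubric wording, and integer-only reply parsing are assumptions for illustration, not the authors' method.

# Hypothetical sketch: rubric-based scoring of one test item by a chat model.
# Model name, rubric text, and parsing are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "You are scoring a verbal-fluency test item. The participant had "
    "60 seconds to name animals. Award 1 point per unique valid animal. "
    "Reply with the integer score only."
)

def score_item(transcript: str, model: str = "gpt-4") -> int:
    """Ask the model for a rubric-based score of one participant transcript."""
    reply = client.chat.completions.create(
        model=model,
        temperature=0,  # keep scoring as deterministic as the API allows
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": transcript},
        ],
    )
    return int(reply.choices[0].message.content.strip())

print(score_item("dog, cat, horse, dog, eagle, salmon"))  # duplicate "dog" counts once -> 5

Constraining the model to emit a bare integer is one simple way to make machine scores directly comparable with physician scores on the same scale; the study's actual scoring-protocol refinements are not specified in the abstract.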

Result: Within the cohort of healthy participants, GPT-3.5 showed significant disparities in memory evaluation relative to both physician-led assessments and those conducted with GPT-4 (P < 0.001). Within memory evaluation specifically, GPT-3.5 diverged from physician assessments on 8 of 21 individual measures (P < 0.05), and it also deviated significantly from physician assessments in speech evaluation (P = 0.009). Among participants with a history of stroke, GPT-3.5 differed from physician-led evaluations only in verbal assessment (P = 0.002). Notably, optimized scoring methodologies and refined interaction protocols partially mitigated these disparities.
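
The abstract reports P values but not which statistical test was used or whether a multiple-comparison correction was applied across the 21 memory measures. Assuming paired per-participant scores from a physician and from GPT-3.5 on a bounded rating scale, a non-parametric Wilcoxon signed-rank test, as sketched below, would be one conventional way to obtain such a figure; the score arrays are illustrative placeholders, not data from the study.

# Hypothetical sketch: paired comparison of physician vs. GPT-3.5 scores
# for the same participants. Scores are illustrative placeholders.
from scipy.stats import wilcoxon

physician = [27, 25, 29, 22, 30, 26, 24, 28, 23, 27]  # e.g. one memory subscore
gpt35     = [24, 23, 27, 20, 28, 23, 22, 25, 21, 24]

stat, p = wilcoxon(physician, gpt35)  # tests the paired differences
print(f"Wilcoxon statistic = {stat:.1f}, P = {p:.4f}")
if p < 0.05:
    print("Raters differ significantly on this measure.")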

Conclusion: ChatGPT can produce evaluation outcomes comparable to traditional methods. Despite differences from physician evaluations, refinement of scoring algorithms and interaction protocols has improved alignment. ChatGPT performs well even in populations with specific conditions like stroke, suggesting its versatility. GPT-4 yields results closer to physician ratings, indicating potential for further enhancement. These findings highlight ChatGPT's importance as a supplementary tool, offering new avenues for information gathering in medical fields and guiding its ongoing development and application.

Source journal: DIGITAL HEALTH
CiteScore: 2.90 · Self-citation rate: 7.70% · Articles published: 302
Latest articles in this journal:
A feasibility study on utilizing machine learning technology to reduce the costs of gastric cancer screening in Taizhou, China.
Ageing well with tech: Exploring the determinants of e-healthcare services adoption in an emerging economy.
Chinese colposcopists' attitudes toward the colposcopic artificial intelligence auxiliary diagnostic system (CAIADS): A nation-wide, multi-center survey.
Digital leadership: Norwegian healthcare managers' attitudes towards using digital tools.
Disease characteristics influence the privacy calculus to adopt electronic health records: A survey study in Germany.