Testing theory of mind in large language models and humans

Nature Human Behaviour | Impact factor: 21.4 | CAS Zone 1 (Psychology) | JCR Q1 (Multidisciplinary Sciences) | Publication date: 2024-05-20 | DOI: 10.1038/s41562-024-01882-z
James W. A. Strachan, Dalila Albergo, Giulia Borghini, Oriana Pansardi, Eugenio Scaliti, Saurabh Gupta, Krati Saxena, Alessandro Rufo, Stefano Panzeri, Guido Manzi, Michael S. A. Graziano, Cristina Becchio
{"title":"Testing theory of mind in large language models and humans","authors":"James W. A. Strachan, Dalila Albergo, Giulia Borghini, Oriana Pansardi, Eugenio Scaliti, Saurabh Gupta, Krati Saxena, Alessandro Rufo, Stefano Panzeri, Guido Manzi, Michael S. A. Graziano, Cristina Becchio","doi":"10.1038/s41562-024-01882-z","DOIUrl":null,"url":null,"abstract":"At the core of what defines us as humans is the concept of theory of mind: the ability to track other people’s mental states. The recent development of large language models (LLMs) such as ChatGPT has led to intense debate about the possibility that these models exhibit behaviour that is indistinguishable from human behaviour in theory of mind tasks. Here we compare human and LLM performance on a comprehensive battery of measurements that aim to measure different theory of mind abilities, from understanding false beliefs to interpreting indirect requests and recognizing irony and faux pas. We tested two families of LLMs (GPT and LLaMA2) repeatedly against these measures and compared their performance with those from a sample of 1,907 human participants. Across the battery of theory of mind tests, we found that GPT-4 models performed at, or even sometimes above, human levels at identifying indirect requests, false beliefs and misdirection, but struggled with detecting faux pas. Faux pas, however, was the only test where LLaMA2 outperformed humans. Follow-up manipulations of the belief likelihood revealed that the superiority of LLaMA2 was illusory, possibly reflecting a bias towards attributing ignorance. By contrast, the poor performance of GPT originated from a hyperconservative approach towards committing to conclusions rather than from a genuine failure of inference. These findings not only demonstrate that LLMs exhibit behaviour that is consistent with the outputs of mentalistic inference in humans but also highlight the importance of systematic testing to ensure a non-superficial comparison between human and artificial intelligences. Testing two families of large language models (LLMs) (GPT and LLaMA2) on a battery of measurements spanning different theory of mind abilities, Strachan et al. find that the performance of LLMs can mirror that of humans on most of these tasks. The authors explored potential reasons for this.","PeriodicalId":19074,"journal":{"name":"Nature Human Behaviour","volume":null,"pages":null},"PeriodicalIF":21.4000,"publicationDate":"2024-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11272575/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Nature Human Behaviour","FirstCategoryId":"102","ListUrlMain":"https://www.nature.com/articles/s41562-024-01882-z","RegionNum":1,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
Citations: 0

Abstract

At the core of what defines us as humans is the concept of theory of mind: the ability to track other people’s mental states. The recent development of large language models (LLMs) such as ChatGPT has led to intense debate about the possibility that these models exhibit behaviour that is indistinguishable from human behaviour in theory of mind tasks. Here we compare human and LLM performance on a comprehensive battery of measurements that aim to measure different theory of mind abilities, from understanding false beliefs to interpreting indirect requests and recognizing irony and faux pas. We tested two families of LLMs (GPT and LLaMA2) repeatedly against these measures and compared their performance with those from a sample of 1,907 human participants. Across the battery of theory of mind tests, we found that GPT-4 models performed at, or even sometimes above, human levels at identifying indirect requests, false beliefs and misdirection, but struggled with detecting faux pas. Faux pas, however, was the only test where LLaMA2 outperformed humans. Follow-up manipulations of the belief likelihood revealed that the superiority of LLaMA2 was illusory, possibly reflecting a bias towards attributing ignorance. By contrast, the poor performance of GPT originated from a hyperconservative approach towards committing to conclusions rather than from a genuine failure of inference. These findings not only demonstrate that LLMs exhibit behaviour that is consistent with the outputs of mentalistic inference in humans but also highlight the importance of systematic testing to ensure a non-superficial comparison between human and artificial intelligences. Testing two families of large language models (LLMs) (GPT and LLaMA2) on a battery of measurements spanning different theory of mind abilities, Strachan et al. find that the performance of LLMs can mirror that of humans on most of these tasks. The authors explored potential reasons for this.
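The abstract describes a repeated-administration protocol: each model is given the same test items many times and its score distribution is compared with that of a human sample. As a rough, hypothetical illustration of that kind of procedure (not the authors' code or materials), the Python sketch below administers a single invented false-belief item to a stubbed-out model several times and compares the resulting accuracy with a toy human baseline. The example item, the model_answer() stub and the baseline numbers are all placeholders introduced for this sketch.

```python
# Hypothetical sketch of repeated test administration and scoring.
# Nothing here reproduces the paper's battery, prompts or scoring rubrics.

import random
import statistics

# One illustrative false-belief item (invented for this sketch).
ITEM = {
    "story": ("Sally puts her ball in the basket and leaves. "
              "Anne moves the ball to the box. Sally returns."),
    "question": "Where will Sally look for her ball?",
    "correct": "basket",
}

def model_answer(story: str, question: str) -> str:
    """Stand-in for a call to an LLM (via an API or local weights).

    It guesses at random here so that the sketch runs without any model.
    """
    return random.choice(["basket", "box"])

def score_run(item: dict) -> int:
    """Score one administration of the item: 1 if correct, 0 otherwise."""
    answer = model_answer(item["story"], item["question"])
    return int(item["correct"] in answer.lower())

def repeated_test(item: dict, n_runs: int = 15) -> list[int]:
    """Administer the same item n_runs times, since LLM outputs are stochastic."""
    return [score_run(item) for _ in range(n_runs)]

if __name__ == "__main__":
    llm_scores = repeated_test(ITEM)
    human_scores = [1] * 18 + [0] * 2  # toy human baseline, not real data

    print(f"LLM accuracy:   {statistics.mean(llm_scores):.2f}")
    print(f"Human accuracy: {statistics.mean(human_scores):.2f}")
```

Repeating each item matters because a single response from a sampled model says little about its typical behaviour; aggregating over runs, as the study does across its test battery, gives a score that can meaningfully be compared against a human sample.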

Source journal: Nature Human Behaviour
Category: Psychology - Social Psychology
CiteScore: 36.80
Self-citation rate: 1.00%
Articles published: 227
Journal description: Nature Human Behaviour publishes research of outstanding significance into any aspect of human behaviour, spanning its psychological, biological and social bases, as well as the origins, development and disorders of behaviour. The journal's primary aim is to increase the visibility of research in the field and to enhance its societal reach and impact.