Do Multimodal Large Language Models and Humans Ground Language Similarly?

Computational Linguistics · IF 9.3 · CAS Tier 2 (Computer Science) · Pub Date: 2024-07-30 · DOI: 10.1162/coli_a_00531
Cameron Jones, Benjamin Bergen, Sean Trott
{"title":"Do Multimodal Large Language Models and Humans Ground Language Similarly?","authors":"Cameron Jones, Benjamin Bergen, Sean Trott","doi":"10.1162/coli_a_00531","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) have been criticized for failing to connect linguistic meaning to the world—for failing to solve the “symbol grounding problem.” Multimodal Large Language Models (MLLMs) offer a potential solution to this challenge by combining linguistic representations and processing with other modalities. However, much is still unknown about exactly how and to what degree MLLMs integrate their distinct modalities—and whether the way they do so mirrors the mechanisms believed to underpin grounding in humans. In humans, it has been hypothesized that linguistic meaning is grounded through “embodied simulation,” the activation of sensorimotor and affective representations reflecting described experiences. Across four pre-registered studies, we adapt experimental techniques originally developed to investigate embodied simulation in human comprehenders to ask whether MLLMs are sensitive to sensorimotor features that are implied but not explicit in descriptions of an event. In Experiment 1, we find sensitivity to some features (color and shape) but not others (size, orientation, and volume). In Experiment 2, we identify likely bottlenecks to explain an MLLM’s lack of sensitivity. In Experiment 3, we find that despite sensitivity to implicit sensorimotor features, MLLMs cannot fully account for human behavior on the same task. Finally, in Experiment 4, we compare the psychometric predictive power of different MLLM architectures and find that ViLT, a single-stream architecture, is more predictive of human responses to one sensorimotor feature (shape) than CLIP, a dual-encoder architecture—despite being trained on orders of magnitude less data. These results reveal strengths and limitations in the ability of current MLLMs to integrate language with other modalities, and also shed light on the likely mechanisms underlying human language comprehension.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":null,"pages":null},"PeriodicalIF":9.3000,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational Linguistics","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1162/coli_a_00531","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Large Language Models (LLMs) have been criticized for failing to connect linguistic meaning to the world—for failing to solve the “symbol grounding problem.” Multimodal Large Language Models (MLLMs) offer a potential solution to this challenge by combining linguistic representations and processing with other modalities. However, much is still unknown about exactly how and to what degree MLLMs integrate their distinct modalities—and whether the way they do so mirrors the mechanisms believed to underpin grounding in humans. In humans, it has been hypothesized that linguistic meaning is grounded through “embodied simulation,” the activation of sensorimotor and affective representations reflecting described experiences. Across four pre-registered studies, we adapt experimental techniques originally developed to investigate embodied simulation in human comprehenders to ask whether MLLMs are sensitive to sensorimotor features that are implied but not explicit in descriptions of an event. In Experiment 1, we find sensitivity to some features (color and shape) but not others (size, orientation, and volume). In Experiment 2, we identify likely bottlenecks to explain an MLLM’s lack of sensitivity. In Experiment 3, we find that despite sensitivity to implicit sensorimotor features, MLLMs cannot fully account for human behavior on the same task. Finally, in Experiment 4, we compare the psychometric predictive power of different MLLM architectures and find that ViLT, a single-stream architecture, is more predictive of human responses to one sensorimotor feature (shape) than CLIP, a dual-encoder architecture—despite being trained on orders of magnitude less data. These results reveal strengths and limitations in the ability of current MLLMs to integrate language with other modalities, and also shed light on the likely mechanisms underlying human language comprehension.
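To make the probing paradigm concrete, below is a minimal sketch of one way sensitivity to implied sensorimotor features can be measured with an off-the-shelf dual-encoder MLLM: score a sentence against two images that differ only in a feature the sentence implies but never states. The checkpoint name, image filenames, and the eagle example (adapted from the sentence-picture verification literature on embodied simulation) are illustrative assumptions; the paper's actual stimuli and scoring procedure may differ.

```python
# Sketch of an implicit-feature sensitivity probe with CLIP, assuming
# the Hugging Face transformers CLIP API and two hypothetical local
# image files. Not the paper's exact procedure.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# The sentence implies, but never states, a shape feature: an eagle
# "in the sky" has outstretched wings; one in its nest has folded wings.
sentence = "He saw the eagle in the sky."
image_paths = [
    "eagle_wings_out.jpg",     # congruent with the implied shape
    "eagle_wings_folded.jpg",  # incongruent with the implied shape
]
images = [Image.open(p) for p in image_paths]

inputs = processor(text=[sentence], images=images,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    # logits_per_image has shape (num_images, num_texts) = (2, 1):
    # one image-sentence match score per image.
    logits = model(**inputs).logits_per_image.squeeze(-1)

# Sensitivity = a higher match score for the shape-congruent image
# than for the incongruent one.
congruent, incongruent = logits.tolist()
print(f"congruent: {congruent:.2f}  incongruent: {incongruent:.2f}  "
      f"sensitive: {congruent > incongruent}")
```

The same match-score paradigm can be run with a single-stream model such as ViLT (e.g., via an image-text retrieval head), which is the kind of architecture-level comparison Experiment 4 draws when asking which model's scores better predict human responses.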
Source journal
Computational Linguistics (Computer Science – Artificial Intelligence)
Self-citation rate: 0.00%
Articles per year: 45
About the journal: Computational Linguistics is the longest-running publication devoted exclusively to the computational and mathematical properties of language and the design and analysis of natural language processing systems. This highly regarded quarterly offers university and industry linguists, computational linguists, artificial intelligence and machine learning investigators, cognitive scientists, speech specialists, and philosophers the latest information about the computational aspects of all the facets of research on language.
Latest articles in this journal
Dotless Arabic text for Natural Language Processing
Humans Learn Language from Situated Communicative Interactions. What about Machines?
Exploring temporal sensitivity in the brain using multi-timescale language models: an EEG decoding study
Meaning beyond lexicality: Capturing Pseudoword Definitions with Language Models
Perception of Phonological Assimilation by Neural Speech Recognition Models