Do Multimodal Large Language Models and Humans Ground Language Similarly?

Cited: 0
Authors
Jones, Cameron R. [1 ]
Bergen, Benjamin [1 ]
Trott, Sean [1 ]
Affiliations
[1] Univ Calif San Diego, Dept Cognit Sci, San Diego, CA 92093 USA
Keywords
REPRESENTATION; ORIENTATION; EMBODIMENT; MOTOR;
DOI
10.1162/coli_a_00531
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Large Language Models (LLMs) have been criticized for failing to connect linguistic meaning to the world, that is, for failing to solve the "symbol grounding problem." Multimodal Large Language Models (MLLMs) offer a potential solution to this challenge by combining linguistic representations and processing with other modalities. However, much is still unknown about exactly how and to what degree MLLMs integrate their distinct modalities, and whether the way they do so mirrors the mechanisms believed to underpin grounding in humans. In humans, it has been hypothesized that linguistic meaning is grounded through "embodied simulation": the activation of sensorimotor and affective representations reflecting described experiences. Across four pre-registered studies, we adapt experimental techniques originally developed to investigate embodied simulation in human comprehenders to ask whether MLLMs are sensitive to sensorimotor features that are implied but not explicit in descriptions of an event. In Experiment 1, we find sensitivity to some features (color and shape) but not others (size, orientation, and volume). In Experiment 2, we identify likely bottlenecks to explain an MLLM's lack of sensitivity. In Experiment 3, we find that despite sensitivity to implicit sensorimotor features, MLLMs cannot fully account for human behavior on the same task. Finally, in Experiment 4, we compare the psychometric predictive power of different MLLM architectures and find that ViLT, a single-stream architecture, is more predictive of human responses to one sensorimotor feature (shape) than CLIP, a dual-encoder architecture, despite being trained on orders of magnitude less data. These results reveal strengths and limitations in the ability of current MLLMs to integrate language with other modalities, and also shed light on the likely mechanisms underlying human language comprehension.
Pages: 1415 - 1440
Number of pages: 26