Do Multimodal Large Language Models and Humans Ground Language Similarly?

Cited by: 0
Authors
Jones, Cameron R. [1 ]
Bergen, Benjamin [1 ]
Trott, Sean [1 ]
Affiliations
[1] Univ Calif San Diego, Dept Cognit Sci, San Diego, CA 92093 USA
Keywords
REPRESENTATION; ORIENTATION; EMBODIMENT; MOTOR
DOI
10.1162/coli_a_00531
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Large Language Models (LLMs) have been criticized for failing to connect linguistic meaning to the world, that is, for failing to solve the "symbol grounding problem." Multimodal Large Language Models (MLLMs) offer a potential solution to this challenge by combining linguistic representations and processing with other modalities. However, much is still unknown about exactly how and to what degree MLLMs integrate their distinct modalities, and whether the way they do so mirrors the mechanisms believed to underpin grounding in humans. In humans, it has been hypothesized that linguistic meaning is grounded through "embodied simulation," the activation of sensorimotor and affective representations reflecting described experiences. Across four pre-registered studies, we adapt experimental techniques originally developed to investigate embodied simulation in human comprehenders to ask whether MLLMs are sensitive to sensorimotor features that are implied but not explicit in descriptions of an event. In Experiment 1, we find sensitivity to some features (color and shape) but not others (size, orientation, and volume). In Experiment 2, we identify likely bottlenecks to explain an MLLM's lack of sensitivity. In Experiment 3, we find that despite sensitivity to implicit sensorimotor features, MLLMs cannot fully account for human behavior on the same task. Finally, in Experiment 4, we compare the psychometric predictive power of different MLLM architectures and find that ViLT, a single-stream architecture, is more predictive of human responses to one sensorimotor feature (shape) than CLIP, a dual-encoder architecture, despite being trained on orders of magnitude less data. These results reveal strengths and limitations in the ability of current MLLMs to integrate language with other modalities, and also shed light on the likely mechanisms underlying human language comprehension.
Pages: 1415-1440
Number of pages: 26
Related Papers (10 of 50 shown)
  • [1] Large language models can segment narrative events similarly to humans
    Michelmann, Sebastian; Kumar, Manoj; Norman, Kenneth A.; Toneva, Mariya
    BEHAVIOR RESEARCH METHODS, 2025, 57 (01)
  • [2] Do multimodal large language models understand welding?
    Khvatskii, Grigorii; Lee, Yong Suk; Angst, Corey; Gibbs, Maria; Landers, Robert; Chawla, Nitesh V.
    INFORMATION FUSION, 2025, 120
  • [3] Do Large Language Models Know What Humans Know?
    Trott, Sean; Jones, Cameron; Chang, Tyler; Michaelov, James; Bergen, Benjamin
    COGNITIVE SCIENCE, 2023, 47 (07)
  • [4] A survey on multimodal large language models
    Yin, Shukang; Fu, Chaoyou; Zhao, Sirui; Li, Ke; Sun, Xing; Xu, Tong; Chen, Enhong
    NATIONAL SCIENCE REVIEW, 2024, 11 (12): 277-296
  • [5] The Language of Creativity: Evidence from Humans and Large Language Models
    Orwig, William; Edenbaum, Emma R.; Greene, Joshua D.; Schacter, Daniel L.
    JOURNAL OF CREATIVE BEHAVIOR, 2024, 58 (01): 128-136
  • [6] Multimodal Large Language Models in Vision and Ophthalmology
    Lu, Zhiyong
    INVESTIGATIVE OPHTHALMOLOGY & VISUAL SCIENCE, 2024, 65 (07)
  • [7] The application of multimodal large language models in medicine
    Qiu, Jianing; Yuan, Wu; Lam, Kyle
    LANCET REGIONAL HEALTH-WESTERN PACIFIC, 2024, 45
  • [8] Visual cognition in multimodal large language models
    Buschoff, Luca M. Schulze; Akata, Elif; Bethge, Matthias; Schulz, Eric
    NATURE MACHINE INTELLIGENCE, 2025, 7 (01): 96-106
  • [9] Multimodal large language models for bioimage analysis
    Zhang, Shanghang; Dai, Gaole; Huang, Tiejun; Chen, Jianxu
    NATURE METHODS, 2024, 21 (08): 1390-1393