Do Multimodal Large Language Models and Humans Ground Language Similarly?

Authors
Jones, Cameron R. [1]
Bergen, Benjamin [1]
Trott, Sean [1]
Affiliation
[1] Univ Calif San Diego, Dept Cognit Sci, San Diego, CA 92093 USA
Keywords
REPRESENTATION; ORIENTATION; EMBODIMENT; MOTOR;
DOI
10.1162/coli_a_00531
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Large Language Models (LLMs) have been criticized for failing to connect linguistic meaning to the world, that is, for failing to solve the "symbol grounding problem." Multimodal Large Language Models (MLLMs) offer a potential solution to this challenge by combining linguistic representations and processing with other modalities. However, much is still unknown about exactly how and to what degree MLLMs integrate their distinct modalities, and whether the way they do so mirrors the mechanisms believed to underpin grounding in humans. In humans, it has been hypothesized that linguistic meaning is grounded through "embodied simulation," the activation of sensorimotor and affective representations reflecting described experiences. Across four pre-registered studies, we adapt experimental techniques originally developed to investigate embodied simulation in human comprehenders to ask whether MLLMs are sensitive to sensorimotor features that are implied but not explicit in descriptions of an event. In Experiment 1, we find sensitivity to some features (color and shape) but not others (size, orientation, and volume). In Experiment 2, we identify likely bottlenecks to explain an MLLM's lack of sensitivity. In Experiment 3, we find that despite sensitivity to implicit sensorimotor features, MLLMs cannot fully account for human behavior on the same task. Finally, in Experiment 4, we compare the psychometric predictive power of different MLLM architectures and find that ViLT, a single-stream architecture, is more predictive of human responses to one sensorimotor feature (shape) than CLIP, a dual-encoder architecture, despite being trained on orders of magnitude less data. These results reveal strengths and limitations in the ability of current MLLMs to integrate language with other modalities, and also shed light on the likely mechanisms underlying human language comprehension.
Pages: 1415-1440
Page count: 26