Do Multimodal Large Language Models and Humans Ground Language Similarly?

Cited: 0
Authors
Jones, Cameron R. [1 ]
Bergen, Benjamin [1 ]
Trott, Sean [1 ]
Affiliations
[1] Univ Calif San Diego, Dept Cognit Sci, San Diego, CA 92093 USA
Keywords
REPRESENTATION; ORIENTATION; EMBODIMENT; MOTOR;
DOI
10.1162/coli_a_00531
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Large Language Models (LLMs) have been criticized for failing to connect linguistic meaning to the world, that is, for failing to solve the "symbol grounding problem." Multimodal Large Language Models (MLLMs) offer a potential solution to this challenge by combining linguistic representations and processing with other modalities. However, much is still unknown about exactly how and to what degree MLLMs integrate their distinct modalities, and whether the way they do so mirrors the mechanisms believed to underpin grounding in humans. In humans, it has been hypothesized that linguistic meaning is grounded through "embodied simulation": the activation of sensorimotor and affective representations reflecting described experiences. Across four pre-registered studies, we adapt experimental techniques originally developed to investigate embodied simulation in human comprehenders to ask whether MLLMs are sensitive to sensorimotor features that are implied but not explicit in descriptions of an event. In Experiment 1, we find sensitivity to some features (color and shape) but not others (size, orientation, and volume). In Experiment 2, we identify likely bottlenecks to explain an MLLM's lack of sensitivity. In Experiment 3, we find that despite sensitivity to implicit sensorimotor features, MLLMs cannot fully account for human behavior on the same task. Finally, in Experiment 4, we compare the psychometric predictive power of different MLLM architectures and find that ViLT, a single-stream architecture, is more predictive of human responses to one sensorimotor feature (shape) than CLIP, a dual-encoder architecture, despite being trained on orders of magnitude less data. These results reveal strengths and limitations in the ability of current MLLMs to integrate language with other modalities, and also shed light on the likely mechanisms underlying human language comprehension.
Pages: 1415 - 1440
Number of pages: 26