Do Multimodal Large Language Models and Humans Ground Language Similarly?

Times cited: 0
Authors
Jones, Cameron R. [1 ]
Bergen, Benjamin [1 ]
Trott, Sean [1 ]
Affiliations
[1] Univ Calif San Diego, Dept Cognit Sci, San Diego, CA 92093 USA
Keywords
REPRESENTATION; ORIENTATION; EMBODIMENT; MOTOR;
DOI
10.1162/coli_a_00531
CLC (Chinese Library Classification) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Large Language Models (LLMs) have been criticized for failing to connect linguistic meaning to the world, that is, for failing to solve the "symbol grounding problem." Multimodal Large Language Models (MLLMs) offer a potential solution to this challenge by combining linguistic representations and processing with other modalities. However, much is still unknown about exactly how and to what degree MLLMs integrate their distinct modalities, and whether the way they do so mirrors the mechanisms believed to underpin grounding in humans. In humans, it has been hypothesized that linguistic meaning is grounded through "embodied simulation," the activation of sensorimotor and affective representations reflecting described experiences. Across four pre-registered studies, we adapt experimental techniques originally developed to investigate embodied simulation in human comprehenders to ask whether MLLMs are sensitive to sensorimotor features that are implied but not explicit in descriptions of an event. In Experiment 1, we find sensitivity to some features (color and shape) but not others (size, orientation, and volume). In Experiment 2, we identify likely bottlenecks to explain an MLLM's lack of sensitivity. In Experiment 3, we find that despite sensitivity to implicit sensorimotor features, MLLMs cannot fully account for human behavior on the same task. Finally, in Experiment 4, we compare the psychometric predictive power of different MLLM architectures and find that ViLT, a single-stream architecture, is more predictive of human responses to one sensorimotor feature (shape) than CLIP, a dual-encoder architecture, despite being trained on orders of magnitude less data. These results reveal strengths and limitations in the ability of current MLLMs to integrate language with other modalities, and also shed light on the likely mechanisms underlying human language comprehension.
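For readers who want a concrete sense of the probing method the abstract describes, the sketch below shows one way a single sentence-picture verification item could be scored with a dual-encoder model such as CLIP via the Hugging Face transformers API. This is a minimal illustration, not the authors' code: the checkpoint name, image file names, and example sentence are assumptions introduced here for clarity.

# Illustrative sketch (not the study's materials): score a sentence against an
# image that matches vs. mismatches an implied sensorimotor feature (shape),
# using a dual-encoder model (CLIP). Checkpoint and file names are assumed.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# The sentence implies, but never states, that the eagle's wings are outstretched.
sentence = "The ranger saw the eagle in the sky."
images = [
    Image.open("eagle_wings_outstretched.jpg"),  # consistent with the implied shape
    Image.open("eagle_wings_folded.jpg"),        # inconsistent with the implied shape
]

inputs = processor(text=[sentence], images=images, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image has shape (n_images, n_texts); higher means greater image-text similarity.
match_score, mismatch_score = outputs.logits_per_image.squeeze(-1).tolist()
print(f"match: {match_score:.3f}  mismatch: {mismatch_score:.3f}")
# Sensitivity to the implied feature would appear as match > mismatch,
# aggregated over many such items rather than judged from a single pair.

In the terms of Experiment 4, the same items could also be scored with a single-stream model such as ViLT, and the two models' scores compared against human match/mismatch responses; that comparison is what the abstract calls psychometric predictive power.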
Pages: 1415-1440
Page count: 26