Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding

Cited by: 2
Authors
Alper, Morris [1]
Fiman, Michael [1]
Averbuch-Elor, Hadar [1]
Affiliations
[1] Tel Aviv Univ, Tel Aviv, Israel
Keywords
COLOR;
DOI
10.1109/CVPR52729.2023.00655
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Most humans use visual imagination to understand and reason about language, but models such as BERT reason about language using knowledge acquired during text-only pretraining. In this work, we investigate whether vision-and-language pretraining can improve performance on text-only tasks that involve implicit visual reasoning, focusing primarily on zero-shot probing methods. We propose a suite of visual language understanding (VLU) tasks for probing the visual reasoning abilities of text encoder models, as well as various non-visual natural language understanding (NLU) tasks for comparison. We also contribute a novel zero-shot knowledge probing method, Stroop probing, for applying models such as CLIP to text-only tasks without needing a prediction head such as the masked language modelling head of models like BERT. We show that SOTA multimodally trained text encoders outperform unimodally trained text encoders on the VLU tasks while being under-performed by them on the NLU tasks, lending new context to previously mixed results regarding the NLU capabilities of multimodal models. We conclude that exposure to images during pretraining affords inherent visual reasoning knowledge that is reflected in language-only tasks that require implicit visual reasoning. Our findings bear importance in the broader context of multimodal learning, providing principled guidelines for the choice of text encoders used in such contexts.
Pages: 6778 - 6788
Page count: 11
Related papers
50 in total
  • [31] MAGVLT: Masked Generative Vision-and-Language Transformer
    Kim, Sungwoong
    Jo, Daejin
    Lee, Donghoon
    Kim, Jongmin
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23338 - 23348
  • [32] Masked Path Modeling for Vision-and-Language Navigation
    Dou, Zi-Yi
    Gao, Feng
    Peng, Nanyun
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 15255 - 15269
  • [33] Federated Learning for Vision-and-Language Grounding Problems
    Liu, Fenglin
    Wu, Xian
    Ge, Shen
    Fan, Wei
    Zou, Yuexian
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 11572 - 11579
  • [34] VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation
    Zheng, Kaizhi
    Chen, Xiaotong
    Jenkins, Odest Chadwicke
    Wang, Xin Eric
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [35] Local Slot Attention for Vision-and-Language Navigation
    Zhuang, Yifeng
    Sun, Qiang
    Fu, Yanwei
    Chen, Lifeng
    Xue, Xiangyang
    PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022, : 545 - 553
  • [36] Improved Speaker and Navigator for Vision-and-Language Navigation
    Wu, Zongkai
    Liu, Zihan
    Wang, Ting
    Wang, Donglin
    IEEE MULTIMEDIA, 2021, 28 (04) : 55 - 63
  • [37] Behavioral Analysis of Vision-and-Language Navigation Agents
    Yang, Zijiao
    Majumdar, Arjun
    Lee, Stefan
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 2574 - 2582
  • [38] Transferable Representation Learning in Vision-and-Language Navigation
    Huang, Haoshuo
    Jain, Vihan
    Mehta, Harsh
    Ku, Alexander
    Magalhaes, Gabriel
    Baldridge, Jason
    Ie, Eugene
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 7403 - 7412
  • [39] NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models
    Zhou, Gengze
    Hong, Yicong
    Wu, Qi
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 7, 2024, : 7641 - 7649
  • [40] VLSlice: Interactive Vision-and-Language Slice Discovery
    Slyman, Eric
    Kahng, Minsuk
    Lee, Stefan
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15245 - 15255