Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding

Cited by: 2

Authors:

Alper, Morris [1]

Fiman, Michael [1]

Averbuch-Elor, Hadar [1]

Affiliations:

[1] Tel Aviv Univ, Tel Aviv, Israel

Source:

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR | 2023

Keywords:

COLOR;

DOI:

10.1109/CVPR52729.2023.00655

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Most humans use visual imagination to understand and reason about language, but models such as BERT reason about language using knowledge acquired during text-only pretraining. In this work, we investigate whether vision-and-language pretraining can improve performance on text-only tasks that involve implicit visual reasoning, focusing primarily on zero-shot probing methods. We propose a suite of visual language understanding (VLU) tasks for probing the visual reasoning abilities of text encoder models, as well as various non-visual natural language understanding (NLU) tasks for comparison. We also contribute a novel zero-shot knowledge probing method, Stroop probing, for applying models such as CLIP to text-only tasks without needing a prediction head such as the masked language modelling head of models like BERT. We show that SOTA multimodally trained text encoders outperform unimodally trained text encoders on the VLU tasks while being under-performed by them on the NLU tasks, lending new context to previously mixed results regarding the NLU capabilities of multimodal models. We conclude that exposure to images during pretraining affords inherent visual reasoning knowledge that is reflected in language-only tasks that require implicit visual reasoning. Our findings bear importance in the broader context of multimodal learning, providing principled guidelines for the choice of text encoders used in such contexts.


Pages: 6778-6788

Page count: 11
