Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding

Cited by: 2

Authors:

Alper, Morris [1]

Fiman, Michael [1]

Averbuch-Elor, Hadar [1]

Affiliations:

[1] Tel Aviv Univ, Tel Aviv, Israel

Source:

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR | 2023

Keywords:

COLOR;

DOI:

10.1109/CVPR52729.2023.00655

Chinese Library Classification (CLC):

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Most humans use visual imagination to understand and reason about language, but models such as BERT reason about language using knowledge acquired during text-only pretraining. In this work, we investigate whether vision-and-language pretraining can improve performance on text-only tasks that involve implicit visual reasoning, focusing primarily on zero-shot probing methods. We propose a suite of visual language understanding (VLU) tasks for probing the visual reasoning abilities of text encoder models, as well as various non-visual natural language understanding (NLU) tasks for comparison. We also contribute a novel zero-shot knowledge probing method, Stroop probing, for applying models such as CLIP to text-only tasks without needing a prediction head such as the masked language modelling head of models like BERT. We show that SOTA multimodally trained text encoders outperform unimodally trained text encoders on the VLU tasks while being under-performed by them on the NLU tasks, lending new context to previously mixed results regarding the NLU capabilities of multimodal models. We conclude that exposure to images during pretraining affords inherent visual reasoning knowledge that is reflected in language-only tasks that require implicit visual reasoning. Our findings bear importance in the broader context of multimodal learning, providing principled guidelines for the choice of text encoders used in such contexts.


Pages: 6778-6788

Number of pages: 11

