Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

Cited by: 47
Authors
Cao, Jize [1 ]
Gan, Zhe [2 ]
Cheng, Yu [2 ]
Yu, Licheng [3 ]
Chen, Yen-Chun [2 ]
Liu, Jingjing [2 ]
Affiliations
[1] University of Washington, Seattle, WA 98195, USA
[2] Microsoft Dynamics 365 AI Research, Redmond, WA, USA
[3] Facebook AI, Menlo Park, CA, USA
Source
Computer Vision - ECCV 2020, Part VI | 2020, Vol. 12351
DOI
10.1007/978-3-030-58539-6_34
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research. Models such as ViLBERT, LXMERT and UNITER have significantly lifted the state of the art across a wide range of V+L benchmarks. However, little is known about the inner mechanisms behind their impressive success. To reveal the secrets behind the scene, we present VALUE (Vision-And-Language Understanding Evaluation), a set of meticulously designed probing tasks (e.g., Visual Coreference Resolution, Visual Relation Detection) generalizable to standard pre-trained V+L models, to decipher the inner workings of multimodal pre-training (e.g., implicit knowledge garnered in individual attention heads, inherent cross-modal alignment learned through contextualized multimodal embeddings). Through extensive analysis of each archetypal model architecture via these probing tasks, our key observations are: (i) Pre-trained models exhibit a propensity for attending over text rather than images during inference. (ii) There exists a subset of attention heads that are tailored for capturing cross-modal interactions. (iii) The learned attention matrices in pre-trained models demonstrate patterns consistent with the latent alignment between image regions and textual words. (iv) Plotted attention patterns reveal visually interpretable relations among image regions. (v) Pure linguistic knowledge is also effectively encoded in the attention heads. These insights can guide future work toward designing better model architectures and objectives for multimodal pre-training. (Code is available at https://github.com/JizeCao/VALUE).
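To make observations (i) and (ii) concrete, the following is a minimal sketch, not the authors' released probing code: it assumes a single-stream (UNITER-style) model whose per-layer attention maps have shape [num_heads, seq_len, seq_len], with text tokens occupying the first positions of the sequence and image regions the rest. The helper name modality_attention_share and the random stand-in tensors are hypothetical; with real attention maps, the per-head text share vs. image share is how one could quantify text-biased attention and spot heads specialized for cross-modal interaction.

# Minimal sketch (hypothetical helper, random stand-in data): measure, per attention
# head, how much attention mass lands on text tokens vs. image-region tokens.
import torch

def modality_attention_share(attn, num_text_tokens):
    """attn: [num_heads, seq_len, seq_len] row-stochastic attention matrix.
    Text tokens occupy positions [0, num_text_tokens); image regions the remainder.
    Returns per-head attention shares averaged over query positions."""
    text_share = attn[:, :, :num_text_tokens].sum(dim=-1).mean(dim=-1)   # [num_heads]
    image_share = attn[:, :, num_text_tokens:].sum(dim=-1).mean(dim=-1)  # [num_heads]
    return text_share, image_share

if __name__ == "__main__":
    num_heads, num_text, num_regions = 12, 20, 36
    seq_len = num_text + num_regions
    # Random stand-in for one layer's attention; rows normalized like a softmax output.
    attn = torch.rand(num_heads, seq_len, seq_len)
    attn = attn / attn.sum(dim=-1, keepdim=True)
    text_share, image_share = modality_attention_share(attn, num_text)
    for h, (t, i) in enumerate(zip(text_share.tolist(), image_share.tolist())):
        print(f"head {h:2d}: text {t:.2f} | image {i:.2f}")

Under this setup, a head whose image share stays high across many inputs would be a candidate cross-modal head in the sense of observation (ii), while a layer-wide skew toward text share would reflect observation (i).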
Pages: 565-580
Page count: 16
Cited References
40 in total
[1] Alberti, C. (2019). In: Proceedings of EMNLP-IJCNLP 2019, p. 2131.
[2] Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L. (2018). Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6077-6086.
[3] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D. (2015). VQA: Visual Question Answering. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2425-2433.
[4] Bouraoui, Z. (2020). In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, p. 7456.
[5] Chen, Y.-C. (2020). arXiv:1909.11740.
[6] Clark, K. (2019). arXiv:1906.04341.
[7] Conneau, A. (2018). arXiv:1803.05449.
[8] Devlin, J. (2019). In: Proceedings of NAACL-HLT 2019, vol. 1, p. 4171.
[9] Gan, Z. (2020). arXiv:2006.06195.
[10] Goyal, Y., Khot, T., Agrawal, A., Summers-Stay, D., Batra, D., Parikh, D. (2019). Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. International Journal of Computer Vision, 127(4), 398-414.