Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers

Cited by: 55

Authors:

Hendricks, Lisa Anne [1]

Mellor, John [1]

Schneider, Rosalia [1]

Alayrac, Jean-Baptiste [1]

Nematzadeh, Aida [1]

Affiliation:

[1] DeepMind, London, England

Source:

TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS | 2021 / Vol. 9

Keywords:

Computational linguistics - Image retrieval;

DOI:

10.1162/tacl_a_00385

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Recently, multimodal transformer models have gained popularity because their performance on downstream tasks suggests they learn rich visual-linguistic representations. Focusing on zero-shot image retrieval tasks, we study three important factors that can impact the quality of learned representations: pretraining data, the attention mechanism, and loss functions. By pretraining models on six datasets, we observe that dataset noise and language similarity to our downstream task are important indicators of model performance. Through architectural analysis, we learn that models with a multimodal attention mechanism can outperform deeper models with modality-specific attention mechanisms. Finally, we show that successful contrastive losses used in the self-supervised learning literature do not yield similar performance gains when used in multimodal transformers.

Pages: 570-585

Page count: 16

