UNITER: UNiversal Image-TExt Representation Learning

Cited: 1269
Authors
Chen, Yen-Chun [1 ]
Li, Linjie [1 ]
Yu, Licheng [1 ]
El Kholy, Ahmed [1 ]
Ahmed, Faisal [1 ]
Gan, Zhe [1 ]
Cheng, Yu [1 ]
Liu, Jingjing [1 ]
Affiliation
[1] Microsoft Dynamics 365 AI Research, Redmond, WA 98052, USA
Source
COMPUTER VISION - ECCV 2020, PT XXX | 2020 / Vol. 12375
DOI
10.1007/978-3-030-58577-8_7
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodal inputs are processed simultaneously for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Unlike previous work, which applies joint random masking to both modalities, we use conditional masking for the pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of the image/text). In addition to ITM for global image-text alignment, we also propose WRA, which uses Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves a new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR2. Code is available at https://github.com/ChenRocks/UNITER.
Pages: 104-120
Page count: 17
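
The two pre-training ideas emphasized in the abstract, conditional masking and OT-based Word-Region Alignment, can be made concrete with a short sketch. The Python snippet below is a minimal illustration under assumed shapes and hyperparameters, not the authors' implementation: conditional_mask, sinkhorn, the mask probability, the entropic regularizer eps, and the toy embedding sizes are all hypothetical names and values chosen for the example.

import numpy as np

rng = np.random.default_rng(0)

def conditional_mask(token_ids, region_feats, mask_prob=0.15, mask_id=103):
    """Conditional masking sketch: corrupt ONE modality (here, the text)
    while the other modality (image regions) stays fully observed, in
    contrast to jointly masking both modalities at random."""
    masked = token_ids.copy()
    to_mask = rng.random(token_ids.shape[0]) < mask_prob
    masked[to_mask] = mask_id              # replace sampled tokens with [MASK]
    return masked, region_feats, to_mask   # regions are left untouched

def sinkhorn(cost, a, b, eps=0.1, n_iters=100):
    """Entropic OT via Sinkhorn iterations. cost[i, j] is a distance between
    word i and region j; the returned plan T gives soft word-region
    alignments, and (T * cost).sum() acts as a WRA-style alignment loss."""
    K = np.exp(-cost / eps)                # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                  # scale columns to match marginal b
        u = a / (K @ v)                    # scale rows to match marginal a
    return u[:, None] * K * v[None, :]     # transport plan diag(u) K diag(v)

# Toy usage with hypothetical sizes: 6 word embeddings vs. 4 region embeddings.
words = rng.normal(size=(6, 16))
regions = rng.normal(size=(4, 16))
w_norm = np.linalg.norm(words, axis=1, keepdims=True)
r_norm = np.linalg.norm(regions, axis=1, keepdims=True)
cost = 1.0 - (words @ regions.T) / (w_norm * r_norm.T)   # cosine distance matrix
plan = sinkhorn(cost, np.full(6, 1 / 6), np.full(4, 1 / 4))
wra_loss = (plan * cost).sum()             # OT distance used as alignment loss
masked_ids, regions_obs, mask = conditional_mask(np.arange(100, 106), regions)

In this sketch the Sinkhorn transport plan plays the role of a soft word-region alignment; the paper itself approximates the OT distance with the IPOT algorithm, for which the plain Sinkhorn iteration here is only a stand-in.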