UNITER: UNiversal Image-TExt Representation Learning

Cited: 1269
Authors
Chen, Yen-Chun [1 ]
Li, Linjie [1 ]
Yu, Licheng [1 ]
El Kholy, Ahmed [1 ]
Ahmed, Faisal [1 ]
Gan, Zhe [1 ]
Cheng, Yu [1 ]
Liu, Jingjing [1 ]
Affiliation
[1] Microsoft Dynamics 365 AI Research, Redmond, WA 98052, USA
Source
COMPUTER VISION - ECCV 2020, PT XXX | 2020 / Vol. 12375
DOI
10.1007/978-3-030-58577-8_7
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodal inputs are processed simultaneously for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Unlike previous work, which applies joint random masking to both modalities, we use conditional masking for the pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of the image/text). In addition to ITM for global image-text alignment, we also propose WRA, which uses Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves a new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR2. Code is available at https://github.com/ChenRocks/UNITER.
Pages: 104-120
Page count: 17
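
The two pre-training ideas emphasized in the abstract, conditional masking and OT-based Word-Region Alignment, can be made concrete with a short sketch. The Python snippet below is a minimal illustration under assumed shapes and hyperparameters, not the authors' implementation: conditional_mask, sinkhorn, the mask probability, the entropic regularizer eps, and the toy embedding sizes are all hypothetical names and values chosen for the example.

import numpy as np

rng = np.random.default_rng(0)

def conditional_mask(token_ids, region_feats, mask_prob=0.15, mask_id=103):
    """Conditional masking sketch: corrupt ONE modality (here, the text)
    while the other modality (image regions) stays fully observed, in
    contrast to jointly masking both modalities at random."""
    masked = token_ids.copy()
    to_mask = rng.random(token_ids.shape[0]) < mask_prob
    masked[to_mask] = mask_id              # replace sampled tokens with [MASK]
    return masked, region_feats, to_mask   # regions are left untouched

def sinkhorn(cost, a, b, eps=0.1, n_iters=100):
    """Entropic OT via Sinkhorn iterations. cost[i, j] is a distance between
    word i and region j; the returned plan T gives soft word-region
    alignments, and (T * cost).sum() acts as a WRA-style alignment loss."""
    K = np.exp(-cost / eps)                # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                  # scale columns to match marginal b
        u = a / (K @ v)                    # scale rows to match marginal a
    return u[:, None] * K * v[None, :]     # transport plan diag(u) K diag(v)

# Toy usage with hypothetical sizes: 6 word embeddings vs. 4 region embeddings.
words = rng.normal(size=(6, 16))
regions = rng.normal(size=(4, 16))
w_norm = np.linalg.norm(words, axis=1, keepdims=True)
r_norm = np.linalg.norm(regions, axis=1, keepdims=True)
cost = 1.0 - (words @ regions.T) / (w_norm * r_norm.T)   # cosine distance matrix
plan = sinkhorn(cost, np.full(6, 1 / 6), np.full(4, 1 / 4))
wra_loss = (plan * cost).sum()             # OT distance used as alignment loss
masked_ids, regions_obs, mask = conditional_mask(np.arange(100, 106), regions)

In this sketch the Sinkhorn transport plan plays the role of a soft word-region alignment; the paper itself approximates the OT distance with the IPOT algorithm, for which the plain Sinkhorn iteration here is only a stand-in.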