COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation

Cited by: 24
Authors
Wen, Keyu [1 ]
Xia, Jin [1 ]
Huang, Yuanyuan [1 ]
Li, Linyang [2 ]
Xu, Jiayan [1 ]
Shao, Jie [1 ]
Affiliations
[1] ByteDance AI Lab, London, England
[2] Fudan Univ, Shanghai, Peoples R China
DOI
10.1109/ICCV48922.2021.00221
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
There has been a recent surge of interest in cross-modal pre-training. However, existing approaches pre-train a one-stream model to learn a joint vision-language representation, which suffers from a computational explosion when performing cross-modal retrieval. In this work, we propose the Contrastive Cross-Modal Knowledge Sharing Pre-training (COOKIE) method to learn universal text-image representations. It has two key designs: a weight-sharing transformer on top of the visual and textual encoders that aligns text and images semantically, and three kinds of contrastive learning designed to share knowledge across modalities. Cross-modal knowledge sharing greatly promotes the learning of unimodal representations. Experiments on multi-modal matching tasks, including cross-modal retrieval, text matching, and image retrieval, show the effectiveness and efficiency of our pre-training framework. Fine-tuned on the cross-modal datasets MSCOCO, Flickr30K, and MSRVTT, COOKIE achieves new state-of-the-art results while using only 3/1000 of the inference time of one-stream models, and it yields 5.7% and 3.9% improvements on image retrieval and text matching, respectively.
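To make the two key designs concrete, below is a minimal PyTorch sketch of a weight-sharing transformer head applied to both modalities together with one symmetric image-text contrastive (InfoNCE) loss over in-batch negatives. It illustrates only the cross-modal alignment objective, not all three contrastive objectives; the module names, dimensions, pooling choice, and temperature are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedProjectionHead(nn.Module):
    # One head whose weights are shared by both modalities, echoing the
    # weight-sharing transformer described in the abstract (sketch only).
    def __init__(self, dim=512, proj_dim=256):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.proj = nn.Linear(dim, proj_dim)

    def forward(self, tokens):
        # tokens: (batch, seq_len, dim) outputs of a unimodal encoder
        x = self.layer(tokens)
        # pool the first token and L2-normalize to get a unit-length embedding
        return F.normalize(self.proj(x[:, 0]), dim=-1)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE over in-batch negatives: image-to-text and text-to-image.
    logits = img_emb @ txt_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(img_emb.size(0))        # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    head = SharedProjectionHead()
    img_tokens = torch.randn(4, 50, 512)   # stand-in for visual-encoder outputs
    txt_tokens = torch.randn(4, 32, 512)   # stand-in for textual-encoder outputs
    print(contrastive_loss(head(img_tokens), head(txt_tokens)).item())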
Pages: 2188 - 2197
Page count: 10
Related Papers
50 records in total
  • [21] VLP: A Survey on Vision-language Pre-training
    Fei-Long Chen
    Du-Zhen Zhang
    Ming-Lun Han
    Xiu-Yi Chen
    Jing Shi
    Shuang Xu
    Bo Xu
    Machine Intelligence Research, 2023, 20 : 38 - 56
  • [22] Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
    Wu, Wenhao
    Wang, Xiaohan
    Luo, Haipeng
    Wang, Jingdong
    Yang, Yi
    Ouyang, Wanli
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6620 - 6630
  • [23] UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
    Zhou, Mingyang
    Zhou, Luowei
    Wang, Shuohang
    Cheng, Yu
    Li, Linjie
    Yu, Zhou
    Liu, Jingjing
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 4153 - 4163
  • [24] Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
    Jian, Yiren
    Gao, Chongyang
    Vosoughi, Soroush
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [25] RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training
    Zhou, Chulun
    Liang, Yunlong
    Meng, Fandong
    Xu, Jinan
    Su, Jinsong
    Zhou, Jie
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023, : 11747 - 11762
  • [26] CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising
    Luo, Jianjie
    Li, Yehao
    Pan, Yingwei
    Yao, Ting
    Chao, Hongyang
    Mei, Tao
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 5600 - 5608
  • [27] Pre-training A Prompt Pool for Vision-Language Model
    Liu, Jun
    Gu, Yang
    Yang, Zhaohua
    Guo, Shuai
    Liu, Huaqiu
    Chen, Yiqiang
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [28] Contrastive Cross-Modal Pre-Training: A General Strategy for Small Sample Medical Imaging
    Liang, Gongbo
    Greenwell, Connor
    Zhang, Yu
    Xing, Xin
    Wang, Xiaoqin
    Kavuluru, Ramakanth
    Jacobs, Nathan
    IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2022, 26 (04) : 1640 - 1649
  • [29] Cross-Modal Contrastive Pre-Training for Few-Shot Skeleton Action Recognition
    Lu, Mingqi
    Yang, Siyuan
    Lu, Xiaobo
    Liu, Jun
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (10) : 9798 - 9807
  • [30] Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding
    Zhang, Taolin
    He, Sunan
    Dai, Tao
    Wang, Zhi
    Chen, Bin
    Xia, Shu-Tao
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 7, 2024, : 7296 - 7304