MAC: Masked Contrastive Pre-Training for Efficient Video-Text Retrieval

Cited by: 1

Authors:

Shu, Fangxun [1]

Chen, Biaolong [1]

Liao, Yue [2]

Wang, Jinqiao [3,4]

Liu, Si [2]

Affiliations:

[1] Alibaba Grp, Beijing 100020, Peoples R China

[2] Beihang Univ, Inst Artificial Intelligence, Beijing 100083, Peoples R China

[3] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China

[4] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2024 / Vol. 26

Funding:

National Natural Science Foundation of China;

Keywords:

Task analysis; Redundancy; Computational modeling; Visualization; Training; Semantics; Feature extraction; Contrastive learning; end-to-end pretraining; masked modeling; video-text retrieval;

DOI:

10.1109/TMM.2024.3402613

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

We present a simple yet effective end-to-end Video-language Pre-training (VidLP) framework, Masked Contrastive Video-language Pre-training (MAC), for video-text retrieval tasks. MAC uses a mask sampling mechanism to reduce the spatial and temporal redundancy of video representations in the VidLP model, which improves pre-training efficiency. In contrast to conventional temporal sparse sampling, we randomly mask a high ratio of spatial regions and feed only the visible regions into the encoder, as a form of sparse spatial sampling. For consistency, we apply the same mask sampling technique to text inputs. Instead of blindly applying the mask-then-prediction paradigm from MAE, we propose a masked-then-alignment paradigm for efficient video-text alignment. The motivation is that video-text retrieval tasks rely on high-level alignment rather than low-level reconstruction, and multimodal alignment with masked modeling encourages the model to learn a robust and general multimodal representation from incomplete and unstable inputs. Coupling these designs enables efficient end-to-end pre-training: 3x speed-up, 60%+ computation reduction, and 4%+ performance improvement. MAC achieves state-of-the-art results on various video-text retrieval datasets, including MSR-VTT, DiDeMo, and ActivityNet. Our approach is omnivorous with respect to input modalities: with minimal modifications, it achieves competitive results on image-text retrieval tasks.
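The two ideas in the abstract, sparse spatial sampling by random masking and alignment instead of reconstruction, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the 196-patch grid, the 0.6 mask ratio, the 0.05 temperature, and the use of a symmetric InfoNCE loss are all assumptions chosen for illustration, and the encoders are omitted, so the embeddings below are hand-written stand-ins.

```python
import math
import random


def sample_visible_patches(num_patches, mask_ratio, rng):
    """Return sorted indices of patches kept visible after random masking.

    Only these indices would be passed to the (omitted) encoder, which is
    where the compute saving in a masking scheme of this kind comes from.
    """
    num_keep = max(1, round(num_patches * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_patches), num_keep))


def l2_normalize(vec):
    """Scale a vector to unit length (all-zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(video_embs, text_embs, temperature=0.05):
    """Symmetric contrastive loss over paired embeddings (pair i matches i).

    Assumed stand-in for the alignment objective; the abstract does not
    specify the exact loss.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
           for vi in v]
    sim_transposed = [list(col) for col in zip(*sim)]
    video_to_text = cross_entropy_diagonal(sim)
    text_to_video = cross_entropy_diagonal(sim_transposed)
    return 0.5 * (video_to_text + text_to_video)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 14 x 14 patch grid with 60% of patches masked.
    visible = sample_visible_patches(num_patches=196, mask_ratio=0.6, rng=rng)
    print(len(visible), "of 196 patches reach the encoder")

    # Stand-in embeddings; in practice these come from the two encoders.
    video = [[1.0, 0.0], [0.0, 1.0]]
    aligned_text = [[1.0, 0.0], [0.0, 1.0]]
    swapped_text = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_info_nce(video, aligned_text))
    print(symmetric_info_nce(video, swapped_text))
```

In this sketch the loss is computed on embeddings of the masked inputs directly, with no decoder and no pixel or token reconstruction, which is the distinction the abstract draws between masked-then-alignment and the mask-then-prediction paradigm.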

Pages: 9962-9972

Page count: 11
