MAC: Masked Contrastive Pre-Training for Efficient Video-Text Retrieval

Cited by: 1
Authors
Shu, Fangxun [1 ]
Chen, Biaolong [1 ]
Liao, Yue [2 ]
Wang, Jinqiao [3 ,4 ]
Liu, Si [2 ]
Affiliations
[1] Alibaba Grp, Beijing 100020, Peoples R China
[2] Beihang Univ, Inst Artificial Intelligence, Beijing 100083, Peoples R China
[3] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
[4] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Task analysis; Redundancy; Computational modeling; Visualization; Training; Semantics; Feature extraction; Contrastive learning; end-to-end pretraining; masked modeling; video-text retrieval;
DOI
10.1109/TMM.2024.3402613
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
We present a simple yet effective end-to-end Video-language Pre-training (VidLP) framework, Masked Contrastive Video-language Pre-training (MAC), for video-text retrieval tasks. MAC reduces the spatial and temporal redundancy of video representations in the VidLP model through a mask sampling mechanism, improving pre-training efficiency. Compared with conventional temporal sparse sampling, we propose randomly masking a high ratio of spatial regions and feeding only the visible regions into the encoder, as a form of sparse spatial sampling. For consistency, we adopt the same mask sampling technique for text inputs. Rather than blindly applying the mask-then-prediction paradigm of MAE, we propose a mask-then-alignment paradigm for efficient video-text alignment. The motivation is that video-text retrieval relies on high-level alignment rather than low-level reconstruction, and multimodal alignment with masked modeling encourages the model to learn a robust and general multimodal representation from incomplete and unstable inputs. Coupling these designs enables efficient end-to-end pre-training: a 3x speedup, over 60% computation reduction, and over 4% performance improvement. MAC achieves state-of-the-art results on various video-text retrieval datasets, including MSR-VTT, DiDeMo, and ActivityNet. Our approach is also flexible with respect to input modalities: with minimal modifications, it achieves competitive results on image-text retrieval tasks.
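The mask-then-alignment idea in the abstract can be sketched in a toy form: drop a high ratio of spatial tokens before encoding, then align paired video and text embeddings with a symmetric contrastive (InfoNCE) objective. This is a minimal NumPy illustration under stated assumptions, not MAC's actual architecture; the mean-pooling "encoder", mask ratio of 0.75, and temperature of 0.07 are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_sample(tokens, mask_ratio=0.75):
    """Randomly keep only (1 - mask_ratio) of the tokens.

    Only these visible tokens would be fed to the encoder, which is
    where the computation saving comes from.
    """
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * (1 - mask_ratio))))
    keep = rng.choice(n, size=n_keep, replace=False)
    return tokens[np.sort(keep)]

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # cosine similarities, scaled
    labels = np.arange(len(v))              # i-th video pairs with i-th text

    def ce(lg):
        lg = lg - lg.max(axis=1, keepdims=True)          # numerical stability
        p = np.exp(lg) / np.exp(lg).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()         # cross-entropy on diagonal

    return 0.5 * (ce(logits) + ce(logits.T))             # video->text and text->video

# Toy batch: 4 videos, each with 196 spatial patch tokens of dimension 32.
patches = rng.normal(size=(4, 196, 32))
visible = np.stack([mask_sample(p, 0.75) for p in patches])   # (4, 49, 32): 75% masked
video_emb = visible.mean(axis=1)                              # stand-in video encoder
text_emb = video_emb + 0.01 * rng.normal(size=video_emb.shape)  # stand-in paired texts
loss = info_nce(video_emb, text_emb)
```

Because only 49 of 196 tokens survive masking, a real transformer encoder would process roughly a quarter of the tokens per step, which is the source of the speedup the abstract reports; the alignment loss, unlike MAE-style reconstruction, never requires predicting the masked pixels.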
Pages: 9962-9972
Page count: 11
Related Papers
50 records in total
  • [1] Video-Text Pre-training with Learned Regions for Retrieval
    Yan, Rui
    Shou, Mike Zheng
    Ge, Yixiao
    Wang, Jinpeng
    Lin, Xudong
    Cai, Guanyu
    Tang, Jinhui
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3, 2023, : 3100 - 3108
  • [2] VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
    Xu, Hu
    Ghosh, Gargi
    Huang, Po-Yao
    Okhonko, Dmytro
    Aghajanyan, Armen
    Metze, Florian
    Zettlemoyer, Luke
    Feichtenhofer, Christoph
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 6787 - 6800
  • [3] LocVTP: Video-Text Pre-training for Temporal Localization
    Cao, Meng
    Yang, Tianyu
    Weng, Junwu
    Zhang, Can
    Wang, Jue
    Zou, Yuexian
    COMPUTER VISION, ECCV 2022, PT XXVI, 2022, 13686 : 38 - 56
  • [4] MILES: Visual BERT Pre-training with Injected Language Semantics for Video-Text Retrieval
    Ge, Yuying
    Ge, Yixiao
    Liu, Xihui
    Wang, Jinpeng
    Wu, Jianping
    Shan, Ying
    Qie, Xiaohu
    Luo, Ping
    COMPUTER VISION - ECCV 2022, PT XXXV, 2022, 13695 : 691 - 708
  • [5] Stitching Segments and Sentences towards Generalization in Video-Text Pre-training
    Ma, Fan
    Jin, Xiaojie
    Wang, Heng
    Huang, Jingjia
    Zhu, Linchao
    Yang, Yi
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 5, 2024, : 4080 - 4088
  • [6] Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval
    Ma, Wentao
    Chen, Qingchao
    Zhou, Tongqing
    Zhao, Shan
    Cai, Zhiping
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (10) : 5486 - 5497
  • [7] Expert-guided contrastive learning for video-text retrieval
    Lee, Jewook
    Lee, Pilhyeon
    Park, Sungho
    Byun, Hyeran
    NEUROCOMPUTING, 2023, 536 : 50 - 58
  • [8] Contrastive Transformer Cross-Modal Hashing for Video-Text Retrieval
    Shen, Xiaobo
    Huang, Qianxin
    Lan, Long
    Zheng, Yuhui
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 1227 - 1235
  • [9] An Efficient Multimodal Aggregation Network for Video-Text Retrieval
    Liu, Zhi
    Zhao, Fangyuan
    Zhang, Mengmeng
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2022, E105D (10) : 1825 - 1828
  • [10] MimCo: Masked Image Modeling Pre-training with Contrastive Teacher
    Zhou, Qiang
    Yu, Chaohui
    Luo, Hao
    Wang, Zhibin
    Li, Hao
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4487 - 4495