Mask to Reconstruct: Cooperative Semantics Completion for Video-text Retrieval

Authors
Fang, Han [1]
Yang, Zhifei [1]
Zang, Xianghao [1]
Ban, Chao [1]
He, Zhongjiang [1]
Sun, Hao [1]
Zhou, Lanxiang [1]
Affiliations
[1] China Telecom Corp Ltd, Data&AI Technol Co, Hong Kong, Peoples R China
Keywords
Video-Text Retrieval; Mask Video Modeling; Attention;
DOI
10.1145/3581783.3611756
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, masked video modeling has been widely explored and has improved models' ability to understand visual regions at a local level. However, existing methods usually adopt random masking and follow the same reconstruction paradigm to complete the masked regions, which does not leverage the correlations between cross-modal content. In this paper, we present MAsk for Semantics COmpleTion (MASCOT), which is built on semantic-based masked modeling. Specifically, after applying attention-based video masking to generate high-informed and low-informed masks, we propose Informed Semantics Completion to recover the masked semantic information. The recovery mechanism works by aligning the masked content with the unmasked visual regions and the corresponding textual context, which makes the model capture more text-related details at the patch level. Additionally, we shift the emphasis of reconstruction from irrelevant backgrounds to discriminative parts, so that regions under low-informed masks are ignored. Furthermore, we design co-learning to incorporate video cues under different masks and learn a better-aligned representation. MASCOT achieves state-of-the-art performance on four text-video retrieval benchmarks: MSR-VTT, LSMDC, ActivityNet, and DiDeMo.
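The attention-based masking step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how attention scores are obtained, what masking ratio is used, or exactly how the two masks are defined, so everything below is an assumption made for illustration: the function name `attention_masks`, the `mask_ratio` parameter, and the reading that a high-informed mask hides the most-attended patches while a low-informed mask hides the least-attended ones. It is not the authors' implementation.

```python
from typing import List, Tuple


def attention_masks(
    scores: List[float], mask_ratio: float = 0.5
) -> Tuple[List[bool], List[bool]]:
    """Build two patch masks from per-patch attention scores.

    Returns (high_informed, low_informed). True means the patch is masked.
    Assumption: the high-informed mask covers the top-scoring patches and
    the low-informed mask covers the bottom-scoring patches.
    """
    if not 0.0 < mask_ratio < 1.0:
        raise ValueError("mask_ratio must be strictly between 0 and 1")
    n = len(scores)
    if n == 0:
        raise ValueError("scores must not be empty")

    # Number of patches to mask in each view (at least one).
    k = max(1, int(n * mask_ratio))

    # Patch indices sorted from most attended to least attended.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    high_idx = set(order[:k])   # most attended patches
    low_idx = set(order[-k:])   # least attended patches

    high_informed = [i in high_idx for i in range(n)]
    low_informed = [i in low_idx for i in range(n)]
    return high_informed, low_informed


# Hypothetical attention scores for four patches of one frame.
high, low = attention_masks([0.05, 0.45, 0.30, 0.20], mask_ratio=0.5)
print(high)  # [False, True, True, False]
print(low)   # [True, False, False, True]
```

Under this reading, a completion objective would focus on the patches hidden by the high-informed mask, while patches hidden by the low-informed mask (likely background) would receive little or no reconstruction emphasis.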
Pages: 3847 - 3856
Page count: 10