Mask to Reconstruct: Cooperative Semantics Completion for Video-text Retrieval

被引:0
|
作者
Fang, Han [1 ]
Yang, Zhifei [1 ]
Zang, Xianghao [1 ]
Ban, Chao [1 ]
He, Zhongjiang [1 ]
Sun, Hao [1 ]
Zhou, Lanxiang [1 ]
机构
[1] China Telecom Corp Ltd, Data&AI Technol Co, Hong Kong, Peoples R China
关键词
Video-Text Retrieval; Mask Video Modeling; Attention;
D O I
10.1145/3581783.3611756
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Recently, masked video modeling has been widely explored and improved the model's understanding ability of visual regions at a local level. However, existing methods usually adopt random masking and follow the same reconstruction paradigm to complete the masked regions, which do not leverage the correlations between cross-modal content. In this paper, we present MAsk for Semantics COmpleTion (MASCOT) based on semantic-based masked modeling. Specifically, after applying attention-based video masking to generate high-informed and low-informed masks, we propose Informed Semantics Completion to recover masked semantics information. The recovery mechanism is achieved by aligning the masked content with the unmasked visual regions and corresponding textual context, which makes the model capture more text-related details at a patch level. Additionally, we shift the emphasis of reconstruction from irrelevant backgrounds to discriminative parts to ignore regions with low-informed masks. Furthermore, we design co-learning to incorporate video cues under different masks and learn more aligned representation. Our MASCOT performs state-of-the-art performance on four text-video retrieval benchmarks, including MSR-VTT, LSMDC, ActivityNet, and DiDeMo.
引用
收藏
页码:3847 / 3856
页数:10
相关论文
共 50 条
  • [31] STACKED CONVOLUTIONAL DEEP ENCODING NETWORK FOR VIDEO-TEXT RETRIEVAL
    Zhao, Rui
    Zheng, Kecheng
    Zha, Zheng-jun
    2020 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2020,
  • [32] Improving Transformer with Dynamic Convolution and Shortcut for Video-Text Retrieval
    Liu, Zhi
    Cai, Jincen
    Zhang, Mengmeng
    KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS, 2022, 16 (07): : 2407 - 2424
  • [33] Unified Coarse-to-Fine Alignment for Video-Text Retrieval
    Wang, Ziyang
    Sung, Yi-Lin
    Cheng, Feng
    Bertasius, Gedas
    Bansal, Mohit
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 2804 - 2815
  • [34] Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval
    Lin, Chengzhi
    Wu, Ancong
    Liang, Junwei
    Zhang, Jun
    Ge, Wenhang
    Zheng, Wei-Shi
    Shen, Chunhua
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [35] LOOK, TELL AND MATCH: REFINING VIDEO-TEXT RETRIEVAL WITH SEMANTIC INFORMATION
    Zhu Jinkuan
    Hu Weiyi
    2022 19TH INTERNATIONAL COMPUTER CONFERENCE ON WAVELET ACTIVE MEDIA TECHNOLOGY AND INFORMATION PROCESSING (ICCWAMTIP), 2022,
  • [36] Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning
    Chen, Shizhe
    Zhao, Yida
    Jin, Qin
    Wu, Qi
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020, : 10635 - 10644
  • [37] Contrastive Transformer Cross-Modal Hashing for Video-Text Retrieval
    Shen, Xiaobo
    Huang, Qianxin
    Lan, Long
    Zheng, Yuhui
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 1227 - 1235
  • [38] KDProR: A Knowledge-Decoupling Probabilistic Framework for Video-Text Retrieval
    Zhuang, Xianwei
    Li, Hongxiang
    Cheng, Xuxin
    Zhu, Zhihong
    Xie, Yuxin
    Zou, Yuexian
    COMPUTER VISION - ECCV 2024, PT XXXIV, 2025, 15092 : 313 - 331
  • [39] EA-VTR: Event-Aware Video-Text Retrieval
    Ma, Zongyang
    Zhang, Ziqi
    Chen, Yuxin
    Qi, Zhongang
    Yuan, Chunfeng
    Li, Bing
    Luo, Yingmin
    Li, Xu
    Qi, Xiaojuan
    Shan, Ying
    Hu, Weiming
    COMPUTER VISION - ECCV 2024, PT LII, 2025, 15110 : 76 - 94
  • [40] Debiased Video-Text Retrieval via Soft Positive Sample Calibration
    Zhang, Huaiwen
    Yang, Yang
    Qi, Fan
    Qian, Shengsheng
    Xu, Changsheng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (09) : 5257 - 5270