Mask to Reconstruct: Cooperative Semantics Completion for Video-text Retrieval

Citations: 0
Authors
Fang, Han [1]
Yang, Zhifei [1]
Zang, Xianghao [1]
Ban, Chao [1]
He, Zhongjiang [1]
Sun, Hao [1]
Zhou, Lanxiang [1]
Affiliations
[1] China Telecom Corp Ltd, Data&AI Technol Co, Hong Kong, Peoples R China
Keywords
Video-Text Retrieval; Mask Video Modeling; Attention;
DOI
10.1145/3581783.3611756
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, masked video modeling has been widely explored and has improved models' understanding of visual regions at a local level. However, existing methods usually adopt random masking and follow the same reconstruction paradigm to complete the masked regions, which does not leverage the correlations between cross-modal content. In this paper, we present MAsk for Semantics COmpleTion (MASCOT), which is based on semantic-based masked modeling. Specifically, after applying attention-based video masking to generate high-informed and low-informed masks, we propose Informed Semantics Completion to recover the masked semantic information. The recovery is achieved by aligning the masked content with the unmasked visual regions and the corresponding textual context, which makes the model capture more text-related details at the patch level. Additionally, we shift the emphasis of reconstruction from irrelevant backgrounds to discriminative parts so that regions with low-informed masks are ignored. Furthermore, we design co-learning to incorporate video cues under different masks and learn more aligned representations. MASCOT achieves state-of-the-art performance on four text-video retrieval benchmarks: MSR-VTT, LSMDC, ActivityNet, and DiDeMo.
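
The abstract does not spell out how the high-informed and low-informed masks are computed. As a rough illustration only, the sketch below assumes that each patch has a scalar attention score, that the high-informed mask hides the most-attended patches, and that the low-informed mask hides the least-attended ones. The function name `attention_masks` and the `mask_ratio` parameter are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def attention_masks(
    scores: List[float], mask_ratio: float = 0.5
) -> Tuple[List[bool], List[bool]]:
    """Return (high_informed_mask, low_informed_mask).

    True marks a patch that is masked out. This is an assumed reading of
    "attention-based video masking", not the authors' implementation.
    """
    if not 0.0 < mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in (0, 1)")
    n = len(scores)
    k = max(1, int(n * mask_ratio))  # number of patches to mask
    # Patch indices ordered from most to least attended.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    high = set(order[:k])   # hide the most-attended patches
    low = set(order[-k:])   # hide the least-attended patches
    high_mask = [i in high for i in range(n)]
    low_mask = [i in low for i in range(n)]
    return high_mask, low_mask


if __name__ == "__main__":
    # Toy attention scores for four patches of one frame.
    high_mask, low_mask = attention_masks([0.1, 0.6, 0.2, 0.5], mask_ratio=0.5)
    print("high-informed mask:", high_mask)
    print("low-informed mask: ", low_mask)
```

Under these assumptions, a reconstruction objective could be weighted toward patches hidden by the high-informed mask, which matches the abstract's stated shift of emphasis toward discriminative parts.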
Pages: 3847-3856
Number of pages: 10