Mask to Reconstruct: Cooperative Semantics Completion for Video-text Retrieval

Cited by: 0

Authors:

Fang, Han [1]

Yang, Zhifei [1]

Zang, Xianghao [1]

Ban, Chao [1]

He, Zhongjiang [1]

Sun, Hao [1]

Zhou, Lanxiang [1]

Affiliations:

[1] China Telecom Corp Ltd, Data&AI Technol Co, Hong Kong, Peoples R China

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023

Keywords:

Video-Text Retrieval; Mask Video Modeling; Attention;

DOI:

10.1145/3581783.3611756

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Recently, masked video modeling has been widely explored and has improved models' ability to understand visual regions at a local level. However, existing methods usually adopt random masking and follow the same reconstruction paradigm to complete the masked regions, which does not leverage the correlations between cross-modal content. In this paper, we present MAsk for Semantics COmpleTion (MASCOT), based on semantic-based masked modeling. Specifically, after applying attention-based video masking to generate high-informed and low-informed masks, we propose Informed Semantics Completion to recover masked semantic information. The recovery is achieved by aligning the masked content with the unmasked visual regions and the corresponding textual context, which lets the model capture more text-related details at the patch level. Additionally, we shift the emphasis of reconstruction from irrelevant backgrounds to discriminative parts, ignoring regions with low-informed masks. Furthermore, we design co-learning to incorporate video cues under different masks and learn better-aligned representations. MASCOT achieves state-of-the-art performance on four text-video retrieval benchmarks: MSR-VTT, LSMDC, ActivityNet, and DiDeMo.
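
As a rough illustration of the attention-based masking step mentioned in the abstract, the sketch below derives two complementary masks from per-patch attention scores. The abstract does not say how the masks are computed, so everything here is an assumption made for illustration: the function name `attention_based_masks`, the `mask_ratio` parameter, and the rule that a high-informed mask hides the most attended patches while a low-informed mask hides the least attended ones. It is not the authors' implementation.

```python
from typing import List, Tuple


def attention_based_masks(
    scores: List[float], mask_ratio: float
) -> Tuple[List[bool], List[bool]]:
    """Return (high_informed_mask, low_informed_mask); True marks a masked patch.

    Assumption for illustration: `scores` holds one attention weight per
    patch, and each mask hides round(mask_ratio * len(scores)) patches.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    num_patches = len(scores)
    num_masked = int(round(mask_ratio * num_patches))
    # Patch indices sorted from most attended to least attended.
    order = sorted(range(num_patches), key=lambda i: scores[i], reverse=True)
    # Hypothetical rule: the high-informed mask covers the top-scoring
    # patches, the low-informed mask covers the bottom-scoring ones.
    high_idx = set(order[:num_masked])
    low_idx = set(order[num_patches - num_masked:])
    high_mask = [i in high_idx for i in range(num_patches)]
    low_mask = [i in low_idx for i in range(num_patches)]
    return high_mask, low_mask
```

Under these assumptions, a reconstruction objective could weight patches hidden by the high-informed mask more heavily than those hidden by the low-informed mask, which is one way to read the abstract's shift of emphasis toward discriminative parts.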

Pages: 3847-3856

Page count: 10
