CLIP Based Multi-Event Representation Generation for Video-Text Retrieval

Cited by: 0
Authors
Tu R. [1 ]
Mao X. [1 ]
Kong W. [2 ]
Cai C. [3 ]
Zhao W. [4 ]
Wang H. [5 ]
Huang H. [1 ]
Affiliations
[1] Department of Computer Science and Technology, Beijing Institute of Technology, Beijing
[2] School of Information Engineering, Peking University, Shenzhen, Guangdong
[3] School of Electronic Information, Zhejiang University, Hangzhou
[4] School of Software, South China University of Technology, Guangzhou
[5] Institute of Automation, Chinese Academy of Sciences, Beijing
Funding
National Natural Science Foundation of China;
Keywords
CLIP model; event representation; pre-training model; Transformer model; video-text retrieval;
DOI
10.7544/issn1000-1239.202220440
Abstract
Video-text retrieval has been widely used in many real-world applications and has attracted increasing research attention. Recently, many methods have been proposed that leverage the visual-language matching knowledge of pre-trained models to further improve retrieval performance. However, these methods ignore the fact that both video and text data are composed of events. Capturing the fine-grained similarities between events in a video and events in a text helps compute more accurate video-text semantic similarities and thus improves retrieval performance. Hence, in this paper, we propose a CLIP-based multi-event representation generation method for video-text retrieval, called CLIPMERG. Specifically, CLIPMERG first uses the video encoder and text encoder of the pre-trained CLIP model to transform the video and text inputs into video frame token sequences and word token sequences, respectively. Next, CLIPMERG uses a video (text) event generator to map the video frame (text word) token sequence into k video (text) event representations. Finally, CLIPMERG computes the semantic similarities between videos and texts by capturing the fine-grained similarities between video event representations and text event representations. Extensive experimental results on three widely used benchmark datasets, MSR-VTT, DiDeMo, and LSMDC, show that the proposed CLIPMERG outperforms state-of-the-art baselines on video-text retrieval tasks. © 2023 Science Press. All rights reserved.
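The abstract describes the CLIPMERG pipeline only at a high level. The following minimal PyTorch-style sketch illustrates one plausible reading of it; the EventGenerator design (a Transformer decoder with k learnable event queries), the max-then-mean aggregation in fine_grained_similarity, and all hyperparameters are illustrative assumptions, not details taken from the paper.

# Minimal sketch of a CLIPMERG-style pipeline as described in the abstract.
# Assumptions (not from the paper): PyTorch, CLIP encoder outputs stood in by
# random tensors, a Transformer-decoder event generator with k learnable event
# queries, and a max-over-event-pairs similarity. All sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EventGenerator(nn.Module):
    """Maps a token sequence (frame or word tokens) into k event representations."""

    def __init__(self, dim: int = 512, k: int = 8, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        self.event_queries = nn.Parameter(torch.randn(k, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) -> events: (batch, k, dim)
        queries = self.event_queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return self.decoder(tgt=queries, memory=tokens)


def fine_grained_similarity(video_events: torch.Tensor, text_events: torch.Tensor) -> torch.Tensor:
    """Video-text similarity from event-level matching.

    For every (video, text) pair, compute cosine similarities between all k x k
    event pairs, take the best-matching text event for each video event, and
    average (one plausible aggregation; the paper may aggregate differently).
    """
    v = F.normalize(video_events, dim=-1)   # (Bv, k, d)
    t = F.normalize(text_events, dim=-1)    # (Bt, k, d)
    sim = torch.einsum('vkd,tld->vtkl', v, t)   # (Bv, Bt, k_video, k_text)
    return sim.max(dim=-1).values.mean(dim=-1)  # (Bv, Bt) retrieval score matrix


if __name__ == "__main__":
    # Stand-ins for CLIP outputs: 12 frame tokens per video, 32 word tokens per text.
    frame_tokens = torch.randn(4, 12, 512)
    word_tokens = torch.randn(4, 32, 512)
    video_gen, text_gen = EventGenerator(), EventGenerator()
    scores = fine_grained_similarity(video_gen(frame_tokens), text_gen(word_tokens))
    print(scores.shape)  # torch.Size([4, 4])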
Pages: 2169-2179
Number of pages: 10