Joint embeddings with multimodal cues for video-text retrieval

Cited by: 0
Authors
Niluthpol C. Mithun
Juncheng Li
Florian Metze
Amit K. Roy-Chowdhury
Affiliations
[1] University of California,
[2] Carnegie Mellon University
Keywords
Video-text retrieval; Joint embedding; Multimodal cues
DOI
Not available
Abstract
For multimedia applications, a joint representation that carries information from multiple modalities can be very useful for downstream use cases. In this paper, we study how to effectively utilize the available multimodal cues from videos to learn joint representations for the cross-modal video-text retrieval task. Existing hand-labeled video-text datasets are often very limited in size considering the enormous diversity of the visual world. This makes it extremely difficult to develop a robust video-text retrieval system based on deep neural network models. To address this, we propose a framework that simultaneously utilizes multimodal visual cues through a “mixture of experts” approach for retrieval. We conduct extensive experiments to verify that our system boosts retrieval performance compared to the state of the art. In addition, we propose a modified pairwise ranking loss function for training the embedding and study the effect of various loss functions. Experiments on two benchmark datasets show that our approach yields significant gains compared to the state of the art.
Pages: 3 / 18
Number of pages: 15
Related papers
50 in total
  • [31] SEMANTIC-PRESERVING METRIC LEARNING FOR VIDEO-TEXT RETRIEVAL
    Choo, Sungkwon
    Ha, Seong Jong
    Lee, Joonsoo
    2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021: 2388 - 2392
  • [32] STACKED CONVOLUTIONAL DEEP ENCODING NETWORK FOR VIDEO-TEXT RETRIEVAL
    Zhao, Rui
    Zheng, Kecheng
    Zha, Zheng-jun
    2020 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2020
  • [33] Improving Transformer with Dynamic Convolution and Shortcut for Video-Text Retrieval
    Liu, Zhi
    Cai, Jincen
    Zhang, Mengmeng
    KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS, 2022, 16 (07): 2407 - 2424
  • [34] Unified Coarse-to-Fine Alignment for Video-Text Retrieval
    Wang, Ziyang
    Sung, Yi-Lin
    Cheng, Feng
    Bertasius, Gedas
    Bansal, Mohit
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023: 2804 - 2815
  • [35] Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval
    Lin, Chengzhi
    Wu, Ancong
    Liang, Junwei
    Zhang, Jun
    Ge, Wenhang
    Zheng, Wei-Shi
    Shen, Chunhua
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [36] Multimodal video-text matching using a deep bifurcation network and joint embedding of visual and textual features
    Nabati, Masoomeh
    Behrad, Alireza
    EXPERT SYSTEMS WITH APPLICATIONS, 2021, 184
  • [37] LOOK, TELL AND MATCH: REFINING VIDEO-TEXT RETRIEVAL WITH SEMANTIC INFORMATION
    Zhu, Jinkuan
    Hu, Weiyi
    2022 19TH INTERNATIONAL COMPUTER CONFERENCE ON WAVELET ACTIVE MEDIA TECHNOLOGY AND INFORMATION PROCESSING (ICCWAMTIP), 2022
  • [38] Boosting Video-Text Retrieval with Explicit High-Level Semantics
    Wang, Haoran
    Xu, Di
    He, Dongliang
    Li, Fu
    Ji, Zhong
    Han, Jungong
    Ding, Errui
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022: 4887 - 4898
  • [39] Learning Semantics-Grounded Vocabulary Representation for Video-Text Retrieval
    Shi, Yaya
    Liu, Haowei
    Xu, Haiyang
    Ma, Zongyang
    Ye, Qinghao
    Hu, Anwen
    Yan, Ming
    Zhang, Ji
    Huang, Fei
    Yuan, Chunfeng
    Li, Bing
    Hu, Weiming
    Zha, Zheng-Jun
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023: 4460 - 4470
  • [40] Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning
    Chen, Shizhe
    Zhao, Yida
    Jin, Qin
    Wu, Qi
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020: 10635 - 10644