Joint embeddings with multimodal cues for video-text retrieval

Cited by: 0
Authors
Niluthpol C. Mithun
Juncheng Li
Florian Metze
Amit K. Roy-Chowdhury
Affiliations
[1] University of California,
[2] Carnegie Mellon University
Keywords
Video-text retrieval; Joint embedding; Multimodal cues
DOI
Not available
Abstract
For multimedia applications, a joint representation that carries information from multiple modalities can be very useful for downstream use cases. In this paper, we study how to effectively utilize the multimodal cues available in videos when learning joint representations for the cross-modal video-text retrieval task. Existing hand-labeled video-text datasets are often very limited in size given the enormous diversity of the visual world, which makes it extremely difficult to develop a robust video-text retrieval system based on deep neural network models. To address this, we propose a framework that simultaneously utilizes multimodal visual cues through a “mixture of experts” approach for retrieval. We also propose a modified pairwise ranking loss function for training the embedding and study the effect of various loss functions. Extensive experiments on two benchmark datasets show that our approach yields significant gains over the state of the art.
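The abstract does not say how the pairwise ranking loss is modified, so the block below is only a minimal sketch of a common baseline form: a bidirectional max-margin ranking loss over a batch of matched video and text embeddings. It is not the authors' loss. The function names, the cosine similarity, and the margin of 0.2 are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_ranking_loss(
    video_embs: List[Sequence[float]],
    text_embs: List[Sequence[float]],
    margin: float = 0.2,  # assumed value, not taken from the paper
) -> float:
    """Baseline bidirectional max-margin ranking loss (illustrative only).

    video_embs[i] and text_embs[i] are assumed to be a matching pair;
    every other combination in the batch is treated as a negative.
    """
    n = len(video_embs)
    sims = [[cosine(v, t) for t in text_embs] for v in video_embs]
    loss = 0.0
    for i in range(n):
        pos = sims[i][i]
        for j in range(n):
            if j == i:
                continue
            # Video i should score higher with its own text than with text j.
            loss += max(0.0, margin - pos + sims[i][j])
            # Text i should score higher with its own video than with video j.
            loss += max(0.0, margin - pos + sims[j][i])
    return loss / n


if __name__ == "__main__":
    videos = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(pairwise_ranking_loss(videos, aligned_texts))
    print(pairwise_ranking_loss(videos, swapped_texts))
```

In this sketch, correctly aligned pairs that beat every negative by at least the margin contribute zero loss, while mismatched pairs are penalized in both retrieval directions (video-to-text and text-to-video).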
Pages: 3-18
Page count: 15
Related papers
50 in total
  • [1] Joint embeddings with multimodal cues for video-text retrieval
    Mithun, Niluthpol C.
    Li, Juncheng
    Metze, Florian
    Roy-Chowdhury, Amit K.
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2019, 8 (01) : 3 - 18
  • [2] Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval
    Mithun, Niluthpol Chowdhury
    Li, Juncheng
    Metze, Florian
    Roy-Chowdhury, Amit K.
    ICMR '18: PROCEEDINGS OF THE 2018 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2018, : 19 - 27
  • [3] An Efficient Multimodal Aggregation Network for Video-Text Retrieval
    Liu, Zhi
    Zhao, Fangyuan
    Zhang, Mengmeng
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2022, E105D (10) : 1825 - 1828
  • [4] Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval
    Ma, Wentao
    Chen, Qingchao
    Zhou, Tongqing
    Zhao, Shan
    Cai, Zhiping
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (10) : 5486 - 5497
  • [5] CLIP2TF: Multimodal video-text retrieval for adolescent education
    Sun, Xiaoning
    Fan, Tao
    Li, Hongxu
    Wang, Guozhong
    Ge, Peien
    Shang, Xiwu
    DISPLAYS, 2024, 84
  • [6] Temporal Multimodal Graph Transformer With Global-Local Alignment for Video-Text Retrieval
    Feng, Zerun
    Zeng, Zhimin
    Guo, Caili
    Li, Zheng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (03) : 1438 - 1453
  • [7] Multi-event Video-Text Retrieval
    Zhang, Gengyuan
    Ren, Jisen
    Gu, Jindong
    Tresp, Volker
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 22056 - 22066
  • [8] A NOVEL CONVOLUTIONAL ARCHITECTURE FOR VIDEO-TEXT RETRIEVAL
    Li, Zheng
    Guo, Caili
    Yang, Bo
    Feng, Zerun
    Zhang, Hao
    2020 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2020,
  • [9] Deep learning for video-text retrieval: a review
    Zhu, Cunjuan
    Jia, Qi
    Chen, Wei
    Guo, Yanming
    Liu, Yu
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2023, 12 (01)
  • [10] Progressive Semantic Matching for Video-Text Retrieval
    Liu, Hongying
    Luo, Ruyi
    Shang, Fanhua
    Niu, Mantang
    Liu, Yuanyuan
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 5083 - 5091