Joint embeddings with multimodal cues for video-text retrieval

Cited by: 0
Authors
Niluthpol C. Mithun
Juncheng Li
Florian Metze
Amit K. Roy-Chowdhury
Affiliations
[1] University of California,
[2] Carnegie Mellon University
Keywords
Video-text retrieval; Joint embedding; Multimodal cues;
DOI
Not available
Abstract
For multimedia applications, a joint representation that carries information from multiple modalities can be very useful for downstream use cases. In this paper, we study how to effectively utilize the multimodal cues available in videos when learning joint representations for the cross-modal video-text retrieval task. Existing hand-labeled video-text datasets are often very limited in size, considering the enormous diversity of the visual world. This makes it extremely difficult to develop a robust video-text retrieval system based on deep neural network models. To address this, we propose a framework that simultaneously utilizes multimodal visual cues through a “mixture of experts” approach to retrieval. We also propose a modified pairwise ranking loss function for training the embedding and study the effect of various loss functions. Extensive experiments on two benchmark datasets show that our approach yields a significant gain over the state of the art.
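The abstract does not spell out the modified loss or how the experts are combined, so the following is only a minimal sketch of the general technique family it names, not the authors' method. It shows a standard bidirectional max-margin pairwise ranking loss over a batch of matched video-text pairs, plus a simple weighted-sum fusion of per-expert similarity matrices. The function names, the margin value, the `hardest_only` switch, and the weighted-sum fusion rule are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def cosine_similarity_matrix(video_emb, text_emb):
    """sim[i, j] = cosine similarity between video i and text j."""
    return l2_normalize(video_emb) @ l2_normalize(text_emb).T


def pairwise_ranking_loss(sim, margin=0.2, hardest_only=False):
    """Bidirectional max-margin ranking loss (illustrative, standard form).

    sim is an (n, n) matrix whose diagonal holds the matched pairs.
    hardest_only=False sums the hinge cost over all negatives;
    hardest_only=True keeps only the most violating negative per query.
    """
    n = sim.shape[0]
    pos = np.diag(sim)
    # Video as query: text j is a negative for video i (compare within row i).
    cost_text = np.maximum(0.0, margin + sim - pos[:, None])
    # Text as query: video i is a negative for text j (compare within column j).
    cost_video = np.maximum(0.0, margin + sim - pos[None, :])
    off_diagonal = ~np.eye(n, dtype=bool)  # matched pairs are not negatives
    cost_text = cost_text * off_diagonal
    cost_video = cost_video * off_diagonal
    if hardest_only:
        return float(cost_text.max(axis=1).sum() + cost_video.max(axis=0).sum())
    return float(cost_text.sum() + cost_video.sum())


def fuse_expert_similarities(sims, weights=None):
    """Combine per-expert similarity matrices by a weighted sum (an assumption)."""
    sims = np.stack(sims)
    if weights is None:
        weights = np.full(len(sims), 1.0 / len(sims))
    weights = np.asarray(weights, dtype=float)
    return np.tensordot(weights, sims, axes=1)
```

In this sketch each "expert" would be a separate joint space built from one video cue, and retrieval would rank candidates by the fused similarity; how the experts are actually weighted and trained is left to the paper itself.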
Pages: 3 - 18
Page count: 15