Text-guided distillation learning to diversify video embeddings for text-video retrieval

Times cited: 0

Authors:

Lee, Sangmin [1]

Kim, Hyung-Il [2]

Ro, Yong Man [3]

Affiliations:

[1] Univ Illinois, Dept Comp Sci, Urbana, IL 61801 USA

[2] Elect & Telecommun Res Inst, Visual Intelligence Res Sect, Daejeon 34129, South Korea

[3] Korea Adv Inst Sci & Technol, Image & Video Syst Lab, Daejeon 34141, South Korea

Source:

PATTERN RECOGNITION | 2024 / Vol. 156

Keywords:

Text-video retrieval; Diverse video embedding; Text-guided distillation learning; Text-agnostic; One-to-many correspondence

DOI:

10.1016/j.patcog.2024.110754

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Conventional text-video retrieval methods typically match a video with a text in a one-to-one manner. However, a single video can contain diverse semantics, and its text descriptions can vary significantly. Such methods therefore fail to match a video with multiple texts simultaneously. In this paper, we propose a novel approach to this one-to-many correspondence problem in text-video retrieval. We devise diverse temporal aggregation and a multi-key memory to address temporal and semantic diversity, respectively, thereby constructing multiple video embedding paths from a single video. Additionally, we introduce text-guided distillation learning, which enables each video path to acquire meaningfully distinct competencies in representing varied semantics. Our video embedding approach is text-agnostic, so the prepared video embeddings can be reused for any new text query. Experiments show that our method outperforms existing methods on four datasets. We further validate the effectiveness of our designs with ablation studies and analyses of the diverse video embeddings.
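The abstract describes text-agnostic, multi-path video embeddings that are prepared once and then matched against arbitrary text queries, but it does not say how the several path embeddings of one video are combined at query time. The following is a minimal sketch of that retrieval setting in pure Python, assuming each video is stored as K precomputed, L2-normalized path embeddings and that a video's score is the maximum cosine similarity over its paths. The max-over-paths rule is an assumption, not necessarily the authors' scoring rule, and the function names and toy 3-D vectors are hypothetical and do not come from any trained model.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def build_video_index(videos: Dict[str, List[Sequence[float]]]) -> Dict[str, List[Vector]]:
    """Normalize every path embedding once, before any text query is seen.

    Nothing here depends on the text (the text-agnostic property), so the
    stored index can be reused for every new query.
    """
    return {vid: [l2_normalize(p) for p in paths] for vid, paths in videos.items()}


def path_scores(text_emb: Sequence[float], paths: List[Vector]) -> List[float]:
    """Cosine similarity between one text query and each path of one video."""
    t = l2_normalize(text_emb)
    return [dot(t, p) for p in paths]


def video_score(text_emb: Sequence[float], paths: List[Vector]) -> float:
    # ASSUMPTION: a video scores as well as its best-matching path.
    return max(path_scores(text_emb, paths))


def best_path(text_emb: Sequence[float], paths: List[Vector]) -> int:
    """Index of the path that matched the query best (for inspection)."""
    scores = path_scores(text_emb, paths)
    return max(range(len(scores)), key=scores.__getitem__)


def retrieve(text_emb: Sequence[float], index: Dict[str, List[Vector]], top_k: int = 1) -> List[str]:
    """Rank all videos in the precomputed index for one text query."""
    ranked = sorted(index, key=lambda vid: video_score(text_emb, index[vid]), reverse=True)
    return ranked[:top_k]


# Hypothetical toy embeddings: video_a has two paths capturing two semantics.
index = build_video_index({
    "video_a": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "video_b": [[0.0, 0.0, 1.0], [0.0, 0.6, 0.8]],
})

for query in ([0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]):
    top = retrieve(query, index)[0]
    print(query, "->", top, "via path", best_path(query, index[top]))
```

In this toy setup the first two queries both retrieve video_a, each through a different path, which illustrates the one-to-many correspondence the abstract targets; the index itself is built without reference to any query and is simply reused.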


Pages: 10
