BiC-Net: Learning Efficient Spatio-temporal Relation for Text-Video Retrieval

Cited by: 4
Authors
Han, Ning [1 ]
Zeng, Yawen [2 ]
Shi, Chuhao [3 ]
Xiao, Guangyi [3 ]
Chen, Hao [3 ]
Chen, Jingjing [4 ]
Affiliations
[1] Xiangtan Univ, Sch Comp Sci, Xiangtan 411105, Peoples R China
[2] Bytedance AI Lab, 43 North Third Ring West Rd, Beijing 100098, Peoples R China
[3] Hunan Univ, Coll Comp Sci & Elect Engn, 116 Lu Shan South Rd, Changsha 410082, Peoples R China
[4] Fudan Univ, Sch Comp Sci, 20 Handan Rd, Shanghai 200433, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Text-video retrieval; spatio-temporal relation; bi-branch complementary network; IMAGE;
DOI
10.1145/3627103
Chinese Library Classification
TP [Automation Technology; Computer Technology];
Discipline Code
0812;
Abstract
The task of text-video retrieval aims to understand the correspondence between language and vision and has gained increasing attention in recent years. Recent works have demonstrated the superiority of local spatio-temporal relation learning with graph-based models. However, most existing graph-based models are handcrafted and depend heavily on expert knowledge and empirical feedback, which may prevent them from effectively mining high-level fine-grained visual relations. As a result, they cannot distinguish videos that share the same visual components but differ in the relations among them. To solve this problem, we propose a novel cross-modal retrieval framework, Bi-Branch Complementary Network (BiC-Net), which modifies the Transformer architecture to effectively bridge the text and video modalities in a complementary manner by combining local spatio-temporal relations with global temporal information. Specifically, local video representations are encoded using multiple Transformer blocks and additional residual blocks to learn fine-grained spatio-temporal relations and long-term temporal dependencies; we call this module the Fine-grained Spatio-temporal Transformer (FST). Global video representations are encoded using a multi-layer Transformer block to learn global temporal features. Finally, we align the spatio-temporal relation features and the global temporal features with the text feature in two embedding spaces for cross-modal text-video retrieval. Extensive experiments are conducted on the MSR-VTT, MSVD, and YouCook2 datasets, and the results demonstrate the effectiveness of our proposed model. Our code is publicly available at https://github.com/lionel-hing/BiC-Net.
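The abstract's final step, aligning the spatio-temporal and global temporal video features with the text feature in two separate embedding spaces, can be sketched as a fused similarity score. The sketch below is a minimal illustration, not the authors' implementation: the function names, the toy embedding dimension, and the equal weighting of the two branch similarities are all assumptions.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def bi_branch_score(text_st, video_st, text_gl, video_gl, alpha=0.5):
    """Hypothetical fused text-video score: combine the similarity computed
    in the spatio-temporal embedding space with the one computed in the
    global temporal embedding space. The equal weighting (alpha=0.5) is an
    assumption, not a detail taken from the paper."""
    return (alpha * cosine_sim(text_st, video_st)
            + (1.0 - alpha) * cosine_sim(text_gl, video_gl))

# Toy example: rank two candidate videos for one text query.
rng = np.random.default_rng(0)
d = 8  # toy embedding dimension (assumption)
text_st, text_gl = rng.normal(size=d), rng.normal(size=d)
videos = [(rng.normal(size=d), rng.normal(size=d)) for _ in range(2)]
scores = [bi_branch_score(text_st, v_st, text_gl, v_gl)
          for v_st, v_gl in videos]
best = int(np.argmax(scores))  # index of the best-matching video
```

At retrieval time, such a score would be computed between the query text and every video in the gallery, and videos returned in descending score order.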
Pages: 21
Related Papers
50 records in total
  • [1] Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval
    Li, Pandeng
    Xie, Chen-Wei
    Zhao, Liming
    Xie, Hongtao
    Ge, Jiannan
    Zheng, Yun
    Zhao, Deli
    Zhang, Yongdong
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 4077 - 4087
  • [2] Learning Linguistic Association Towards Efficient Text-Video Retrieval
    Fang, Sheng
    Wang, Shuhui
    Zhuo, Junbao
    Han, Xinzhe
    Huang, Qingming
    COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696 : 254 - 270
  • [3] An efficient approach for video retrieval by spatio-temporal features
    Kumar, G. S. Naveen
    Reddy, V. S. K.
    INTERNATIONAL JOURNAL OF KNOWLEDGE-BASED AND INTELLIGENT ENGINEERING SYSTEMS, 2019, 23 (04) : 311 - 316
  • [4] KnowER: Knowledge enhancement for efficient text-video retrieval
    Kou H.
    Yang Y.
    Hua Y.
    Intelligent and Converged Networks, 2023, 4 (02): : 93 - 105
  • [5] CenterCLIP: Token Clustering for Efficient Text-Video Retrieval
    Zhao, Shuai
    Zhu, Linchao
    Wang, Xiaohan
    Yang, Yi
    PROCEEDINGS OF THE 45TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '22), 2022, : 970 - 981
  • [6] Spatio-Temporal feature based VLAD for efficient Video retrieval
    Reddy, Mopuri K.
    Arora, Sahil
    Babu, R. Venkatesh
    2013 FOURTH NATIONAL CONFERENCE ON COMPUTER VISION, PATTERN RECOGNITION, IMAGE PROCESSING AND GRAPHICS (NCVPRIPG), 2013,
  • [7] Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval
    Deng, Chaorui
    Chen, Qi
    Qin, Pengda
    Chen, Da
    Wu, Qi
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15602 - 15612
  • [8] CRET: Cross-Modal Retrieval Transformer for Efficient Text-Video Retrieval
    Ji, Kaixiang
    Liu, Jiajia
    Hong, Weixiang
    Zhong, Liheng
    Wang, Jian
    Chen, Jingdong
    Chu, Wei
    PROCEEDINGS OF THE 45TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '22), 2022, : 949 - 959
  • [9] Text-guided distillation learning to diversify video embeddings for text-video retrieval
    Lee, Sangmin
    Kim, Hyung-Il
    Ro, Yong Man
    PATTERN RECOGNITION, 2024, 156
  • [10] A spatio-temporal pyramid matching for video retrieval
    Choi, Jaesik
    Wang, Ziyu
    Lee, Sang-Chul
    Jeon, Won J.
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2013, 117 (06) : 660 - 669