BiC-Net: Learning Efficient Spatio-temporal Relation for Text-Video Retrieval

Cited by: 4
Authors
Han, Ning [1 ]
Zeng, Yawen [2 ]
Shi, Chuhao [3 ]
Xiao, Guangyi [3 ]
Chen, Hao [3 ]
Chen, Jingjing [4 ]
Affiliations
[1] Xiangtan Univ, Sch Comp Sci, Xiangtan 411105, Peoples R China
[2] Bytedance AI Lab, 43 North Third Ring West Rd, Beijing 100098, Peoples R China
[3] Hunan Univ, Coll Comp Sci & Elect Engn, 116 Lu Shan South Rd, Changsha 410082, Peoples R China
[4] Fudan Univ, Sch Comp Sci, 20 Handan Rd, Shanghai 200433, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Text-video retrieval; spatio-temporal relation; bi-branch complementary network; IMAGE;
DOI
10.1145/3627103
Chinese Library Classification (CLC)
TP [Automation and Computer Technology];
Discipline code
0812 ;
Abstract
The task of text-video retrieval aims to understand the correspondence between language and vision and has gained increasing attention in recent years. Recent works have demonstrated the superiority of local spatio-temporal relation learning with graph-based models. However, most existing graph-based models are handcrafted and depend heavily on expert knowledge and empirical feedback, which may make them unable to mine high-level fine-grained visual relations effectively. These limitations result in their inability to distinguish videos with the same visual components but different relations. To solve this problem, we propose a novel cross-modal retrieval framework, Bi-Branch Complementary Network (BiC-Net), which modifies the Transformer architecture to effectively bridge text-video modalities in a complementary manner by combining local spatio-temporal relations and global temporal information. Specifically, local video representations are encoded using multiple Transformer blocks and additional residual blocks to learn fine-grained spatio-temporal relations and long-term temporal dependency; we call this module the Fine-grained Spatio-temporal Transformer (FST). Global video representations are encoded using a multi-layer Transformer block to learn global temporal features. Finally, we align the spatio-temporal relation features and the global temporal features with the text feature in two embedding spaces for cross-modal text-video retrieval. Extensive experiments are conducted on the MSR-VTT, MSVD, and YouCook2 datasets. The results demonstrate the effectiveness of our proposed model. Our code is publicly available at https://github.com/lionel-hing/BiC-Net.
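The two-embedding-space matching described in the abstract can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the FST and global Transformer encoders are replaced by stand-in random linear projections, and all dimensions (`d_video_local`, `d_video_global`, `d_text`, `d_embed`) are hypothetical. What it shows is the structural idea of scoring each video-text pair in a local (spatio-temporal) space and a global (temporal) space, then combining the two similarities.

```python
import numpy as np

def l2norm(x, axis=-1):
    # Normalize rows to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

class TwoSpaceRetrieval:
    """Toy sketch of bi-branch text-video matching in two embedding spaces.

    In BiC-Net the local branch is a Fine-grained Spatio-temporal
    Transformer and the global branch a multi-layer Transformer; here
    both are stand-in linear maps purely to show the scoring structure.
    """

    def __init__(self, d_video_local, d_video_global, d_text, d_embed, seed=0):
        rng = np.random.default_rng(seed)
        self.W_local = rng.standard_normal((d_video_local, d_embed))
        self.W_global = rng.standard_normal((d_video_global, d_embed))
        self.W_text_l = rng.standard_normal((d_text, d_embed))
        self.W_text_g = rng.standard_normal((d_text, d_embed))

    def score(self, v_local, v_global, t):
        # Cosine similarity in each embedding space, summed into one score.
        s_local = l2norm(v_local @ self.W_local) @ l2norm(t @ self.W_text_l).T
        s_global = l2norm(v_global @ self.W_global) @ l2norm(t @ self.W_text_g).T
        return s_local + s_global

# Hypothetical feature dimensions: 5 videos, 3 text queries.
model = TwoSpaceRetrieval(d_video_local=64, d_video_global=32, d_text=48, d_embed=16)
videos_local = np.random.default_rng(1).standard_normal((5, 64))
videos_global = np.random.default_rng(2).standard_normal((5, 32))
queries = np.random.default_rng(3).standard_normal((3, 48))
S = model.score(videos_local, videos_global, queries)  # shape (5, 3)
```

Ranking the videos for a query then reduces to sorting its column of `S`; each entry is a sum of two cosine similarities, so scores lie in [-2, 2].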
Pages: 21
Related papers
50 records
  • [41] Interactive spatio-temporal visual map model for web video retrieval
    Luan, Huan-Bo
    Lin, Shou-Xun
    Tang, Sheng
    Neo, Shi-Yong
    Chua, Tat-Seng
    2007 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOLS 1-5, 2007, : 560 - +
  • [42] Conditional deep clustering based transformed spatio-temporal features and fused distance for efficient video retrieval
    Banerjee A.
    Kumar E.
    Ravinder M.
    International Journal of Information Technology, 2023, 15 (5) : 2349 - 2355
  • [43] Efficient spatio-temporal decomposition for perceptual processing of video sequences
    Lindh, P
    Lambrecht, CJVB
    INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, PROCEEDINGS - VOL III, 1996, : 331 - 334
  • [44] Deep Spatio-Temporal Random Fields for Efficient Video Segmentation
    Chandra, Siddhartha
    Couprie, Camille
    Kokkinos, Iasonas
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 8915 - 8924
  • [45] Efficient Online Spatio-Temporal Filtering for Video Event Detection
    Yan, Xinchen
    Yuan, Junsong
    Liang, Hui
    COMPUTER VISION - ECCV 2014 WORKSHOPS, PT I, 2015, 8925 : 769 - 785
  • [46] Efficient Motion Weighted Spatio-Temporal Video SSIM Index
    Moorthy, Anush K.
    Bovik, Alan C.
    HUMAN VISION AND ELECTRONIC IMAGING XV, 2010, 7527
  • [47] An Empirical Investigation of Efficient Spatio-Temporal Modeling in Video Restoration
    Fan, Yuchen
    Yu, Jiahui
    Liu, Ding
    Huang, Thomas S.
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2019), 2019, : 2159 - 2168
  • [48] Video Summarization Through Reinforcement Learning With a 3D Spatio-Temporal U-Net
    Liu, Tianrui
    Meng, Qingjie
    Huang, Jun-Jie
    Vlontzos, Athanasios
    Rueckert, Daniel
    Kainz, Bernhard
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 1573 - 1586
  • [49] STEP: Spatio-Temporal Progressive Learning for Video Action Detection
    Yang, Xitong
    Yang, Xiaodong
    Liu, Ming-Yu
    Xiao, Fanyi
    Davis, Larry
    Kautz, Jan
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 264 - 272
  • [50] Learning Spatio-temporal Representation by Channel Aliasing Video Perception
    Lin, Yiqi
    Wang, Jinpeng
    Zhang, Manlin
    Ma, Andy J.
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 2317 - 2325