BiC-Net: Learning Efficient Spatio-temporal Relation for Text-Video Retrieval

Cited by: 4
Authors
Han, Ning [1]
Zeng, Yawen [2]
Shi, Chuhao [3]
Xiao, Guangyi [3]
Chen, Hao [3]
Chen, Jingjing [4]
Affiliations
[1] Xiangtan Univ, Sch Comp Sci, Xiangtan 411105, Peoples R China
[2] Bytedance AI Lab, 43 North Third Ring West Rd, Beijing 100098, Peoples R China
[3] Hunan Univ, Coll Comp Sci & Elect Engn, 116 Lu Shan South Rd, Changsha 410082, Peoples R China
[4] Fudan Univ, Sch Comp Sci, 20 Handan Rd, Shanghai 200433, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Text-video retrieval; spatio-temporal relation; bi-branch complementary network; IMAGE;
DOI
10.1145/3627103
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The task of text-video retrieval aims to understand the correspondence between language and vision and has gained increasing attention in recent years. Recent works have demonstrated the superiority of local spatio-temporal relation learning with graph-based models. However, most existing graph-based models are handcrafted and depend heavily on expert knowledge and empirical feedback, so they may fail to mine high-level, fine-grained visual relations effectively. As a result, they cannot distinguish videos that share the same visual components but differ in the relations among them. To solve this problem, we propose a novel cross-modal retrieval framework, Bi-Branch Complementary Network (BiC-Net), which modifies the Transformer architecture to bridge the text and video modalities in a complementary manner by combining local spatio-temporal relations with global temporal information. Specifically, local video representations are encoded by a module we call the Fine-grained Spatio-temporal Transformer (FST), which uses multiple Transformer blocks and additional residual blocks to learn fine-grained spatio-temporal relations and long-term temporal dependency. Global video representations are encoded using a multi-layer Transformer block to learn global temporal features. Finally, we align the spatio-temporal relation and global temporal features with the text feature in two embedding spaces for cross-modal text-video retrieval. Extensive experiments are conducted on the MSR-VTT, MSVD, and YouCook2 datasets. The results demonstrate the effectiveness of our proposed model. Our code is public at https://github.com/lionel-hing/BiC-Net.
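The final alignment step described in the abstract, matching a text query against videos in two embedding spaces, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the two similarities are combined, so the weighted sum of cosine similarities below is an assumption, and the names `cosine`, `rank_videos`, and `weight` are hypothetical. The embeddings are toy vectors, not model outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_videos(text_local, text_global, videos, weight=0.5):
    """Rank videos for one text query using two embedding spaces.

    text_local / text_global: the query embedded in the local-relation
        space and in the global-temporal space (assumed to be given).
    videos: dict mapping a video name to (local_embedding, global_embedding).
    weight: assumed mixing coefficient between the two similarities.
    Returns video names sorted from best to worst match.
    """
    scores = {}
    for name, (v_local, v_global) in videos.items():
        local_sim = cosine(text_local, v_local)
        global_sim = cosine(text_global, v_global)
        scores[name] = weight * local_sim + (1.0 - weight) * global_sim
    return sorted(scores, key=scores.get, reverse=True)


# Two toy videos with identical global features but different local relations.
videos = {
    "video_a": ([1.0, 0.0], [0.0, 1.0]),
    "video_b": ([0.0, 1.0], [0.0, 1.0]),
}
ranking = rank_videos([1.0, 0.0], [0.0, 1.0], videos)
print(ranking)
```

In this toy case the global branch alone cannot separate the two videos, since both have the same global embedding; the local branch breaks the tie, which mirrors the complementary role the abstract assigns to the two branches.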
Pages: 21