Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

Cited by: 0
Authors
Tian, Kaibin [1]
Cheng, Yanhua [1]
Liu, Yi [1]
Hou, Xinglin [1]
Chen, Quan [1]
Li, Han [1]
Affiliations
[1] Kuaishou Technology, Beijing, People's Republic of China
Keywords
CLIP;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In recent years, CLIP-based text-to-video retrieval methods have developed rapidly. The main direction of this evolution is to exploit a wider range of visual and textual cues to achieve alignment. Concretely, the best-performing methods often design a heavy fusion block for sentence (words)-video (frames) interaction, despite its prohibitive computational complexity. As a result, these approaches are suboptimal in both feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, which ensures that the model captures visual content features ranging from abstract to detailed levels during the training phase. To better leverage the multi-granularity features, we devise a two-stage retrieval architecture for the retrieval phase. This design balances the coarse and fine granularity of the retrieved content, and it also balances retrieval effectiveness against efficiency. Specifically, in the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning and embed an extra Pearson Constraint to optimize cross-modal representation learning. In the retrieval phase, we use coarse-grained video representations for fast recall of the top-k candidates, which are then reranked using fine-grained video representations. Extensive experiments on four benchmarks demonstrate the efficiency and effectiveness of the method. Notably, our method achieves performance comparable to current state-of-the-art methods while being nearly 50 times faster.
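The coarse-recall-then-rerank idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the softmax-weighted frame pooling used as a stand-in for the text-gated interaction block, the `temperature` value, and the toy features are all assumptions made for illustration only.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def text_gated_pool(text_vec, frame_feats, temperature=0.01):
    """Hypothetical parameter-free pooling: frames are weighted by a softmax
    over their similarity to the text (an assumed stand-in, not the paper's TIB)."""
    sims = frame_feats @ text_vec                      # (num_frames,)
    weights = np.exp((sims - sims.max()) / temperature)
    weights /= weights.sum()
    return l2_normalize(weights @ frame_feats)         # (dim,)


def two_stage_retrieve(text_vec, coarse_video, frame_feats, k=10):
    """Stage 1: recall top-k videos with one coarse vector per video.
    Stage 2: rerank only those k candidates with text-conditioned features."""
    coarse_scores = coarse_video @ text_vec            # (num_videos,)
    k = min(k, len(coarse_scores))
    candidates = np.argsort(-coarse_scores)[:k]
    fine_scores = np.array(
        [text_gated_pool(text_vec, frame_feats[i]) @ text_vec for i in candidates]
    )
    order = np.argsort(-fine_scores)
    return candidates[order], fine_scores[order]


# Toy data (assumed): 100 videos, 12 frames each, 64-dimensional features.
rng = np.random.default_rng(0)
frames = l2_normalize(rng.normal(size=(100, 12, 64)))
coarse = l2_normalize(frames.mean(axis=1))             # mean-pooled coarse vectors
query = l2_normalize(rng.normal(size=64))
ranked_ids, ranked_scores = two_stage_retrieve(query, coarse, frames, k=10)
print(ranked_ids, ranked_scores)
```

In this sketch the coarse stage costs one dot product per video, while the more expensive text-conditioned pooling runs only on the k recalled candidates, which is the kind of trade-off between effectiveness and efficiency the abstract describes.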
Pages: 5207-5214
Page count: 8