Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

Cited by: 27
Authors
Wu, Wenhao [1 ,2 ]
Wang, Xiaohan [3 ]
Luo, Haipeng [4 ]
Wang, Jingdong [2 ]
Yang, Yi [3 ]
Ouyang, Wanli [1 ,5 ]
Affiliations
[1] Univ Sydney, Camperdown, NSW, Australia
[2] Baidu Inc, Beijing, Peoples R China
[3] Zhejiang Univ, Hangzhou, Peoples R China
[4] Univ Chinese Acad Sci, Beijing, Peoples R China
[5] Shanghai AI Lab, Shanghai, Peoples R China
Keywords
DOI
10.1109/CVPR52729.2023.00640
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability on various visual tasks. Transferring knowledge from such powerful VLMs is a promising direction for building effective video recognition models. However, current exploration in this field is still limited. We believe that the greatest value of pre-trained VLMs lies in building a bridge between visual and textual domains. In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition. ii) We also present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner, leading to enhanced video representation. Extensive studies on six popular video datasets, including Kinetics-400 & 600, UCF-101, HMDB-51, ActivityNet and Charades, show that our method achieves state-of-the-art performance in various recognition scenarios, such as general, zero-shot, and few-shot video recognition. Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model. The code is available at https://github.com/whwu95/BIKE.
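To make the bidirectional idea concrete, below is a minimal PyTorch sketch of the two directions described in the abstract: Text-to-Video temporal saliency (Temporal Concept Spotting) and Video-to-Text attribute retrieval (Video Attribute Association). The function names, temperature value, and phrase lexicon are illustrative assumptions, not the authors' released implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn.functional as F

def temporal_concept_spotting(frame_feats, text_feat, tau=0.07):
    """Text-to-Video direction: weight frames by their similarity to a
    category-name embedding, introducing no learnable parameters.

    frame_feats: (T, D) per-frame embeddings from a CLIP-style visual encoder
    text_feat:   (D,)  embedding of the category name from the text encoder
    tau:         softmax temperature (illustrative value, not from the paper)
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    sim = frame_feats @ text_feat                   # (T,) frame-text cosine similarity
    weights = F.softmax(sim / tau, dim=0)           # (T,) temporal saliency weights
    # Saliency-weighted temporal pooling yields the enhanced video representation.
    video_feat = (weights.unsqueeze(-1) * frame_feats).sum(dim=0)  # (D,)
    return video_feat, weights

def video_attribute_association(video_feat, attr_feats, attr_names, k=5):
    """Video-to-Text direction: retrieve the k phrases whose text embeddings
    best match the video, as auxiliary attributes for recognition.

    video_feat: (D,)   pooled video embedding
    attr_feats: (N, D) text embeddings of a pre-encoded phrase lexicon (assumed)
    attr_names: list of N phrase strings
    """
    sims = F.normalize(attr_feats, dim=-1) @ F.normalize(video_feat, dim=-1)  # (N,)
    topk = sims.topk(k).indices
    return [attr_names[i] for i in topk]

if __name__ == "__main__":
    T, D, N = 8, 512, 1000
    frames, text = torch.randn(T, D), torch.randn(D)
    lexicon = torch.randn(N, D)
    names = [f"phrase_{i}" for i in range(N)]
    video_feat, w = temporal_concept_spotting(frames, text)
    print(w.shape, video_attribute_association(video_feat, lexicon, names))
```

Note the design point the abstract emphasizes: the temporal weighting above uses only the pre-trained encoders' similarity scores, so it adds no trainable parameters on top of the VLM.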
Pages: 6620-6630
Page count: 11
Related Papers
50 entries in total
  • [1] Robotic Applications of Pre-Trained Vision-Language Models to Various Recognition Behaviors
    Kawaharazuka, Kento
    Obinata, Yoshiki
    Kanazawa, Naoaki
    Okada, Kei
    Inaba, Masayuki
2023 IEEE-RAS 22ND INTERNATIONAL CONFERENCE ON HUMANOID ROBOTS, HUMANOIDS, 2023
  • [2] Universal Adversarial Perturbations for Vision-Language Pre-trained Models
    Zhang, Peng-Fei
    Huang, Zi
    Bai, Guangdong
    PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024, 2024, : 862 - 871
  • [3] Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models
    Zheng, Kecheng
    Wu, Wei
    Feng, Ruili
    Zhu, Kai
    Liu, Jiawei
    Zhao, Deli
    Zha, Zheng-Jun
    Chen, Wei
    Shen, Yujun
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 11629 - 11639
  • [4] CoCM: Conditional Cross-Modal Learning for Vision-Language Models
    Yang, Juncheng
    Xie, Shuai
    Li, Shuxia
    Cai, Zengyu
    Li, Yijia
    Zhu, Weiping
ELECTRONICS, 2025, 14 (01)
  • [5] CPT: Colorful Prompt Tuning for pre-trained vision-language models
    Yao, Yuan
    Zhang, Ao
    Zhang, Zhengyan
    Liu, Zhiyuan
    Chua, Tat-Seng
    Sun, Maosong
    AI OPEN, 2024, 5 : 30 - 38
  • [6] Multimodal Search on Iconclass using Vision-Language Pre-Trained Models
    Santini, Cristian
    Posthumus, Etienne
    Tietz, Tabea
    Tan, Mary Ann
    Bruns, Oleksandra
    Sack, Harald
    2023 ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES, JCDL, 2023, : 285 - 287
  • [7] Cross-Modal Concept Learning and Inference for Vision-Language Models
    Zhang, Yi
    Zhang, Ce
    Tang, Yushun
    He, Zhihai
    NEUROCOMPUTING, 2024, 583
  • [8] Cross-modal adapter for vision-language retrieval
    Jiang, Haojun
    Zhang, Jianke
    Huang, Rui
    Ge, Chunjiang
    Ni, Zanlin
    Song, Shiji
    Huang, Gao
    PATTERN RECOGNITION, 2025, 159
  • [9] COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation
    Wen, Keyu
    Xia, Jin
    Huang, Yuanyuan
    Li, Linyang
    Xu, Jiayan
    Shao, Jie
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 2188 - 2197
  • [10] CLIP-Llama: A New Approach for Scene Text Recognition with a Pre-Trained Vision-Language Model and a Pre-Trained Language Model
    Zhao, Xiaoqing
    Xu, Miaomiao
    Silamu, Wushour
    Li, Yanbing
    SENSORS, 2024, 24 (22)