Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

Cited by: 27
Authors
Wu, Wenhao [1,2]
Wang, Xiaohan [3]
Luo, Haipeng [4]
Wang, Jingdong [2]
Yang, Yi [3]
Ouyang, Wanli [1,5]
Affiliations
[1] Univ Sydney, Camperdown, NSW, Australia
[2] Baidu Inc, Beijing, Peoples R China
[3] Zhejiang Univ, Hangzhou, Peoples R China
[4] Univ Chinese Acad Sci, Beijing, Peoples R China
[5] Shanghai AI Lab, Shanghai, Peoples R China
Keywords
DOI
10.1109/CVPR52729.2023.00640
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability on various visual tasks. Transferring knowledge from such powerful VLMs is a promising direction for building effective video recognition models. However, current exploration in this field is still limited. We believe that the greatest value of pre-trained VLMs lies in building a bridge between visual and textual domains. In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition. ii) We also present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner, leading to enhanced video representation. Extensive studies on six popular video datasets, including Kinetics-400 & 600, UCF-101, HMDB-51, ActivityNet and Charades, show that our method achieves state-of-the-art performance in various recognition scenarios, such as general, zero-shot, and few-shot video recognition. Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model. The code is available at https://github.com/whwu95/BIKE.
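To make the abstract's Text-to-Video mechanism concrete, below is a minimal PyTorch sketch of the parameter-free Temporal Concept Spotting idea: per-frame embeddings from a CLIP-style visual encoder are scored against the category's text embedding, and the resulting temporal saliency weights pool the frames into a video representation. This is an illustrative reconstruction from the abstract only, not the authors' released implementation; the function name, tensor shapes, and the temperature value are assumptions.

    import torch
    import torch.nn.functional as F

    def temporal_concept_spotting(frame_feats, text_feat, temperature=0.01):
        """Saliency-weighted frame pooling (illustrative; not the paper's code).

        frame_feats: (T, D) per-frame embeddings from a CLIP-style visual encoder.
        text_feat:   (D,) embedding of the category name from the text encoder.
        Returns a (D,) video embedding; no learnable parameters are introduced.
        """
        frame_feats = F.normalize(frame_feats, dim=-1)
        text_feat = F.normalize(text_feat, dim=-1)
        saliency = frame_feats @ text_feat                  # (T,) text-to-frame similarity
        weights = F.softmax(saliency / temperature, dim=0)  # emphasize salient frames
        # Parameter-free pooling: weighted sum over the temporal axis.
        return (weights.unsqueeze(-1) * frame_feats).sum(dim=0)

    # Toy usage: 8 frames of 512-d features and one class-name embedding.
    video_emb = temporal_concept_spotting(torch.randn(8, 512), torch.randn(512))
    print(video_emb.shape)  # torch.Size([512])

The Video-to-Text direction (Video Attribute Association) could analogously rank a word lexicon by similarity to the pooled video embedding and feed the top-scoring words back as auxiliary textual attributes, per the abstract's description.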
Pages: 6620-6630
Page count: 11
Related Papers
50 in total (entries [21]-[29] shown)
  • [21] CLIPose: Category-Level Object Pose Estimation With Pre-Trained Vision-Language Knowledge
    Lin, Xiao; Zhu, Minghao; Dang, Ronghao; Zhou, Guangliang; Shu, Shaolong; Lin, Feng; Liu, Chengju; Chen, Qijun
    IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(10): 9125-9138
  • [22] Constraint embedding for prompt tuning in vision-language pre-trained model
    Cheng, Keyang; Wei, Liutao; Tang, Jingfeng; Zhan, Yongzhao
    Multimedia Systems, 2025, 31(1)
  • [23] Open-Vocabulary Skeleton Action Recognition with Diffusion Graph Convolutional Network and Pre-Trained Vision-Language Models
    Wei, Chao; Deng, Zhidong
    2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), 2024: 3195-3199
  • [24] Knowledge Rumination for Pre-trained Language Models
    Yao, Yunzhi; Wang, Peng; Mao, Shengyu; Tan, Chuanqi; Huang, Fei; Chen, Huajun; Zhang, Ningyu
    2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), 2023: 3387-3404
  • [25] Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model
    Xing, Yinghui; Wu, Qirui; Cheng, De; Zhang, Shizhou; Liang, Guoqiang; Wang, Peng; Zhang, Yanning
    IEEE Transactions on Multimedia, 2024, 26: 2056-2068
  • [26] Knowledge Inheritance for Pre-trained Language Models
    Qin, Yujia; Lin, Yankai; Yi, Jing; Zhang, Jiajie; Han, Xu; Zhang, Zhengyan; Su, Yusheng; Liu, Zhiyuan; Li, Peng; Sun, Maosong; Zhou, Jie
    NAACL 2022: The 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022: 3921-3937
  • [27] Vision-Language Knowledge Exploration for Video Saliency Prediction
    Zhou, Fei; Huang, Baitao; Qiu, Guoping
    Pattern Recognition and Computer Vision, Pt IX (PRCV 2024), 2025, 15039: 191-205
  • [28] Cross-Modal Retrieval Algorithm for Image and Text Based on Pre-Trained Models and Encoders
    Chen, X.; Peng, J.; Zhang, P.; Luo, Z.; Ou, Z.
    Beijing Youdian Daxue Xuebao/Journal of Beijing University of Posts and Telecommunications, 2023, 46(05): 112-117
  • [29] ROME: Evaluating Pre-trained Vision-Language Models on Reasoning beyond Visual Common Sense
    Zhou, Kankan; Lai, Eason; Yeong, Wei Bin Au; Mouratidis, Kyriakos; Jiang, Jing
    Findings of the Association for Computational Linguistics (EMNLP 2023), 2023: 10185-10197