Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

Cited by: 27
Authors
Wu, Wenhao [1 ,2 ]
Wang, Xiaohan [3 ]
Luo, Haipeng [4 ]
Wang, Jingdong [2 ]
Yang, Yi [3 ]
Ouyang, Wanli [1 ,5 ]
Affiliations
[1] Univ Sydney, Camperdown, NSW, Australia
[2] Baidu Inc, Beijing, Peoples R China
[3] Zhejiang Univ, Hangzhou, Peoples R China
[4] Univ Chinese Acad Sci, Beijing, Peoples R China
[5] Shanghai AI Lab, Shanghai, Peoples R China
DOI
10.1109/CVPR52729.2023.00640
CLC classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability on various visual tasks. Transferring knowledge from such powerful VLMs is a promising direction for building effective video recognition models. However, current exploration in this field is still limited. We believe that the greatest value of pre-trained VLMs lies in building a bridge between visual and textual domains. In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition. ii) We also present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner, leading to enhanced video representation. Extensive studies on six popular video datasets, including Kinetics-400 & 600, UCF-101, HMDB-51, ActivityNet and Charades, show that our method achieves state-of-the-art performance in various recognition scenarios, such as general, zero-shot, and few-shot video recognition. Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model. The code is available at https://github.com/whwu95/BIKE.
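To make the Temporal Concept Spotting mechanism described in the abstract more concrete, below is a minimal, hedged sketch of parameter-free text-to-video temporal saliency: per-frame CLIP-style embeddings are compared against a category text embedding, and a softmax over time turns the similarities into pooling weights for the video representation. The function name temporal_concept_spotting, the temperature value, and the random stand-in embeddings are illustrative assumptions rather than the paper's exact formulation; the authors' implementation is available at https://github.com/whwu95/BIKE.

import torch
import torch.nn.functional as F

def temporal_concept_spotting(frame_feats: torch.Tensor,
                              text_feat: torch.Tensor,
                              temperature: float = 0.01) -> torch.Tensor:
    """Aggregate per-frame features into a video embedding, weighting
    frames by their similarity to a category text embedding.

    frame_feats: (T, D) L2-normalized frame embeddings (e.g., from a
                 frozen CLIP image encoder).
    text_feat:   (D,)   L2-normalized category text embedding.
    Returns a (D,) video embedding. Note there are no learnable
    parameters, matching the "parameter-free" claim in the abstract.
    """
    # Frame-level text-to-video similarity: temporal saliency logits.
    sims = frame_feats @ text_feat                   # (T,)
    # Softmax over time converts similarities into saliency weights;
    # the temperature (an assumption here) controls how peaked they are.
    weights = F.softmax(sims / temperature, dim=0)   # (T,)
    # Saliency-weighted temporal pooling of the frame features.
    return weights @ frame_feats                     # (D,)

# Toy usage with random stand-ins for CLIP embeddings.
T, D = 8, 512
frames = F.normalize(torch.randn(T, D), dim=-1)
text = F.normalize(torch.randn(D), dim=-1)
video_emb = temporal_concept_spotting(frames, text)
print(video_emb.shape)  # torch.Size([512])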
Pages: 6620-6630
Page count: 11
Related Papers
50 records in total
  • [31] VLCDoC: Vision-Language contrastive pre-training model for cross-Modal document classification
    Bakkali, Souhail
    Ming, Zuheng
    Coustaty, Mickael
    Rusinol, Marcal
    Ramos Terrades, Oriol
    PATTERN RECOGNITION, 2023, 139
  • [32] CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training
    Ma, Zhiyuan
    Li, Jianjun
    Li, Guohui
    Huang, Kaiyan
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4515 - 4524
  • [33] Continuous Object State Recognition for Cooking Robots Using Pre-Trained Vision-Language Models and Black-Box Optimization
    Kawaharazuka, Kento
    Kanazawa, Naoaki
    Obinata, Yoshiki
    Okada, Kei
    Inaba, Masayuki
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2024, 9 (05) : 4059 - 4066
  • [34] Learning From Expert: Vision-Language Knowledge Distillation for Unsupervised Cross-Modal Hashing Retrieval
    Sun, Lina
    Li, Yewen
    Dong, Yumin
    PROCEEDINGS OF THE 2023 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2023, 2023, : 499 - 507
  • [35] Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching
    Kim, Byoungjip
    Choi, Sungik
    Hwang, Dasol
    Lee, Moontae
    Lee, Honglak
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [36] Probing Pre-Trained Language Models for Disease Knowledge
    Alghanmi, Israa
    Espinosa-Anke, Luis
    Schockaert, Steven
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 3023 - 3033
  • [37] Dynamic Knowledge Distillation for Pre-trained Language Models
    Li, Lei
    Lin, Yankai
    Ren, Shuhuai
    Li, Peng
    Zhou, Jie
    Sun, Xu
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 379 - 389
  • [38] A Survey of Knowledge Enhanced Pre-Trained Language Models
    Hu, Linmei
    Liu, Zeyi
    Zhao, Ziwang
    Hou, Lei
    Nie, Liqiang
    Li, Juanzi
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36 (04) : 1413 - 1430
  • [39] Commonsense Knowledge Transfer for Pre-trained Language Models
    Zhou, Wangchunshu
    Le Bras, Ronan
    Choi, Yejin
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 5946 - 5960
  • [40] Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment
    Pandey, Rohan
    Shao, Rulin
    Liang, Paul Pu
    Salakhutdinov, Ruslan
    Morency, Louis-Philippe
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 5444 - 5455