Fine-tuned CLIP Models are Efficient Video Learners

Cited by: 51
Authors:
Rasheed, Hanoona [1]
Khattak, Muhammad Uzair [1]
Maaz, Muhammad [1]
Khan, Salman [1,2]
Khan, Fahad Shahbaz [1,3]
Affiliations:
[1] Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
[2] Australian National University, Canberra, ACT, Australia
[3] Linköping University, Linköping, Sweden
Keywords:
DOI: 10.1109/CVPR52729.2023.00633
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract:
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model. Since training at a similar scale for videos is infeasible, recent approaches focus on effectively transferring image-based CLIP to the video domain. In this pursuit, new parametric modules are added to learn temporal information and inter-frame relationships, which requires meticulous design effort. Furthermore, when the resulting models are trained on videos, they tend to overfit to the given task distribution and lose generalization. This begs the question: how can image-level CLIP representations be transferred to videos effectively? In this work, we show that a simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos. Our qualitative analysis illustrates that frame-level processing by the CLIP image encoder, followed by feature pooling and similarity matching with the corresponding text embeddings, helps ViFi-CLIP implicitly model temporal cues. Such fine-tuning helps the model focus on scene dynamics, moving objects and inter-object relationships. For low-data regimes where full fine-tuning is not viable, we propose a 'bridge and prompt' approach that first uses fine-tuning to bridge the domain gap and then learns prompts on the language and vision sides to adapt the CLIP representations. We extensively evaluate this simple yet strong baseline on zero-shot, base-to-novel generalization, few-shot and fully supervised settings across five video benchmarks. Our code and pre-trained models are available at https://github.com/muzairkhattak/ViFi-CLIP.
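The pipeline described in the abstract (frame-level CLIP encoding, temporal feature pooling, and similarity matching against text embeddings) can be sketched in a few lines. The snippet below is a minimal illustration assuming the OpenAI `clip` package and simple average pooling; the function name, prompt template and pooling choice are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of the ViFi-CLIP idea: encode each frame independently with
# the CLIP image encoder, average-pool the frame features into a video feature,
# and match it against CLIP text embeddings of the class prompts.
# Assumes the OpenAI `clip` package; details here are illustrative only.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

def video_text_logits(frames, class_names):
    """frames: (B, T, 3, 224, 224) preprocessed video clips."""
    B, T = frames.shape[:2]
    # Frame-level processing: fold time into the batch dimension.
    frame_feats = model.encode_image(frames.flatten(0, 1).to(device))   # (B*T, D)
    frame_feats = frame_feats / frame_feats.norm(dim=-1, keepdim=True)
    # Temporal pooling: average frame embeddings into one video embedding.
    video_feats = frame_feats.view(B, T, -1).mean(dim=1)                # (B, D)
    video_feats = video_feats / video_feats.norm(dim=-1, keepdim=True)
    # Text embeddings for each class prompt (hypothetical prompt template).
    tokens = clip.tokenize([f"a video of a person {c}" for c in class_names]).to(device)
    text_feats = model.encode_text(tokens)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    # Cosine-similarity logits; fine-tuning would apply cross-entropy on these.
    return model.logit_scale.exp() * video_feats @ text_feats.t()
```

Because the temporal pooling is parameter-free, fine-tuning updates only the original CLIP encoders, which is the property the paper credits for preserving generalization.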
Pages: 6545-6554
Page count: 10