Fine-tuned CLIP Models are Efficient Video Learners

Cited by: 51
Authors:
Rasheed, Hanoona [1]
Khattak, Muhammad Uzair [1]
Maaz, Muhammad [1]
Khan, Salman [1,2]
Khan, Fahad Shahbaz [1,3]
Affiliations:
[1] Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
[2] Australian National University, Canberra, ACT, Australia
[3] Linköping University, Linköping, Sweden
Keywords:
DOI: 10.1109/CVPR52729.2023.00633
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract:
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model. Since training at a similar scale for videos is infeasible, recent approaches focus on effectively transferring image-based CLIP to the video domain. In this pursuit, new parametric modules are added to learn temporal information and inter-frame relationships, which requires meticulous design effort. Furthermore, when the resulting models are trained on videos, they tend to overfit to the given task distribution and lose generalization. This begs the question: how can image-level CLIP representations be transferred to videos effectively? In this work, we show that a simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos. Our qualitative analysis illustrates that frame-level processing by the CLIP image encoder, followed by feature pooling and similarity matching with the corresponding text embeddings, helps ViFi-CLIP implicitly model temporal cues. Such fine-tuning helps the model focus on scene dynamics, moving objects and inter-object relationships. For low-data regimes where full fine-tuning is not viable, we propose a 'bridge and prompt' approach that first uses fine-tuning to bridge the domain gap and then learns prompts on the language and vision sides to adapt the CLIP representations. We extensively evaluate this simple yet strong baseline on zero-shot, base-to-novel generalization, few-shot and fully supervised settings across five video benchmarks. Our code and pre-trained models are available at https://github.com/muzairkhattak/ViFi-CLIP.
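The pipeline described in the abstract (frame-level CLIP encoding, temporal feature pooling, and similarity matching against text embeddings) can be sketched in a few lines. The snippet below is a minimal illustration assuming the OpenAI `clip` package and simple average pooling; the function name, prompt template and pooling choice are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of the ViFi-CLIP idea: encode each frame independently with
# the CLIP image encoder, average-pool the frame features into a video feature,
# and match it against CLIP text embeddings of the class prompts.
# Assumes the OpenAI `clip` package; details here are illustrative only.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

def video_text_logits(frames, class_names):
    """frames: (B, T, 3, 224, 224) preprocessed video clips."""
    B, T = frames.shape[:2]
    # Frame-level processing: fold time into the batch dimension.
    frame_feats = model.encode_image(frames.flatten(0, 1).to(device))   # (B*T, D)
    frame_feats = frame_feats / frame_feats.norm(dim=-1, keepdim=True)
    # Temporal pooling: average frame embeddings into one video embedding.
    video_feats = frame_feats.view(B, T, -1).mean(dim=1)                # (B, D)
    video_feats = video_feats / video_feats.norm(dim=-1, keepdim=True)
    # Text embeddings for each class prompt (hypothetical prompt template).
    tokens = clip.tokenize([f"a video of a person {c}" for c in class_names]).to(device)
    text_feats = model.encode_text(tokens)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    # Cosine-similarity logits; fine-tuning would apply cross-entropy on these.
    return model.logit_scale.exp() * video_feats @ text_feats.t()
```

Because the temporal pooling is parameter-free, fine-tuning updates only the original CLIP encoders, which is the property the paper credits for preserving generalization.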
Pages: 6545-6554
Page count: 10