iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-training for Visual Recognition

Times cited: 5
Authors
Wei, Yixuan [1, 2]
Cao, Yue [2]
Zhang, Zheng [2]
Peng, Houwen [2]
Yao, Zhuliang [1, 2]
Xie, Zhenda [1, 2]
Hu, Han [1, 2]
Guo, Baining [1, 2]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
DOI
10.1109/CVPR52729.2023.00272
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper presents a method, dubbed iCLIP, that effectively combines two prevalent visual recognition approaches: image classification and contrastive language-image pre-training. Instead of naive multi-task learning, which uses a separate head for each task, we fuse the two tasks deeply by adapting image classification to share the same formula and the same model weights as language-image pre-training. To further bridge the two tasks, we propose enhancing the category names in image classification with external knowledge, such as their dictionary descriptions. Extensive experiments show that the proposed method combines the advantages of both tasks: the strong discrimination ability of image classification, owing to its clean category labels, and the good zero-shot ability of CLIP, owing to the richer semantics of its text descriptions. In particular, it reaches 82.9% top-1 accuracy on ImageNet-1K (IN-1K) while surpassing CLIP by 1.8%, with a similar model size, in zero-shot recognition on the Kornblith 12-dataset benchmark. The code and models are publicly available at https://github.com/weiyx16/iCLIP.
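The abstract's central idea, that image classification can share CLIP's contrastive formula, can be illustrated with a minimal sketch. This is not the authors' released code. It assumes that class-name text embeddings play the role of a linear classifier's weights, that logits are cosine similarities divided by a temperature, and that cross-entropy is applied over those logits. The prompt template in `build_class_prompt`, the fixed temperature of 0.07 (CLIP's initial value, assumed here), and the toy 2-D embeddings are all hypothetical; the actual method uses learned image and text encoders and may include further terms such as CLIP's symmetric text-to-image loss.

```python
import math
from typing import List, Optional, Sequence

Vector = List[float]


def build_class_prompt(name: str, description: Optional[str] = None) -> str:
    """Hypothetical template appending a dictionary-style description to a class name."""
    if description:
        return f"a photo of a {name}, which is {description}"
    return f"a photo of a {name}"


def l2_normalize(v: Sequence[float]) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def similarity_logits(image_embs: Sequence[Sequence[float]],
                      text_embs: Sequence[Sequence[float]],
                      temperature: float) -> List[Vector]:
    """Cosine similarity between every image and every text, scaled by 1/temperature."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(t) for t in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def unified_loss(image_embs: Sequence[Sequence[float]],
                 text_embs: Sequence[Sequence[float]],
                 targets: Sequence[int],
                 temperature: float = 0.07) -> float:
    """Mean image-to-text cross-entropy shared by both tasks (assumed formulation).

    CLIP-style batch: text_embs are captions and targets[i] == i.
    Classification batch: text_embs are encoded class prompts and targets[i] is the label.
    """
    logits = similarity_logits(image_embs, text_embs, temperature)
    return sum(cross_entropy(row, t) for row, t in zip(logits, targets)) / len(targets)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for encoder outputs (illustrative only).
    images = [[1.0, 0.1], [0.1, 1.0]]
    class_texts = [[1.0, 0.0], [0.0, 1.0]]
    print(build_class_prompt("tench", "a freshwater fish"))
    print(unified_loss(images, class_texts, targets=[0, 1]))  # small: correct labels
    print(unified_loss(images, class_texts, targets=[1, 0]))  # large: swapped labels
```

In this sketch only the targets differ between the two tasks: for image-caption pairs, `targets[i] == i`; for classification, `targets[i]` is the class index and `text_embs` holds one embedding per (description-enhanced) class name. Only the image-to-text direction is implemented, because several images in a classification batch can share the same class.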
Pages: 2776 - 2786
Number of pages: 11
Related papers (50 total)
  • [1] Contrastive Language-Image Pre-Training with Knowledge Graphs
    Pan, Xuran
    Ye, Tianzhu
    Han, Dongchen
    Song, Shiji
    Huang, Gao
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [2] A closer look at the explainability of Contrastive language-image pre-training
    Li, Yi
    Wang, Hualiang
    Duan, Yiqun
    Zhang, Jiheng
    Li, Xiaomeng
    PATTERN RECOGNITION, 2025, 162
  • [3] UniCLIP: Unified Framework for Contrastive Language-Image Pre-training
    Lee, Janghyeon
    Kim, Jongsuk
    Shon, Hyounguk
    Kim, Bumsoo
    Kim, Seung Hwan
    Lee, Honglak
    Kim, Junmo
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [4] Grounded Language-Image Pre-training
    Li, Liunian Harold
    Zhang, Pengchuan
    Zhang, Haotian
    Yang, Jianwei
    Li, Chunyuan
    Zhong, Yiwu
    Wang, Lijuan
    Yuan, Lu
    Zhang, Lei
    Hwang, Jenq-Neng
    Chang, Kai-Wei
    Gao, Jianfeng
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 10955 - 10965
  • [5] Using contrastive language-image pre-training for Thai recipe recommendation
    Chuenbanluesuk, Thanatkorn
    Plodprong, Voramate
    Karoon, Weerasak
    Rueangsri, Kotchakorn
    Pojam, Suthasinee
    Siriborvornratanakul, Thitirat
    LANGUAGE RESOURCES AND EVALUATION, 2025
  • [6] Non-Contrastive Learning Meets Language-Image Pre-Training
    Zhou, Jinghao
    Dong, Li
    Gan, Zhe
    Wang, Lijuan
    Wei, Furu
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 11028 - 11038
  • [7] Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training
    You, Haoxuan
    Zhou, Luowei
    Xiao, Bin
    Codella, Noel
    Cheng, Yu
    Xu, Ruochen
    Chang, Shih-Fu
    Yuan, Lu
    COMPUTER VISION - ECCV 2022, PT XXVII, 2022, 13687 : 69 - 87
  • [8] RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-training
    Xie, Chen-Wei
    Sun, Siyang
    Xiong, Xiong
    Zheng, Yun
    Zhao, Deli
    Zhou, Jingren
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 19265 - 19274
  • [9] Construction safety inspection with contrastive language-image pre-training (CLIP) image captioning and attention
    Tsai, Wei-Lun
    Le, Phuong-Linh
    Ho, Wang-Fat
    Chi, Nai-Wen
    Lin, Jacob J.
    Tang, Shuai
    Hsieh, Shang-Hsien
    AUTOMATION IN CONSTRUCTION, 2025, 169
  • [10] Centered Masking for Language-Image Pre-training
    Liang, Mingliang
    Larson, Martha
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES-RESEARCH TRACK AND DEMO TRACK, PT VIII, ECML PKDD 2024, 2024, 14948 : 90 - 106