Subsampling of Frequent Words in Text for Pre-training a Vision-Language Model

Cited by: 0
Authors
Liang, Mingliang [1]
Larson, Martha [1]
Affiliations
[1] Radboud Univ Nijmegen, Nijmegen, Netherlands
Keywords
Vision-language model; subsampling; frequent words; zero-shot image classification
DOI
10.1145/3607827.3616843
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we introduce Subsampling of frequent Words for Contrastive Language-Image Pre-training (SW-CLIP), a novel approach for training Vision-Language Models (VLMs). SW-CLIP takes the frequency-based subsampling of words that was previously proposed for training skip-gram models in natural language processing and applies it to the textual training data of VLMs. We report on experiments that demonstrate the ability of frequency-based subsampling to speed up training and also to deliver a substantial improvement in accuracy on a number of downstream zero-shot (i.e., transfer) classification tasks. We observe that the classification test sets on which SW-CLIP seems to be particularly effective are those in which the class labels occur infrequently as words in the training data, and thus have a high probability of being retained during frequency-based subsampling of the model training data. Overall, the advantages of SW-CLIP demonstrated in this paper serve to motivate further work on text subsampling for the training of VLMs. Our code and pre-trained weights are available at https://github.com/Anastasiais-ml/sw_clip.git
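The frequency-based subsampling that the abstract refers to can be illustrated with a minimal Python sketch. It assumes the rule commonly used for skip-gram training, in which a word with relative frequency f(w) is discarded with probability 1 - sqrt(t / f(w)). The threshold value, the whitespace tokenization, and the function names `keep_probabilities` and `subsample_caption` are illustrative assumptions and are not taken from the SW-CLIP code.

```python
import math
import random
from collections import Counter


def keep_probabilities(captions, threshold=1e-3):
    """Map each word to its probability of being kept.

    Assumes the skip-gram subsampling rule: discard probability is
    1 - sqrt(threshold / f(w)), so keep probability is sqrt(threshold / f(w)),
    capped at 1. The default threshold is an illustrative assumption.
    """
    # Whitespace tokenization is a simplification for this sketch.
    counts = Counter(word for caption in captions for word in caption.split())
    total = sum(counts.values())
    probs = {}
    for word, count in counts.items():
        freq = count / total  # relative frequency f(w) in the caption corpus
        probs[word] = min(1.0, math.sqrt(threshold / freq))
    return probs


def subsample_caption(caption, probs, rng):
    """Drop each word independently according to its keep probability."""
    # Words not seen when the probabilities were computed are always kept.
    kept = [w for w in caption.split() if rng.random() < probs.get(w, 1.0)]
    return " ".join(kept)


# Toy corpus: "a" is frequent, the class names are rare.
captions = ["a photo of a dog", "a photo of a cat", "a photo of a zebra"]
probs = keep_probabilities(captions, threshold=0.1)
rng = random.Random(0)
shortened = [subsample_caption(c, probs, rng) for c in captions]
```

In this toy corpus the rare class names ("dog", "cat", "zebra") have a keep probability of 1, while the frequent word "a" is kept only about half the time, which mirrors the abstract's observation that infrequent label words are likely to be retained. A real pipeline would also have to decide what to do with a caption whose words are all dropped; that choice is left open here.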
Pages: 61-67
Number of pages: 7