Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision

被引:1
|
作者
Wang, Tzu-Jui Julius [1 ]
Laaksonen, Jorma [1 ]
Langer, Tomas [2 ]
Arponen, Heikki [2 ,3 ]
Bishop, Tom E. [2 ,4 ]
机构
[1] Aalto Univ, Espoo, Finland
[2] Intuit Machines Inc, San Francisco, CA USA
[3] Systemat Alpha, Sunny Isles Beach, FL USA
[4] Glass Imaging, Los Altos, CA USA
基金
芬兰科学院;
关键词
SEGMENTATION;
D O I
10.1109/WACV56688.2023.00113
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Weakly-supervised vision-language (V-L) pre-training (W-VLP) aims at learning cross-modal alignment with little or no paired data, such as aligned images and captions. Recent W-VLP methods, which pair visual features with object tags, help achieve performances comparable with some VLP models trained with aligned pairs in various V-L downstream tasks. This, however, is not the case in cross-modal retrieval (XMR). We argue that the learning of such a W-VLP model is curbed and biased by the object tags of limited semantics. We address the lack of paired V-L data for model supervision with a novel Visual Vocabulary based Feature Hallucinator (WFH), which is trained via weak supervision as a W-VLP model, not requiring images paired with captions. WFH generates visual hallucinations from texts, which are then paired with the originally unpaired texts, allowing more diverse interactions across modalities. Empirically, WFH consistently boosts the prior W-VLP works, e.g. U-VisualBERT (U-VB), over a variety of V-L tasks, i.e. XMR, Visual Question Answering, etc. Notably, benchmarked with recall@{1,5,10}, it consistently improves U-VB on image-to-text and text-to-image retrieval on two popular datasets Flickr30K and MSCOCO. Meanwhile, it gains by at least 14.5% in cross-dataset generalization tests on these XMR tasks. Moreover, in other V-L downstream tasks considered, our WFH models are on par with models trained with paired V-L data, revealing the utility of unpaired data. These results demonstrate greater generalization of the proposed W-VLP model with WFH.
引用
收藏
页码:1073 / 1083
页数:11
相关论文
共 50 条
  • [41] VLMO: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
    Bao, Hangbo
    Wang, Wenhui
    Dong, Li
    Liu, Qiang
    Mohammed, Owais Khan
    Aggarwal, Kriti
    Som, Subhojit
    Piao, Songhao
    Wei, Furu
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [42] Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner
    Liu, Zikang
    Chen, Sihan
    Guo, Longteng
    Li, Handong
    He, Xingjian
    Liu, Jing
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 5120 - 5131
  • [43] Efficient Medical Images Text Detection with Vision-Language Pre-training Approach
    Li, Tianyang
    Bai, Jinxu
    Wang, Qingzhu
    Xu, Hanwen
    ASIAN CONFERENCE ON MACHINE LEARNING, VOL 222, 2023, 222
  • [44] MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
    Ji, Yatai
    Wang, Junjie
    Gong, Yuan
    Zhang, Lin
    Zhu, Yanru
    Wang, Hongfa
    Zhang, Jiaxing
    Sakai, Tetsuya
    Yang, Yujiu
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23262 - 23271
  • [45] Automated Bridge Inspection Image Interpretation Based on Vision-Language Pre-Training
    Wang, Shengyi
    El-Gohary, Nora
    COMPUTING IN CIVIL ENGINEERING 2023-DATA, SENSING, AND ANALYTICS, 2024, : 1 - 8
  • [46] Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner
    Liu, Zikang
    Chen, Sihan
    Guo, Longteng
    Li, Handong
    He, Xingjian
    Liu, Jing
    arXiv, 2023,
  • [47] Leveraging per Image-Token Consistency for Vision-Language Pre-training
    Gou, Yunhao
    Ko, Tom
    Yang, Hansi
    Kwok, James
    Zhang, Yu
    Wang, Mingxuan
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 19155 - 19164
  • [48] Enhancing medical text detection with vision-language pre-training and efficient segmentation
    Li, Tianyang
    Bai, Jinxu
    Wang, Qingzhu
    COMPLEX & INTELLIGENT SYSTEMS, 2024, 10 (03) : 3995 - 4007
  • [49] GilBERT: Generative Vision-Language Pre-Training for Image-Text Retrieval
    Hong, Weixiang
    Ji, Kaixiang
    Liu, Jiajia
    Wang, Jian
    Chen, Jingdong
    Chu, Wei
    SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 1379 - 1388
  • [50] Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner
    Liu, Zikang
    Chen, Sihan
    Guo, Longteng
    Li, Handong
    He, Xingjian
    Liu, Jing
    MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia, 2023, : 5120 - 5131