Vision-Language Pre-Training with Triple Contrastive Learning

Cited by: 110

Authors
Yang, Jinyu [1,2]
Duan, Jiali [2 ]
Tran, Son [2 ]
Xu, Yi [2 ]
Chanda, Sampath [2 ]
Chen, Liqun [2 ]
Zeng, Belinda [2 ]
Chilimbi, Trishul [2 ]
Huang, Junzhou [1 ]
Affiliations
[1] Univ Texas Arlington, Arlington, TX 76019 USA
[2] Amazon, Seattle, WA USA
Funding
U.S. National Science Foundation
DOI
10.1109/CVPR52688.2022.01522
CLC Classification
TP18 (Artificial Intelligence Theory)
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Vision-language representation learning largely benefits from image-text alignment through contrastive losses (e.g., the InfoNCE loss). The success of this alignment strategy is attributed to its capability of maximizing the mutual information (MI) between an image and its matched text. However, simply performing cross-modal alignment (CMA) ignores the data potential within each modality, which may result in degraded representations. For instance, although CMA-based models are able to map image-text pairs close together in the embedding space, they fail to ensure that similar inputs from the same modality stay close by. This problem can get even worse when the pre-training data is noisy. In this paper, we propose triple contrastive learning (TCL) for vision-language pre-training by leveraging both cross-modal and intra-modal self-supervision. Besides CMA, TCL introduces an intra-modal contrastive objective to provide complementary benefits in representation learning. To take advantage of localized and structural information from image and text input, TCL further maximizes the average MI between local regions of image/text and their global summary. To the best of our knowledge, ours is the first work that takes into account local structure information for multi-modality representation learning. Experimental evaluations show that our approach is competitive and achieves the new state of the art on various common downstream vision-language tasks such as image-text retrieval and visual question answering.
Pages: 15650-15659 (10 pages)
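
For readers who want a concrete picture of the objective the abstract describes, below is a minimal sketch of the cross-modal alignment (CMA) and intra-modal contrastive (IMC) terms built on the InfoNCE loss. It is an illustration only, not the paper's released code: the function and encoder names, the use of augmented views as intra-modal positives, and the temperature tau=0.07 are assumptions, and the local-MI maximization term is only noted in a comment.

```python
# Hypothetical PyTorch sketch of the contrastive objectives described in
# the abstract. Encoder names, augmented views, and the temperature are
# assumptions, not the authors' implementation.
import torch
import torch.nn.functional as F

def info_nce(q, k, tau=0.07):
    # InfoNCE: q[i] should match k[i] against every other k[j] in the batch.
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = q @ k.t() / tau                            # (B, B) similarities
    targets = torch.arange(q.size(0), device=q.device)  # positives on diagonal
    return F.cross_entropy(logits, targets)

def tcl_loss(img, img_aug, txt, txt_aug, image_encoder, text_encoder):
    # Global embeddings of the original inputs and their augmented views.
    v, v_aug = image_encoder(img), image_encoder(img_aug)
    t, t_aug = text_encoder(txt), text_encoder(txt_aug)
    # Cross-modal alignment (CMA): pull matched image-text pairs together.
    cma = 0.5 * (info_nce(v, t) + info_nce(t, v))
    # Intra-modal contrast (IMC): keep two views of the same input close,
    # so similar inputs from one modality stay nearby in embedding space.
    imc = 0.5 * (info_nce(v, v_aug) + info_nce(t, t_aug))
    # A third term would maximize the average MI between local patch/token
    # features and the global summary (omitted here for brevity).
    return cma + imc
```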