Vision-Language Pre-Training with Triple Contrastive Learning

被引:110
|
作者
Yang, Jinyu [1 ,2 ]
Duan, Jiali [2 ]
Tran, Son [2 ]
Xu, Yi [2 ]
Chanda, Sampath [2 ]
Chen, Liqun [2 ]
Zeng, Belinda [2 ]
Chilimbi, Trishul [2 ]
Huang, Junzhou [1 ]
机构
[1] Univ Texas Arlington, Arlington, TX 76019 USA
[2] Amazon, Seattle, WA USA
基金
美国国家科学基金会;
关键词
D O I
10.1109/CVPR52688.2022.01522
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Vision-language representation learning largely benefits from image-text alignment through contrastive losses (e.g., InfoNCE loss). The success of this alignment strategy is attributed to its capability in maximizing the mutual information (MI) between an image and its matched text. However; simply performing cross-modal alignment (CMA) ignores data potential within each modality, which may result in degraded representations. For instance, although CMA-based models are able to map image-text pairs close together in the embedding space, they fail to ensure that similar inputs from the same modality stay close by. This problem can get even worse when the pre-training data is noisy. In this paper, we propose triple contrastive learning (TCL) for vision-language pre-training by leveraging both cross-modal and intra-modal self-supervision. Besides CMA, TCL introduces an intra-modal contrastive objective to provide complementary benefits in representation learning. To take advantage of localized and structural information from image and text input, TCL further maximizes the average MI between local regions of image/text and their global summary. To the best of our knowledge, ours is the first work that takes into account local structure information for multi-modality representation learning. Experimental evaluations show that our approach is competitive and achieves the new state of the art on various common down-stream vision-language tasks such as image-text retrieval and visual question answering.
引用
收藏
页码:15650 / 15659
页数:10
相关论文
共 50 条
  • [21] Towards Adversarial Attack on Vision-Language Pre-training Models
    Zhang, Jiaming
    Yi, Qi
    Sang, Jitao
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 5005 - 5013
  • [22] MAFA: Managing False Negatives for Vision-Language Pre-training
    Byun, Jaeseok
    Kim, Dohoon
    Moon, Taesup
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 27304 - 27314
  • [23] Unsupervised Domain Adaption Harnessing Vision-Language Pre-Training
    Zhou, Wenlve
    Zhou, Zhiheng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (09) : 8201 - 8214
  • [24] Multimodal Pre-training Method for Vision-language Understanding and Generation
    Liu T.-Y.
    Wu Z.-X.
    Chen J.-J.
    Jiang Y.-G.
    Ruan Jian Xue Bao/Journal of Software, 2023, 34 (05): : 2024 - 2034
  • [25] Unified Vision-Language Pre-Training for Image Captioning and VQA
    Zhou, Luowei
    Palangi, Hamid
    Zhang, Lei
    Hu, Houdong
    Corso, Jason J.
    Gao, Jianfeng
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 13041 - 13049
  • [26] Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning
    Ji, Yatai
    Tu, Rongcheng
    Jiang, Jie
    Kong, Weijie
    Cai, Chengfei
    Zhao, Wenzhe
    Wang, Hongfa
    Yang, Yujiu
    Liu, Wei
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6789 - 6798
  • [27] Noise-Robust Vision-Language Pre-Training With Positive-Negative Learning
    Huang, Zhenyu
    Yang, Mouxing
    Xiao, Xinyan
    Hu, Peng
    Peng, Xi
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2025, 47 (01) : 338 - 350
  • [28] Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation
    Jiang, Chaoya
    Ye, Wei
    Xu, Haiyang
    Huang, Songfang
    Huang, Fei
    Zhang, Shikun
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 14660 - 14679
  • [29] Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
    Dou, Zi-Yi
    Kamath, Aishwarya
    Gan, Zhe
    Zhang, Pengchuan
    Wang, Jianfeng
    Li, Linjie
    Liu, Zicheng
    Liu, Ce
    LeCun, Yann
    Peng, Nanyun
    Gao, Jianfeng
    Wang, Lijuan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [30] Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends
    Gan, Zhe
    Li, Linjie
    Li, Chunyuan
    Wang, Lijuan
    Liu, Zicheng
    Gao, Jianfeng
    FOUNDATIONS AND TRENDS IN COMPUTER GRAPHICS AND VISION, 2022, 14 (3-4): : 163 - 352