Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner

Citations: 0
Authors
Liu, Zikang [1 ]
Chen, Sihan [1 ]
Guo, Longteng [1 ]
Li, Handong [1 ]
He, Xingjian [1 ]
Liu, Jing [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Vision-Language Pre-Training; Pre-Training Data Generation;
DOI
10.1145/3581783.3612388
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Large pre-trained multimodal models have demonstrated significant success on a range of downstream tasks, including image captioning, image-text retrieval, and visual question answering (VQA). However, many of these methods rely on image-text pairs crawled from the web as pre-training data and overlook the need for fine-grained feature alignment between the vision and language modalities, which requires a detailed understanding of both images and language expressions. While integrating VQA and dense captioning (DC) into pre-training can address this issue, acquiring image-question-answer and image-location-caption triplets is challenging and time-consuming. Additionally, publicly available VQA and dense captioning datasets are typically limited in scale due to the manual data collection and labeling they require. In this paper, we propose a novel method called Joint QA and DC GEneration (JADE), which utilizes a pre-trained multimodal model and easily crawled image-text pairs to automatically generate and filter large-scale VQA and dense captioning datasets. We apply this method to the Conceptual Captions (CC3M) dataset to generate a new dataset called CC3M-QA-DC. Experiments show that, when used for pre-training in a multi-task manner, CC3M-QA-DC improves performance with various backbones on various downstream tasks. Furthermore, our generated CC3M-QA-DC can be combined with larger image-text datasets (e.g., CC15M) and achieves competitive results compared with models using much more data. Code and dataset are available at https://github.com/johncaged/OPT_Questioner.
Pages: 5120-5131
Page count: 12
Related Papers
50 records in total
  • [41] Efficient Medical Images Text Detection with Vision-Language Pre-training Approach
    Li, Tianyang
    Bai, Jinxu
    Wang, Qingzhu
    Xu, Hanwen
    ASIAN CONFERENCE ON MACHINE LEARNING, VOL 222, 2023, 222
  • [42] MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
    Ji, Yatai
    Wang, Junjie
    Gong, Yuan
    Zhang, Lin
    Zhu, Yanru
    Wang, Hongfa
    Zhang, Jiaxing
    Sakai, Tetsuya
    Yang, Yujiu
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23262 - 23271
  • [43] Automated Bridge Inspection Image Interpretation Based on Vision-Language Pre-Training
    Wang, Shengyi
    El-Gohary, Nora
    COMPUTING IN CIVIL ENGINEERING 2023-DATA, SENSING, AND ANALYTICS, 2024, : 1 - 8
  • [44] Leveraging per Image-Token Consistency for Vision-Language Pre-training
    Gou, Yunhao
    Ko, Tom
    Yang, Hansi
    Kwok, James
    Zhang, Yu
    Wang, Mingxuan
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 19155 - 19164
  • [45] GilBERT: Generative Vision-Language Pre-Training for Image-Text Retrieval
    Hong, Weixiang
    Ji, Kaixiang
    Liu, Jiajia
    Wang, Jian
    Chen, Jingdong
    Chu, Wei
    SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 1379 - 1388
  • [46] Multimodal detection of hateful memes by applying a vision-language pre-training model
    Chen, Yuyang
    Pan, Feng
    PLOS ONE, 2022, 17 (09):
  • [47] Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training
    Dai, Wenliang
    Liu, Zihan
    Ji, Ziwei
    Su, Dan
    Fung, Pascale
    17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023, : 2136 - 2148
  • [48] Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval
    Yao, Tao
    Peng, Shouyong
    Wang, Lili
    Li, Ying
    Sun, Yujuan
    APPLIED INTELLIGENCE, 2024, 54 (23) : 12230 - 12245
  • [49] GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods
    Yin, Da
    Gao, Feng
    Thattai, Govind
    Johnston, Michael
    Chang, Kai-Wei
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 10951 - 10961
  • [50] Multimodal alignment augmentation transferable attack on vision-language pre-training models
    Fu, Tingchao
    Zhang, Jinhong
    Li, Fanxiao
    Wei, Ping
    Zeng, Xianglong
    Zhou, Wei
    PATTERN RECOGNITION LETTERS, 2025, 191 : 131 - 137