Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner

Citations: 0
Authors
Liu, Zikang [1 ]
Chen, Sihan [1 ]
Guo, Longteng [1 ]
Li, Handong [1 ]
He, Xingjian [1 ]
Liu, Jing [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Vision-Language Pre-Training; Pre-Training Data Generation;
DOI
10.1145/3581783.3612388
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Large pre-trained multimodal models have demonstrated significant success on a range of downstream tasks, including image captioning, image-text retrieval, and visual question answering (VQA). However, many of these methods rely on image-text pairs crawled from the web as pre-training data and overlook the need for fine-grained feature alignment between the vision and language modalities, which requires a detailed understanding of both images and language expressions. While integrating VQA and dense captioning (DC) into pre-training can address this issue, acquiring image-question-answer and image-location-caption triplets is challenging and time-consuming. Additionally, publicly available VQA and dense captioning datasets are typically limited in scale due to the manual data collection and labeling they require. In this paper, we propose a novel method called Joint QA and DC GEneration (JADE), which utilizes a pre-trained multimodal model and easily crawled image-text pairs to automatically generate and filter large-scale VQA and dense captioning datasets. We apply this method to the Conceptual Captions (CC3M) dataset to generate a new dataset called CC3M-QA-DC. Experiments show that, when used for pre-training in a multi-task manner, CC3M-QA-DC improves performance with various backbones on various downstream tasks. Furthermore, our generated CC3M-QA-DC can be combined with larger image-text datasets (e.g., CC15M) and achieves competitive results compared with models using much more data. Code and dataset are available at https://github.com/johncaged/OPT_Questioner.
Pages: 5120-5131
Page count: 12
Related Papers
50 records in total
  • [41] Efficient Medical Images Text Detection with Vision-Language Pre-training Approach
    Li, Tianyang
    Bai, Jinxu
    Wang, Qingzhu
    Xu, Hanwen
    ASIAN CONFERENCE ON MACHINE LEARNING, VOL 222, 2023, 222
  • [42] MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
    Ji, Yatai
    Wang, Junjie
    Gong, Yuan
    Zhang, Lin
    Zhu, Yanru
    Wang, Hongfa
    Zhang, Jiaxing
    Sakai, Tetsuya
    Yang, Yujiu
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23262 - 23271
  • [43] Automated Bridge Inspection Image Interpretation Based on Vision-Language Pre-Training
    Wang, Shengyi
    El-Gohary, Nora
    COMPUTING IN CIVIL ENGINEERING 2023-DATA, SENSING, AND ANALYTICS, 2024, : 1 - 8
  • [44] Leveraging per Image-Token Consistency for Vision-Language Pre-training
    Gou, Yunhao
    Ko, Tom
    Yang, Hansi
    Kwok, James
    Zhang, Yu
    Wang, Mingxuan
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 19155 - 19164
  • [45] GilBERT: Generative Vision-Language Pre-Training for Image-Text Retrieval
    Hong, Weixiang
    Ji, Kaixiang
    Liu, Jiajia
    Wang, Jian
    Chen, Jingdong
    Chu, Wei
    SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 1379 - 1388
  • [46] Multimodal detection of hateful memes by applying a vision-language pre-training model
    Chen, Yuyang
    Pan, Feng
    PLOS ONE, 2022, 17 (09):
  • [47] Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training
    Dai, Wenliang
    Liu, Zihan
    Ji, Ziwei
    Su, Dan
    Fung, Pascale
    17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023, : 2136 - 2148
  • [48] Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval
    Yao, Tao
    Peng, Shouyong
    Wang, Lili
    Li, Ying
    Sun, Yujuan
    APPLIED INTELLIGENCE, 2024, 54 (23) : 12230 - 12245
  • [49] GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods
    Yin, Da
    Gao, Feng
    Thattai, Govind
    Johnston, Michael
    Chang, Kai-Wei
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 10951 - 10961
  • [50] Multimodal alignment augmentation transferable attack on vision-language pre-training models
    Fu, Tingchao
    Zhang, Jinhong
    Li, Fanxiao
    Wei, Ping
    Zeng, Xianglong
    Zhou, Wei
    PATTERN RECOGNITION LETTERS, 2025, 191 : 131 - 137