VLCDoC: Vision-Language contrastive pre-training model for cross-Modal document classification

Cited by: 16
Authors
Bakkali, Souhail [1 ]
Ming, Zuheng [1 ,2 ]
Coustaty, Mickael [1 ]
Rusinol, Marcal [3 ,4 ]
Ramos Terrades, Oriol [3 ]
Affiliations
[1] La Rochelle Univ, L3i, La Rochelle, France
[2] Univ Sorbonne Paris Nord, L2TI, Villetaneuse, France
[3] Univ Autonoma Barcelona, CVC, Barcelona, Spain
[4] AllRead MLT, Barcelona, Spain
Keywords
Multimodal document representation learning; Document classification; Contrastive learning; Self-Attention; Transformers
DOI
10.1016/j.patcog.2023.109419
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Multimodal learning from document data has achieved great success lately, as it allows semantically meaningful features to be pre-trained as a prior for a learnable downstream task. In this paper, we approach the document classification problem by learning cross-modal representations through language and vision cues, considering intra- and inter-modality relationships. Instead of merging features from different modalities into a joint representation space, the proposed method exploits high-level interactions and learns relevant semantic information from effective attention flows within and across modalities. The proposed learning objective is devised between intra- and inter-modality alignment tasks, where the similarity distribution per task is computed by contracting positive sample pairs while simultaneously contrasting negative ones in the joint representation space. Extensive experiments on public benchmark datasets demonstrate the effectiveness and the generality of our model on both low-scale and large-scale datasets. (c) 2023 Elsevier Ltd. All rights reserved.
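The intra- and inter-modality contrastive objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an InfoNCE-style formulation with temperature-scaled cosine similarities, and all names (vision_feats, language_feats, the augmented views, temperature=0.07) are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch (assumption: InfoNCE-style loss) of contrasting positive vs.
# negative pairs within (intra) and across (inter) modalities in a joint space.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """The i-th row of `positive` is the positive for the i-th anchor;
    all other rows in the batch act as negatives."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature  # (B, B) scaled cosine similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

def cross_modal_contrastive_loss(vision_feats, language_feats, vision_aug, language_aug):
    # Inter-modality alignment: vision and language features of the same document
    # are pulled together, mismatched documents pushed apart.
    inter = info_nce(vision_feats, language_feats) + info_nce(language_feats, vision_feats)
    # Intra-modality alignment: two views of the same modality (e.g. an augmented
    # page image, or a second text encoding) form the positive pair.
    intra = info_nce(vision_feats, vision_aug) + info_nce(language_feats, language_aug)
    return inter + intra

# Toy usage: a batch of 8 documents with 256-d projected features per modality.
if __name__ == "__main__":
    B, D = 8, 256
    v, t = torch.randn(B, D), torch.randn(B, D)
    v2, t2 = v + 0.1 * torch.randn(B, D), t + 0.1 * torch.randn(B, D)
    print(cross_modal_contrastive_loss(v, t, v2, t2).item())
```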
Pages: 11