VLCDoC: Vision-Language contrastive pre-training model for cross-Modal document classification

Cited by: 16
Authors
Bakkali, Souhail [1 ]
Ming, Zuheng [1 ,2 ]
Coustaty, Mickael [1 ]
Rusinol, Marcal [3 ,4 ]
Ramos Terrades, Oriol [3 ]
Affiliations
[1] La Rochelle Univ, L3i, La Rochelle, France
[2] Univ Sorbonne Paris Nord, L2TI, Villetaneuse, France
[3] Univ Autonoma Barcelona, CVC, Barcelona, Spain
[4] AllRead MLT, Barcelona, Spain
Keywords
Multimodal document representation learning; Document classification; Contrastive learning; Self-Attention; Transformers
DOI
10.1016/j.patcog.2023.109419
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Multimodal learning from document data has achieved great success lately, as it allows pre-training semantically meaningful features as a prior for a learnable downstream task. In this paper, we approach the document classification problem by learning cross-modal representations through language and vision cues, considering intra- and inter-modality relationships. Instead of merging features from different modalities into a joint representation space, the proposed method exploits high-level interactions and learns relevant semantic information from effective attention flows within and across modalities. The proposed learning objective is devised between intra- and inter-modality alignment tasks, where the similarity distribution per task is computed by contracting positive sample pairs while simultaneously contrasting negative ones in the joint representation space. Extensive experiments on public benchmark datasets demonstrate the effectiveness and the generality of our model on both small-scale and large-scale datasets. (c) 2023 Elsevier Ltd. All rights reserved.
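The abstract describes a contrastive objective with two alignment tasks: intra-modality (within vision or language) and inter-modality (across vision and language), each pulling positive pairs together while pushing negatives apart in a joint space. The sketch below is a minimal illustration only, assuming an InfoNCE-style formulation; the function and argument names (`info_nce`, `vision_feats`, `text_feats`, `temperature=0.07`) and the way the two tasks are combined are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Generic InfoNCE loss: row i of `positive` is the positive pair for
    row i of `anchor`; every other row in the batch acts as a negative."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)               # diagonal entries are positives

def alignment_loss(vision_feats, text_feats, vision_feats_aug, text_feats_aug):
    """Hypothetical combination of the two alignment tasks named in the abstract.
    Intra-modality: two views of the same document within one modality.
    Inter-modality: vision vs. language features of the same document,
    symmetrized over both directions."""
    intra = info_nce(vision_feats, vision_feats_aug) + info_nce(text_feats, text_feats_aug)
    inter = 0.5 * (info_nce(vision_feats, text_feats) + info_nce(text_feats, vision_feats))
    return intra + inter
```

With batched, same-dimensional feature tensors (e.g., shape `(B, D)` per modality), `alignment_loss` returns a scalar suitable for backpropagation; the 0.5 factor simply averages the two inter-modality directions and is a design choice of this sketch, not of the paper.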
Pages: 11
Related Papers
50 items in total
  • [21] Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
    Jian, Yiren
    Gao, Chongyang
    Vosoughi, Soroush
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [22] RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training
    Zhou, Chulun
    Liang, Yunlong
    Meng, Fandong
    Xu, Jinan
    Su, Jinsong
    Zhou, Jie
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023, : 11747 - 11762
  • [23] CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising
    Luo, Jianjie
    Li, Yehao
    Pan, Yingwei
    Yao, Ting
    Chao, Hongyang
    Mei, Tao
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 5600 - 5608
  • [24] Subsampling of Frequent Words in Text for Pre-training a Vision-Language Model
    Liang, Mingliang
    Larson, Martha
    PROCEEDINGS OF THE 1ST WORKSHOP ON LARGE GENERATIVE MODELS MEET MULTIMODAL APPLICATIONS, LGM3A 2023, 2023, : 61 - 67
  • [25] Contrastive Cross-Modal Pre-Training: A General Strategy for Small Sample Medical Imaging
    Liang, Gongbo
    Greenwell, Connor
    Zhang, Yu
    Xing, Xin
    Wang, Xiaoqin
    Kavuluru, Ramakanth
    Jacobs, Nathan
    IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2022, 26 (04) : 1640 - 1649
  • [26] Cross-Modal Contrastive Pre-Training for Few-Shot Skeleton Action Recognition
    Lu, Mingqi
    Yang, Siyuan
    Lu, Xiaobo
    Liu, Jun
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (10) : 9798 - 9807
  • [27] MEDICAL VISION-LANGUAGE REPRESENTATION LEARNING WITH CROSS-MODAL MULTI-TEACHER CONTRASTIVE DISTILLATION
    Chen, Bingzhi
    Zhu, Jiawei
    Liu, Yishu
    Zeng, Biqing
    Pan, Jiahui
    Ding, Meirong
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 1891 - 1895
  • [28] Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding
    Zhang, Taolin
    He, Sunan
    Dai, Tao
    Wang, Zhi
    Chen, Bin
    Xia, Shu-Tao
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 7, 2024, : 7296 - 7304
  • [29] CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations
    Li, Hang
    Ding, Wenbiao
    Kang, Yu
    Liu, Tianqiao
    Wu, Zhongqin
    Liu, Zitao
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 3966 - 3977
  • [30] MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
    Ji, Yatai
    Wang, Junjie
    Gong, Yuan
    Zhang, Lin
    Zhu, Yanru
    Wang, Hongfa
    Zhang, Jiaxing
    Sakai, Tetsuya
    Yang, Yujiu
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23262 - 23271