VLCDoC: Vision-Language contrastive pre-training model for cross-Modal document classification

Cited by: 16
Authors
Bakkali, Souhail [1 ]
Ming, Zuheng [1 ,2 ]
Coustaty, Mickael [1 ]
Rusinol, Marcal [3 ,4 ]
Ramos Terrades, Oriol [3 ]
Affiliations
[1] La Rochelle Univ, L3i, La Rochelle, France
[2] Univ Sorbonne Paris Nord, L2TI, Villetaneuse, France
[3] Univ Autonoma Barcelona, CVC, Barcelona, Spain
[4] AllRead MLT, Barcelona, Spain
Keywords
Multimodal document representation learning; Document classification; Contrastive learning; Self-Attention; Transformers
DOI
10.1016/j.patcog.2023.109419
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Multimodal learning from document data has achieved great success lately, as it allows pre-training semantically meaningful features as a prior for a learnable downstream task. In this paper, we approach the document classification problem by learning cross-modal representations through language and vision cues, considering intra- and inter-modality relationships. Instead of merging features from different modalities into a joint representation space, the proposed method exploits high-level interactions and learns relevant semantic information from effective attention flows within and across modalities. The proposed learning objective is devised between intra- and inter-modality alignment tasks, where the similarity distribution per task is computed by contracting positive sample pairs while simultaneously contrasting negative ones in the joint representation space. Extensive experiments on public benchmark datasets demonstrate the effectiveness and the generality of our model on both small-scale and large-scale datasets. (c) 2023 Elsevier Ltd. All rights reserved.
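The abstract describes a contrastive objective with two alignment tasks: intra-modality (within vision or language) and inter-modality (across vision and language), each pulling positive pairs together while pushing negatives apart in a joint space. The sketch below is a minimal illustration only, assuming an InfoNCE-style formulation; the function and argument names (`info_nce`, `vision_feats`, `text_feats`, `temperature=0.07`) and the way the two tasks are combined are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Generic InfoNCE loss: row i of `positive` is the positive pair for
    row i of `anchor`; every other row in the batch acts as a negative."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)               # diagonal entries are positives

def alignment_loss(vision_feats, text_feats, vision_feats_aug, text_feats_aug):
    """Hypothetical combination of the two alignment tasks named in the abstract.
    Intra-modality: two views of the same document within one modality.
    Inter-modality: vision vs. language features of the same document,
    symmetrized over both directions."""
    intra = info_nce(vision_feats, vision_feats_aug) + info_nce(text_feats, text_feats_aug)
    inter = 0.5 * (info_nce(vision_feats, text_feats) + info_nce(text_feats, vision_feats))
    return intra + inter
```

With batched, same-dimensional feature tensors (e.g., shape `(B, D)` per modality), `alignment_loss` returns a scalar suitable for backpropagation; the 0.5 factor simply averages the two inter-modality directions and is a design choice of this sketch, not of the paper.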
Pages: 11
Related Papers
50 items in total
  • [21] Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
    Jian, Yiren
    Gao, Chongyang
    Vosoughi, Soroush
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [22] RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training
    Zhou, Chulun
    Liang, Yunlong
    Meng, Fandong
    Xu, Jinan
    Su, Jinsong
    Zhou, Jie
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023, : 11747 - 11762
  • [23] CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising
    Luo, Jianjie
    Li, Yehao
    Pan, Yingwei
    Yao, Ting
    Chao, Hongyang
    Mei, Tao
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 5600 - 5608
  • [24] Subsampling of Frequent Words in Text for Pre-training a Vision-Language Model
    Liang, Mingliang
    Larson, Martha
    PROCEEDINGS OF THE 1ST WORKSHOP ON LARGE GENERATIVE MODELS MEET MULTIMODAL APPLICATIONS, LGM3A 2023, 2023, : 61 - 67
  • [25] Contrastive Cross-Modal Pre-Training: A General Strategy for Small Sample Medical Imaging
    Liang, Gongbo
    Greenwell, Connor
    Zhang, Yu
    Xing, Xin
    Wang, Xiaoqin
    Kavuluru, Ramakanth
    Jacobs, Nathan
    IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2022, 26 (04) : 1640 - 1649
  • [26] Cross-Modal Contrastive Pre-Training for Few-Shot Skeleton Action Recognition
    Lu, Mingqi
    Yang, Siyuan
    Lu, Xiaobo
    Liu, Jun
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (10) : 9798 - 9807
  • [27] MEDICAL VISION-LANGUAGE REPRESENTATION LEARNING WITH CROSS-MODAL MULTI-TEACHER CONTRASTIVE DISTILLATION
    Chen, Bingzhi
    Zhu, Jiawei
    Liu, Yishu
    Zeng, Biqing
    Pan, Jiahui
    Ding, Meirong
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 1891 - 1895
  • [28] Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding
    Zhang, Taolin
    He, Sunan
    Dai, Tao
    Wang, Zhi
    Chen, Bin
    Xia, Shu-Tao
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 7, 2024, : 7296 - 7304
  • [29] CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations
    Li, Hang
    Ding, Wenbiao
    Kang, Yu
    Liu, Tianqiao
    Wu, Zhongqin
    Liu, Zitao
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 3966 - 3977
  • [30] MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
    Ji, Yatai
    Wang, Junjie
    Gong, Yuan
    Zhang, Lin
    Zhu, Yanru
    Wang, Hongfa
    Zhang, Jiaxing
    Sakai, Tetsuya
    Yang, Yujiu
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23262 - 23271