Multi-Modal Contrastive Pre-training for Recommendation

Cited by: 0
Authors
Liu, Zhuang [1 ]
Ma, Yunpu [2 ]
Schubert, Matthias [2 ]
Ouyang, Yuanxin [1 ]
Xiong, Zhang [3 ]
Affiliations
[1] Beihang Univ, State Key Lab Software Dev Environm, Beijing, Peoples R China
[2] Ludwig Maximilians Univ Munchen, Lehrstuhl Datenbanksyst & Data Min, Munich, Germany
[3] Beihang Univ, Minist Educ, Engn Res Ctr Adv Comp Applicat Technol, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Recommender system; Multi-modal side information; Contrastive learning; Pre-training model;
DOI
10.1145/3512527.3531378
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Personalized recommendation plays a central role in various online applications. To provide a quality recommendation service, it is crucial to consider the multi-modal information associated with users and items, e.g., review text, description text, and images. However, many existing approaches do not fully explore and fuse multiple modalities. To address this problem, we propose a multi-modal contrastive pre-training model for recommendation. We first construct a homogeneous item graph and a user graph based on co-interaction relationships. For users, we propose intra-modal and inter-modal aggregation to fuse review texts with the structural information of the user graph. For items, we consider three modalities: description text, images, and the item graph. Moreover, the description text and the image of the same item complement each other, so each can serve as supervision for the other. To capture this signal and better exploit the potential correlation between modalities, we propose a self-supervised contrastive inter-modal alignment task that makes the textual and visual representations of an item as similar as possible. We then apply inter-modal aggregation to obtain the multi-modal item representation. Next, we employ a binary cross-entropy loss to capture the potential correlation between users and items. Finally, we fine-tune the pre-trained multi-modal representations with an existing recommendation model. Extensive experiments on three real-world datasets verify the rationality and effectiveness of the proposed method.
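The sketch below is a minimal, illustrative PyTorch rendering (not the authors' implementation) of the two pre-training objectives named in the abstract: a symmetric InfoNCE-style contrastive loss aligning the text and image embeddings of the same item, and a binary cross-entropy loss over user-item interaction scores. The function names, the temperature value, and the dot-product scorer are assumptions made for illustration only.

import torch
import torch.nn.functional as F

def inter_modal_contrastive_loss(text_emb, image_emb, temperature=0.1):
    # Symmetric InfoNCE: pull together the text and image embeddings of the
    # same item, push apart those of different items in the batch.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # matching pairs on the diagonal
    loss_t2i = F.cross_entropy(logits, targets)               # text -> image direction
    loss_i2t = F.cross_entropy(logits.t(), targets)           # image -> text direction
    return (loss_t2i + loss_i2t) / 2

def user_item_bce_loss(user_emb, item_emb, labels):
    # Binary cross-entropy on dot-product user-item affinity scores.
    scores = (user_emb * item_emb).sum(dim=-1)
    return F.binary_cross_entropy_with_logits(scores, labels.float())

# Toy usage with random tensors standing in for encoder outputs.
if __name__ == "__main__":
    text, image = torch.randn(8, 64), torch.randn(8, 64)
    users, items = torch.randn(8, 64), torch.randn(8, 64)
    labels = torch.randint(0, 2, (8,))
    print(inter_modal_contrastive_loss(text, image).item())
    print(user_item_bce_loss(users, items, labels).item())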
Pages: 99-108
Number of pages: 10
Related Papers
50 records in total
  • [31] LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding
    Tu, Yi
    Guo, Ya
    Chen, Huan
    Tang, Jinyang
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 15200 - 15212
  • [32] Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual Question Answering
    Salemi, Alireza
    Rafiee, Mahta
    Zamani, Hamed
    PROCEEDINGS OF THE 2023 ACM SIGIR INTERNATIONAL CONFERENCE ON THE THEORY OF INFORMATION RETRIEVAL, ICTIR 2023, 2023, : 169 - 176
  • [33] LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding
    Xu, Yang
    Xu, Yiheng
    Lv, Tengchao
    Cui, Lei
    Wei, Furu
    Wang, Guoxin
    Lu, Yijuan
    Florencio, Dinei
    Zhang, Cha
    Che, Wanxiang
    Zhang, Min
    Zhou, Lidong
    59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (ACL-IJCNLP 2021), VOL 1, 2021, : 2579 - 2591
  • [34] RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training
    Yuan, Zheng
    Jin, Qiao
    Tan, Chuanqi
    Zhao, Zhengyun
    Yuan, Hongyi
    Huang, Fei
    Huang, Songfang
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 547 - 556
  • [35] Dynamic facial expression recognition with pseudo-label guided multi-modal pre-training
    Yin, Bing
    Yin, Shi
    Liu, Cong
    Zhang, Yanyong
    Xi, Changfeng
    Yin, Baocai
    Ling, Zhenhua
    IET COMPUTER VISION, 2024, 18 (01) : 33 - 45
  • [36] WenLan: Efficient Large-Scale Multi-Modal Pre-Training on Real World Data
    Song, Ruihua
    MMPT '21: PROCEEDINGS OF THE 2021 WORKSHOP ON MULTI-MODAL PRE-TRAINING FOR MULTIMEDIA UNDERSTANDING, 2021, : 3 - 3
  • [37] Multi-modal U-Nets with Boundary Loss and Pre-training for Brain Tumor Segmentation
    Lorenzo, Pablo Ribalta
    Marcinkiewicz, Michal
    Nalepa, Jakub
    BRAINLESION: GLIOMA, MULTIPLE SCLEROSIS, STROKE AND TRAUMATIC BRAIN INJURIES (BRAINLES 2019), PT II, 2020, 11993 : 135 - 147
  • [38] A multi-modal pre-training transformer for universal transfer learning in metal-organic frameworks
    Kang, Yeonghun
    Park, Hyunsoo
    Smit, Berend
    Kim, Jihan
    NATURE MACHINE INTELLIGENCE, 2023, 5 (03) : 309 - 318
  • [39] Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information
    Su, Weijie
    Zhu, Xizhou
    Tao, Chenxin
    Lu, Lewei
    Li, Bin
    Huang, Gao
    Qiao, Yu
    Wang, Xiaogang
    Zhou, Jie
    Dai, Jifeng
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 15888 - 15899
  • [40] Using contrastive language-image pre-training for Thai recipe recommendation
    Chuenbanluesuk, Thanatkorn
    Plodprong, Voramate
    Karoon, Weerasak
    Rueangsri, Kotchakorn
    Pojam, Suthasinee
    Siriborvornratanakul, Thitirat
    LANGUAGE RESOURCES AND EVALUATION, 2025,