Multimodal Causal Relations Enhanced CLIP for Image-to-Text Retrieval

Cited by: 0
Authors
Feng, Wenjun [1]
Lin, Dazhen [1]
Cao, Donglin [1]
Affiliations
[1] Xiamen University, Department of Artificial Intelligence, Xiamen, Fujian, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Multimodal causal discovery; Image-to-text retrieval; CLIP
DOI
10.1007/978-981-99-8429-9_17
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Traditional image-to-text retrieval models learn joint representations by aligning multimodal features. In doing so they typically capture only weak correlations between image and text data, which can introduce noise during modality alignment. To address this problem, we propose a Multimodal Causal CLIP (MMC-CLIP) network that integrates causal semantic relationships into CLIP for the image-to-text retrieval task. First, we employ the Multimodal Causal Discovery (MCD) method, which models the causal relationships among causal variables in both image and text data, to construct a multimodal causal graph. We then integrate the causal nodes extracted from this graph into the CLIP model as learnable prompts, yielding the Multimodal Causal CLIP framework. By integrating causal semantics into CLIP, MMC-CLIP strengthens the correlation between causal variables in the image and text modalities, which improves the alignment of multimodal image-text data. We demonstrate the superiority and generalization of the proposed method by outperforming all strong baselines on the image-to-text retrieval task on the Flickr30K and MSCOCO datasets.
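For readers unfamiliar with the retrieval setting, the following is a minimal sketch of image-to-text retrieval in a joint embedding space: candidate captions are ranked by cosine similarity to an image embedding. It is an illustration only and does not reproduce MMC-CLIP, the MCD method, or the causal prompts. The function names (`cosine`, `rank_captions`) and the three-dimensional toy vectors are hypothetical; in practice the embeddings would come from trained image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_captions(image_vec, caption_vecs):
    """Return caption indices ordered from most to least similar to the image."""
    scores = [cosine(image_vec, c) for c in caption_vecs]
    return sorted(range(len(caption_vecs)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy embeddings (assumed, not taken from any dataset or model).
image_vec = [0.9, 0.1, 0.0]
caption_vecs = [
    [0.0, 1.0, 0.0],  # caption 0: mostly unrelated direction
    [1.0, 0.0, 0.0],  # caption 1: closest to the image embedding
    [0.5, 0.5, 0.0],  # caption 2: partially related
]

ranking = rank_captions(image_vec, caption_vecs)
print(ranking)
```

With these toy vectors, caption 1 is ranked first, then caption 2, then caption 0. A better-aligned joint space, which is what the abstract aims for, would place the correct caption nearer the top of such a ranking.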
Pages: 210-221
Page count: 12