Multimodal Causal Relations Enhanced CLIP for Image-to-Text Retrieval

Cited by: 0
Authors
Feng, Wenjun [1]
Lin, Dazhen [1]
Cao, Donglin [1]
Affiliation
[1] Xiamen Univ, Dept Artificial Intelligence, Xiamen, Fujian, Peoples R China
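Funding
National Natural Science Foundation of China
Keywords
Multimodal causal discovery; Image-to-text retrieval; CLIP
DOI
10.1007/978-981-99-8429-9_17
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract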
Traditional image-to-text retrieval models learn joint representations by aligning multimodal features. In doing so they typically learn only weak correlations between image and text data, which can introduce noise during modality alignment. To solve this problem, we propose a Multimodal Causal CLIP (MMC-CLIP) network that integrates causal semantic relationships into CLIP for the image-to-text retrieval task. First, we employ the Multimodal Causal Discovery (MCD) method, which models the causal relationships among causal variables in both image and text data to construct a multimodal causal graph. We then integrate the causal nodes extracted from the multimodal causal graph into the CLIP model as learnable prompts, yielding the Multimodal Causal CLIP framework. By integrating causal semantics into CLIP, MMC-CLIP strengthens the correlation between causal variables in the image and text modalities, which improves alignment of multimodal image-text data. We demonstrate the superiority and generalization of the proposed method by outperforming all strong baselines on the image-to-text retrieval task on the Flickr30K and MSCOCO datasets.
Pages: 210-221
Page count: 12