Multimodal Causal Relations Enhanced CLIP for Image-to-Text Retrieval

Cited by: 0
Authors
Feng, Wenjun [1 ]
Lin, Dazhen [1 ]
Cao, Donglin [1 ]
Affiliations
[1] Xiamen Univ, Dept Artificial Intelligence, Xiamen, Fujian, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multimodal causal discovery; Image-to-text retrieval; CLIP;
DOI
10.1007/978-981-99-8429-9_17
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Traditional image-to-text retrieval models learn joint representations by aligning multimodal features. They typically capture only weak correlations between image and text data, which can introduce noise during modality alignment. To address this problem, we propose a Multimodal Causal CLIP (MMC-CLIP) network that integrates causal semantic relationships into CLIP for the image-to-text retrieval task. First, we employ the Multimodal Causal Discovery (MCD) method, which models the causal relationships among causal variables in both image and text data, to construct a multimodal causal graph. Next, we integrate the causal nodes extracted from the multimodal causal graph into the CLIP model as learnable prompts, yielding the Multimodal Causal CLIP framework. By integrating causal semantics into CLIP, MMC-CLIP strengthens the correlation between causal variables in the image and text modalities, which improves the alignment of multimodal image-text data. We demonstrate the effectiveness and generalization of the proposed method by outperforming all strong baselines on the image-to-text retrieval task on the Flickr30K and MSCOCO datasets.
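As a rough illustration of the retrieval task the abstract refers to, the sketch below ranks candidate captions for an image by cosine similarity between embeddings and computes Recall@K. It is not the MMC-CLIP method: the causal discovery step and the learnable prompts are not modeled, the embeddings are made-up toy vectors rather than CLIP outputs, and the function names (`cosine`, `rank_texts`, `recall_at_k`) are hypothetical. It also assumes a single gold caption per image, which is a simplification.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_texts(image_vec, text_vecs):
    """Return caption indices sorted from most to least similar to the image."""
    scores = [cosine(image_vec, t) for t in text_vecs]
    return sorted(range(len(text_vecs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(image_vecs, text_vecs, gold, k):
    """Fraction of images whose gold caption index appears in the top-k ranking.

    Assumption: exactly one gold caption per image.
    """
    hits = 0
    for image_vec, gold_index in zip(image_vecs, gold):
        if gold_index in rank_texts(image_vec, text_vecs)[:k]:
            hits += 1
    return hits / len(image_vecs)


# Toy embeddings, invented for illustration only.
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
texts = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]
gold = [0, 1]  # gold[i] is the index of the caption paired with images[i]

print(rank_texts(images[0], texts))
print(recall_at_k(images, texts, gold, k=1))
```

In a real setting the vectors would come from the image and text encoders of a trained model, and the ranking step would stay the same.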
Pages: 210 - 221
Page count: 12
Related papers
50 in total
  • [1] Transferring Image-CLIP to Video-Text Retrieval via Temporal Relations
    Fang, Han
    Xiong, Pengfei
    Xu, Luhui
    Luo, Wenhan
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 7772 - 7785
  • [2] Eyes Closed, Safety on: Protecting Multimodal LLMs via Image-to-Text Transformation
    Gou, Yunhao
    Chen, Kai
    Liu, Zhili
    Hong, Lanqing
    Xu, Hang
    Li, Zhenguo
    Yeung, Dit-Yan
    Kwok, James T.
    Zhang, Yu
    COMPUTER VISION - ECCV 2024, PT XVII, 2025, 15075 : 388 - 404
  • [3] On the Limitations of Visual-Semantic Embedding Networks for Image-to-Text Information Retrieval
    Gong, Yan
    Cosma, Georgina
    Fang, Hui
    JOURNAL OF IMAGING, 2021, 7 (08)
  • [4] Multi-label semantic sharing based on graph convolutional network for image-to-text retrieval
    Ma, Ying
    Wang, Meng
    Lu, Guangyun
    Sun, Yajun
    VISUAL COMPUTER, 2025, 41 (03): 1827 - 1840
  • [5] Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval
    Wang, Hao
    Lin, Guosheng
    Hoi, Steven
    Miao, Chunyan
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022: 5517 - 5526
  • [6] MKVSE: Multimodal Knowledge Enhanced Visual-semantic Embedding for Image-text Retrieval
    Feng, Duoduo
    He, Xiangteng
    Peng, Yuxin
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (05)
  • [7] Evaluating Text-to-Visual Generation with Image-to-Text Generation
    Lin, Zhiqiu
    Athaki, Deepak
    Li, Baiqi
    Li, Jiayao
    Xia, Xide
    Neubig, Graham
    Zhang, Pengchuan
    Ramanan, Deva
    COMPUTER VISION - ECCV 2024, PT IX, 2025, 15067 : 366 - 384
  • [8] Shapley visual transformers for image-to-text generation
    Belhadi, Asma
    Djenouri, Youcef
    Belbachir, Ahmed Nabil
    Michalak, Tomasz
    Srivastava, Gautam
    APPLIED SOFT COMPUTING, 2024, 166
  • [9] CLIP2TF: Multimodal video-text retrieval for adolescent education
    Sun, Xiaoning
    Fan, Tao
    Li, Hongxu
    Wang, Guozhong
    Ge, Peien
    Shang, Xiwu
    DISPLAYS, 2024, 84
  • [10] Causal image-text retrieval embedded with consensus knowledge
    Liang Y.
    Liu X.
    Ma Z.
    Li Z.
    Gongcheng Kexue Xuebao/Chinese Journal of Engineering, 2024, 46 (02): 317 - 328