Multimodal Causal Relations Enhanced CLIP for Image-to-Text Retrieval

Cited by: 0

Authors:

Feng, Wenjun [1]

Lin, Dazhen [1]

Cao, Donglin [1]

Affiliations:

[1] Xiamen Univ, Dept Artificial Intelligence, Xiamen, Fujian, Peoples R China

Source:

PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT I | 2024 / Vol. 14425

Funding:

National Natural Science Foundation of China;

Keywords:

Multimodal causal discovery; Image-to-text retrieval; CLIP;

DOI:

10.1007/978-981-99-8429-9_17

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Traditional image-to-text retrieval models learn joint representations by aligning multimodal features. They typically capture only weak correlations between image and text data, which can introduce noise during modality alignment. To solve this problem, we propose a Multimodal Causal CLIP (MMC-CLIP) network that integrates causal semantic relationships into CLIP for the image-to-text retrieval task. First, we employ the Multimodal Causal Discovery (MCD) method, which models the relationships among causal variables in both image and text data to construct a multimodal causal graph. We then integrate the causal nodes extracted from this graph into the CLIP model as learnable prompts, yielding the Multimodal Causal CLIP framework. By integrating causal semantics into CLIP, MMC-CLIP strengthens the correlation between causal variables in the image and text modalities, which improves the alignment of multimodal image-text data. We demonstrate the superiority and generalization of the proposed method by outperforming all strong baselines on the image-to-text retrieval task on the Flickr30K and MSCOCO datasets.
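
The retrieval step described in the abstract can be pictured with a minimal sketch. This is not the authors' MMC-CLIP implementation. It only illustrates, under stated assumptions, how prompt vectors (standing in here for the causal nodes) might be placed in front of a caption's token vectors, and how a CLIP-style model could then rank candidate captions for an image by cosine similarity. The function names, the mean-pooling stand-in for the text encoder, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prepend_prompts(prompt_vectors, token_vectors):
    """Hypothetical helper: put prompt vectors in front of the caption tokens."""
    return list(prompt_vectors) + list(token_vectors)


def toy_text_encoder(sequence):
    """Assumption: mean pooling stands in for a real text encoder."""
    dim = len(sequence[0])
    return [sum(vec[i] for vec in sequence) / len(sequence) for i in range(dim)]


def rank_captions(image_embedding, caption_embeddings):
    """Return caption indices sorted from most to least similar to the image."""
    scores = [cosine(image_embedding, c) for c in caption_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy 2-dimensional data, invented purely for illustration.
prompts = [[1.0, 0.0]]                 # one prompt vector
captions = [
    [[0.0, 1.0], [0.0, 1.0]],          # caption 0: two token vectors
    [[1.0, 0.0], [1.0, 0.2]],          # caption 1: two token vectors
]
image_embedding = [1.0, 0.1]

caption_embeddings = [
    toy_text_encoder(prepend_prompts(prompts, tokens)) for tokens in captions
]
ranking = rank_captions(image_embedding, caption_embeddings)
print(ranking)
```

In a real system the prompt vectors would be trainable parameters and the encoders would be the CLIP image and text towers; the ranking logic is the only part this sketch is meant to convey.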


Pages: 210-221

Page count: 12
