Multimodal Causal Relations Enhanced CLIP for Image-to-Text Retrieval

Cited by: 0

Authors:

Feng, Wenjun [1]

Lin, Dazhen [1]

Cao, Donglin [1]

Affiliations:

[1] Xiamen Univ, Dept Artificial Intelligence, Xiamen, Fujian, Peoples R China

Source:

PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT I | 2024 / Vol. 14425

Funding:

National Natural Science Foundation of China;

Keywords:

Multimodal causal discovery; Image-to-text retrieval; CLIP;

DOI:

10.1007/978-981-99-8429-9_17

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Traditional image-to-text retrieval models learn joint representations by aligning multimodal features. They typically learn only the weak correlation between image and text data, which can introduce noise during modality alignment. To solve this problem, we propose a Multimodal Causal CLIP (MMC-CLIP) network that integrates causal semantic relationships into CLIP for the image-to-text retrieval task. First, we employ the Multimodal Causal Discovery (MCD) method, which models the causal relationships among causal variables in both image and text data to construct a multimodal causal graph. We then integrate the causal nodes extracted from the multimodal causal graph into the CLIP model as learnable prompts, yielding the Multimodal Causal CLIP framework. By integrating causal semantics into CLIP, MMC-CLIP strengthens the correlation between causal variables in the image and text modalities, which improves the alignment of multimodal image-text data. We demonstrate the superiority and generalization of the proposed method by outperforming all strong baselines on the image-to-text retrieval task on the Flickr30K and MSCOCO datasets.
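
The abstract describes prompts being combined with text input before image-text alignment, followed by retrieval. A minimal sketch of that general idea is given below; it is a toy illustration, not the authors' MMC-CLIP implementation. The mean-pooling "encoder", the function names (`encode_text_with_prompts`, `rank_captions`), and the two-dimensional vectors are all assumptions made for illustration. In a real system the prompt vectors would be learned and the encoders would be CLIP's transformers.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity, the score CLIP-style models use to compare embeddings."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def encode_text_with_prompts(token_vecs, prompt_vecs):
    """Hypothetical toy text encoder: prepend prompt vectors, then mean-pool.

    Stands in for a transformer text encoder that receives prompt tokens
    ahead of the caption tokens.
    """
    seq = list(prompt_vecs) + list(token_vecs)
    dim = len(seq[0])
    return [sum(v[i] for v in seq) / len(seq) for i in range(dim)]


def rank_captions(image_vec, caption_vecs):
    """Image-to-text retrieval: caption indices sorted by descending similarity."""
    scores = [cosine(image_vec, c) for c in caption_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Made-up embeddings: one image, two candidate captions, one shared prompt vector.
image_vec = [1.0, 0.0]
prompt_vecs = [[0.5, 0.5]]
captions = [
    encode_text_with_prompts([[0.0, 1.0]], prompt_vecs),  # poorly matching caption
    encode_text_with_prompts([[1.0, 0.2]], prompt_vecs),  # well matching caption
]
ranking = rank_captions(image_vec, captions)
print(ranking)
```

Here the second caption ranks first because its pooled embedding points in nearly the same direction as the image embedding. How MMC-CLIP derives its prompts from the causal graph is not specified in this record and is not modeled here.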

Pages: 210-221

Page count: 12

50 items in total

[41] Text-image multimodal fusion model for enhanced fake news detection
Lin, Szu-Yin
Chen, Yen-Chiu
Chang, Yu-Han
Lo, Shih-Hsin
Chao, Kuo-Ming
SCIENCE PROGRESS, 2024, 107 (04)
[42] A Vision Enhanced Framework for Indonesian Multimodal Abstractive Text-Image Summarization
Song, Yutao
Lin, Nankai
Li, Lingbao
Jiang, Shengyi
PROCEEDINGS OF THE 2024 27TH INTERNATIONAL CONFERENCE ON COMPUTER SUPPORTED COOPERATIVE WORK IN DESIGN, CSCWD 2024, 2024, : 61 - 66
[43] Animating Images to Transfer CLIP for Video-Text Retrieval
Liu, Yu
Chen, Huai
Huang, Lianghua
Chen, Di
Wang, Bin
Pan, Pan
Wang, Lisheng
PROCEEDINGS OF THE 45TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '22), 2022, : 1906 - 1911
[44] Multilevel Language and Vision Integration for Text-to-Clip Retrieval
Xu, Huijuan
He, Kun
Plummer, Bryan A.
Sigal, Leonid
Sclaroff, Stan
Saenko, Kate
THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 9062 - 9069
[45] SMAN: Stacked Multimodal Attention Network for Cross-Modal Image-Text Retrieval
Ji, Zhong
Wang, Haoran
Han, Jungong
Pang, Yanwei
IEEE TRANSACTIONS ON CYBERNETICS, 2022, 52 (02) : 1086 - 1097
[46] Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training
Liu, Chong
Zhang, Yuqi
Wang, Hongsong
Chen, Weihua
Wang, Fan
Huang, Yan
Shen, Yi-Dong
Wang, Liang
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32 : 3622 - 3633
[47] A System of Multimodal Image-Text Retrieval Based on Pre-Trained Models Fusion
Li, Qiang
Zhao, Feng
Zhao, Linlin
Liu, Maokai
Wang, Yubo
Zhang, Shuo
Guo, Yuanyuan
Wang, Shunlu
Wang, Weigang
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2025, 37 (03):
[48] Multimodal medical image retrieval system
Kitanovski, Ivan
Strezoski, Gjorgji
Dimitrovski, Ivica
Madjarov, Gjorgji
Loskovska, Suzana
MULTIMEDIA TOOLS AND APPLICATIONS, 2017, 76 (02) : 2955 - 2978
[49] Exploiting multimodal context in image retrieval
Srihari, RK
Zhang, ZF
LIBRARY TRENDS, 1999, 48 (02) : 496 - 520
[50] Multimodal medical image retrieval system
Ivan Kitanovski
Gjorgji Strezoski
Ivica Dimitrovski
Gjorgji Madjarov
Suzana Loskovska
Multimedia Tools and Applications, 2017, 76 : 2955 - 2978
