Multimodal Causal Relations Enhanced CLIP for Image-to-Text Retrieval

Cited by: 0

Authors:

Feng, Wenjun [1]

Lin, Dazhen [1]

Cao, Donglin [1]

Affiliation:

[1] Xiamen University, Department of Artificial Intelligence, Xiamen, Fujian, People's Republic of China

Source:

PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT I | 2024, Vol. 14425

Funding:

National Natural Science Foundation of China

Keywords:

Multimodal causal discovery; Image-to-text retrieval; CLIP

DOI:

10.1007/978-981-99-8429-9_17

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Traditional image-to-text retrieval models learn joint representations by aligning multimodal features. They typically capture only weak correlations between image and text data, which can introduce noise during modality alignment. To solve this problem, we propose a Multimodal Causal CLIP (MMC-CLIP) network that integrates causal semantic relationships into CLIP for the image-to-text retrieval task. First, we employ the Multimodal Causal Discovery (MCD) method, which models the causal relationships among causal variables in both image and text data to construct a multimodal causal graph. Next, we integrate the causal nodes extracted from the multimodal causal graph into the CLIP model as learnable prompts, yielding the Multimodal Causal CLIP framework. By integrating causal semantics into CLIP, MMC-CLIP strengthens the correlation between causal variables in the image and text modalities, which improves the alignment of multimodal image-text data. We demonstrate the superiority and generalization of the proposed method by outperforming all strong baselines on the image-to-text retrieval task on the Flickr30K and MSCOCO datasets.

Pages: 210-221

Number of pages: 12
