Multimodal Causal Relations Enhanced CLIP for Image-to-Text Retrieval

Authors
Feng, Wenjun [1]
Lin, Dazhen [1]
Cao, Donglin [1]
Affiliations
[1] Xiamen Univ, Dept Artificial Intelligence, Xiamen, Fujian, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Multimodal causal discovery; Image-to-text retrieval; CLIP;
DOI
10.1007/978-981-99-8429-9_17
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Traditional image-to-text retrieval models learn joint representations by aligning multimodal features, typically learning weak correlations between image and text data, which can introduce noise during modality alignment. To solve this problem, we propose a Multimodal Causal CLIP (MMC-CLIP) network that integrates causal semantic relationships into CLIP for the image-to-text retrieval task. First, we employ the Multimodal Causal Discovery (MCD) method, which models the causal relationships among causal variables in both image and text data to construct a multimodal causal graph. Next, we integrate the causal nodes extracted from the multimodal causal graph into the CLIP model as learnable prompts, yielding the Multimodal Causal CLIP framework. By integrating causal semantics into CLIP, MMC-CLIP strengthens the correlation between causal variables in the image and text modalities, which improves alignment of multimodal image-text data. We demonstrate the superiority and generalization of the proposed method by outperforming all strong baselines on the image-to-text retrieval task on the Flickr30K and MSCOCO datasets.
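As a rough illustration of the retrieval step that CLIP-style models rely on, the sketch below ranks candidate captions for one image by cosine similarity between embeddings. It is not the authors' MMC-CLIP: the causal discovery and learnable-prompt components are not reproduced, the embeddings are made-up toy vectors, and the names `cosine` and `rank_captions` are hypothetical helpers introduced only for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_captions(image_emb, caption_embs):
    """Return caption indices sorted from most to least similar to the image."""
    scores = [(cosine(image_emb, emb), idx) for idx, emb in enumerate(caption_embs)]
    # Sort by descending similarity; break ties by the lower index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scores]

# Toy embeddings (assumed values, standing in for encoder outputs).
image_emb = [0.9, 0.1, 0.0]
caption_embs = [
    [0.0, 1.0, 0.0],  # caption 0: mostly unrelated
    [1.0, 0.0, 0.0],  # caption 1: closest to the image
    [0.5, 0.5, 0.0],  # caption 2: partially related
]

ranking = rank_captions(image_emb, caption_embs)
print(ranking)  # [1, 2, 0]
```

In an actual system the vectors would come from the image and text encoders, and any prompt-based enhancement would change the embeddings rather than this ranking step.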
Pages: 210-221
Page count: 12
Related papers
50 in total
  • [21] NeoDescriber: An image-to-text model for automatic style description of neoclassical architecture
    Qin, Wenke
    Chen, Lang
    Zhang, Boyi
    Chen, Weiya
    Luo, Hanbin
    EXPERT SYSTEMS WITH APPLICATIONS, 2023, 231
  • [22] Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation
    Miura, Yasuhide
    Zhang, Yuhao
    Tsai, Emily Bao
    Langlotz, Curtis P.
    Jurafsky, Dan
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 5288 - 5304
  • [23] Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting
    Wasim, Syed Talal
    Naseer, Muzammal
    Khan, Salman
    Khan, Fahad Shahbaz
    Shah, Mubarak
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23034 - 23044
  • [24] Text-to-Image Retrieval Based on Incremental Association via Multimodal Hypernetworks
    Ha, Jung-Woo
    Lee, Beom-Jin
    Zhang, Byoung-Tak
    PROCEEDINGS 2012 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC), 2012, : 3245 - 3250
  • [25] THE ROLE OF CAUSAL CONNECTIONS IN THE RETRIEVAL OF TEXT
    OBRIEN, EJ
    MYERS, JL
    MEMORY & COGNITION, 1987, 15 (05) : 419 - 427
  • [26] Semantic Enhanced Sketch Based Image Retrieval with Incomplete Multimodal Query
    Das Bhattacharjee, Sreyasee
    Yuan, Junsong
    2020 IEEE SIXTH INTERNATIONAL CONFERENCE ON MULTIMEDIA BIG DATA (BIGMM 2020), 2020, : 86 - 93
  • [27] Using image-to-text recognition technology to facilitate vocabulary acquisition in authentic contexts
    Shadiev, Rustam
    Wu, Ting-Ting
    Huang, Yueh-Min
    RECALL, 2020, 32 (02) : 195 - 212
  • [28] Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models
    Huang, Jia-Hong
    Zhu, Hongyi
    Shen, Yixian
    Rudinac, Stevan
    Kanoulas, Evangelos
    MULTIMEDIA MODELING, MMM 2025, PT IV, 2025, 15523 : 413 - 427
  • [29] Understanding image-text relations and news values for multimodal news analysis
    Cheema, Gullal S.
    Hakimov, Sherzod
    Mueller-Budack, Eric
    Otto, Christian
    Bateman, John A.
    Ewerth, Ralph
    FRONTIERS IN ARTIFICIAL INTELLIGENCE, 2023, 6
  • [30] Sequential Structured Fusion of Image and Text for Enhanced Multimodal Abstractive Summarization
    He, Rui
    Qi, Minjie
    Wang, Hongling
    Wang, Zhongqing
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT IV, NLPCC 2024, 2025, 15362 : 290 - 302