Multimodal Knowledge Graph-guided Cross-Modal Graph Network for Image-text Retrieval

Cited by: 0
Authors
Zheng, Juncheng [1 ]
Liang, Meiyu [1 ]
Yu, Yang [1 ]
Du, Junping [1 ]
Xue, Zhe [1 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Comp Sci, Natl Pilot Software Engn Sch, Beijing Key Lab Intelligent Commun Software & Multimedia, Beijing 100876, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Multimodal Fusion; Image-Text Retrieval; Multimodal Knowledge Graph
DOI
10.1109/BigComp60711.2024.00024
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Image-text retrieval is a fundamental cross-modal task that aims to align the representation spaces of the image and text modalities. Existing cross-interactive image-text retrieval methods generate image and sentence embeddings independently, introduce interaction-based networks for cross-modal reasoning, and then perform retrieval with matching metrics. However, these approaches do not fully exploit the semantic relationships among multimodal knowledge to strengthen fine-grained implicit cross-modal semantic reasoning. In this paper, we propose the Multimodal Knowledge Graph-guided Cross-modal Graph Network (MKCGN), which exploits multimodal knowledge graphs to explore cross-modal relationships and enhance global representations. In MKCGN, each image generates semantic and spatial graphs that together form its visual graph, and each sentence generates a textual graph based on the semantic relations among its words. The visual and textual graphs are used for intra-modal reasoning within each modality. We then obtain interest embeddings of image regions and text words from the entity embeddings of a Multimodal Knowledge Graph (MKG), which approximately aligns the representation spaces of regions and words; this yields effective inter-modal interactions, and a graph-node contrast loss learns fine-grained cross-modal correspondences for inter-modal semantic reasoning. Finally, we mine the implicit semantics and latent relationships of images and texts through the MKG to enhance the global representations, and apply a cross-modal contrast loss to narrow the gap between coarse-grained cross-modal representations. Experiments on the MS-COCO and Flickr30K benchmarks show that the proposed MKCGN outperforms state-of-the-art image-text retrieval methods.
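The cross-modal contrast loss mentioned in the abstract is not specified in this record. As a minimal sketch, assuming a standard symmetric InfoNCE-style formulation over in-batch negatives (the function name, temperature default, and batching scheme below are illustrative assumptions, not details taken from the paper), such a loss could look like:

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(image_emb: torch.Tensor,
                                 text_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical symmetric InfoNCE loss for paired image/text embeddings.

    image_emb, text_emb: (batch, dim) global representations; the i-th image
    and the i-th text form a matching pair, and every other pairing in the
    batch serves as a negative.
    """
    # Cosine similarity via dot products of L2-normalised embeddings.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch)

    # Matching pairs lie on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image retrieval directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Pulling matched pairs together and pushing mismatched pairs apart in this way is one common reading of "narrowing the space of coarse-grained cross-modal representations"; the paper's actual formulation may differ.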
Pages: 97-100
Number of Pages: 4
Related Papers (50 records in total)
  • [31] Probability Distribution Representation Learning for Image-Text Cross-Modal Retrieval
    Yang, C.
    Liu, L.
    Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics, 2022, 34 (05): 751-759
  • [32] Cross-modal Semantically Augmented Network for Image-text Matching
    Yao, Tao
    Li, Yiru
    Li, Ying
    Zhu, Yingying
    Wang, Gang
    Yue, Jun
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (04)
  • [33] SAM: cross-modal semantic alignments module for image-text retrieval
    Park, Pilseo
    Jang, Soojin
    Cho, Yunsung
    Kim, Youngbin
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (04): 12363-12377
  • [34] Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval
    Lu, Haoyu
    Huo, Yuqi
    Ding, Mingyu
    Fei, Nanyi
    Lu, Zhiwu
    MACHINE INTELLIGENCE RESEARCH, 2023, 20 (04): 569-582
  • [36] Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval
    Mithun, Niluthpol Chowdhury
    Panda, Rameswar
    Papalexakis, Evangelos E.
    Roy-Chowdhury, Amit K.
    PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18), 2018: 1856-1864
  • [38] Learning Hierarchical Semantic Correspondences for Cross-Modal Image-Text Retrieval
    Zeng, Sheng
    Liu, Changhong
    Zhou, Jun
    Chen, Yong
    Jiang, Aiwen
    Li, Hanxi
    PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022: 239-248
  • [39] An Enhanced Feature Extraction Framework for Cross-Modal Image-Text Retrieval
    Zhang, Jinzhi
    Wang, Luyao
    Zheng, Fuzhong
    Wang, Xu
    Zhang, Haisu
    REMOTE SENSING, 2024, 16 (12)
  • [40] RICH: A rapid method for image-text cross-modal hash retrieval
    Li, Bo
    Yao, Dan
    Li, Zhixin
    DISPLAYS, 2023, 79