Complementarity is the king: Multi-modal and multi-grained hierarchical semantic enhancement network for cross-modal retrieval

Cited by: 6
Authors
Pei, Xinlei [1 ,2 ]
Liu, Zheng [1 ,2 ]
Gao, Shanshan [1 ,2 ]
Su, Yijun [3 ]
Affiliations
[1] Shandong Univ Finance & Econ, Sch Comp Sci & Technol, Jinan 250014, Shandong, Peoples R China
[2] Shandong Univ Finance & Econ, Shandong Prov Key Lab Digital Media Technol, Jinan 250014, Shandong, Peoples R China
[3] Minzu Univ China, Sch Informat Engn, Beijing 100081, Peoples R China
Keywords
Cross-modal retrieval; Primary similarity; Auxiliary similarity; Semantic enhancement; Multi-spring balance loss
DOI
10.1016/j.eswa.2022.119415
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification numbers
081104; 0812; 0835; 1405
Abstract
Cross-modal retrieval takes a query from one modality and retrieves relevant results from another, and its key issue lies in how to learn the cross-modal similarity. However, the complete semantic information of a specific concept is widely scattered across multi-modal and multi-grained data, and most existing methods cannot capture it thoroughly enough to learn the cross-modal similarity accurately. Therefore, we propose a Multi-modal and Multi-grained Hierarchical Semantic Enhancement network (M2HSE), which obtains more complete semantic information in two stages by fusing the complementarity in multi-modal and multi-grained data. In stage 1, two classes of cross-modal similarity (primary similarity and auxiliary similarity) are computed comprehensively in two sub-networks. Specifically, the primary similarities from the two sub-networks are fused to perform cross-modal retrieval, while the auxiliary similarity provides a valuable complement to the primary similarity. In stage 2, the multi-spring balance loss is proposed to optimize the cross-modal similarity more flexibly. With this loss, the most representative samples are selected to establish a multi-spring balance system, which adaptively optimizes the cross-modal similarities until an equilibrium state is reached. Extensive experiments on public benchmark datasets clearly demonstrate the effectiveness of the proposed method and show its competitive performance against state-of-the-art approaches.
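The abstract names two mechanisms, the fusion of primary and auxiliary similarities (stage 1) and the multi-spring balance loss (stage 2), without giving their formulas. The following minimal PyTorch sketch illustrates one plausible reading of the spring analogy: each pair is treated as a spring whose potential energy grows quadratically with its displacement from a rest position, so minimizing the total energy drives the system toward equilibrium. The function names, fusion weights alpha and beta, spring constants k_pos and k_neg, and the margin are all assumptions for illustration, not the paper's published formulation.

import torch

def fuse_similarities(sim_a, sim_b, aux, alpha=0.5, beta=0.1):
    """Stage-1 sketch: fuse the primary similarities from the two
    sub-networks and add the auxiliary similarity as a complement.
    alpha and beta are hypothetical fusion weights."""
    primary = alpha * sim_a + (1.0 - alpha) * sim_b
    return primary + beta * aux

def multi_spring_balance_loss(sim, pos_mask, k_pos=1.0, k_neg=1.0, margin=0.2):
    """Stage-2 sketch: spring-analogy loss. Positive pairs are pulled
    toward similarity 1, negative pairs are pushed below `margin`; each
    pair's displacement from its rest position is penalized with the
    spring potential energy 0.5 * k * x^2."""
    pos_disp = (1.0 - sim[pos_mask]).clamp(min=0.0)
    neg_disp = (sim[~pos_mask] - margin).clamp(min=0.0)
    return 0.5 * (k_pos * pos_disp.pow(2).sum()
                  + k_neg * neg_disp.pow(2).sum())

# Toy usage: an 8x8 image-text similarity matrix whose diagonal
# entries are the matched (positive) pairs.
sim = torch.rand(8, 8, requires_grad=True)
pos = torch.eye(8, dtype=torch.bool)
loss = multi_spring_balance_loss(sim, pos)
loss.backward()  # gradients act like spring forces on each similarity

In this toy system the loss reaches zero exactly when every positive similarity equals 1 and every negative similarity sits at or below the margin, mirroring the abstract's description of optimizing the similarities until the equilibrium state is reached.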
Pages: 21
Related papers
50 records in total
• [41] Liang, Bin; Lou, Chenwei; Li, Xiang; Gui, Lin; Yang, Min; Xu, Ruifeng. Multi-Modal Sarcasm Detection with Interactive In-Modal and Cross-Modal Graphs. Proceedings of the 29th ACM International Conference on Multimedia (MM 2021), 2021: 4707-4715.
• [42] Liu, Zirong; Hu, Yan; Qiu, Zhongxi; Niu, Yanyan; Zhou, Dan; Li, Xiaoling; Shen, Junyong; Jiang, Hongyang; Li, Heng; Liu, Jiang. Cross-modal attention network for retinal disease classification based on multi-modal images. Biomedical Optics Express, 2024, 15(6): 3699-3714.
• [43] Ranjan, Viresh; Rasiwasia, Nikhil; Jawahar, C. V. Multi-Label Cross-modal Retrieval. 2015 IEEE International Conference on Computer Vision (ICCV), 2015: 4094-4102.
• [44] Wan, Yao; Shu, Jingdong; Sui, Yulei; Xu, Guandong; Zhao, Zhou; Wu, Jian; Yu, Philip S. Multi-Modal Attention Network Learning for Semantic Source Code Retrieval. 34th IEEE/ACM International Conference on Automated Software Engineering (ASE 2019), 2019: 13-25.
• [45] Wang, Di; Zhang, Caiping; Wang, Quan; Tian, Yumin; He, Lihuo; Zhao, Lin. Hierarchical Semantic Structure Preserving Hashing for Cross-Modal Retrieval. IEEE Transactions on Multimedia, 2023, 25: 1217-1229.
• [46] Nian, Fudong; Ding, Ling; Hu, Yuxia; Gu, Yanhong. Multi-Level Cross-Modal Semantic Alignment Network for Video-Text Retrieval. Mathematics, 2022, 10(18).
• [47] Sun, Chunpu; Zhang, Huaxiang; Liu, Li; Liu, Dongmei; Wang, Lin. Multi-label adversarial fine-grained cross-modal retrieval. Signal Processing: Image Communication, 2023, 117.
• [48] Xu, Xing; Song, Jingkuan; Lu, Huimin; Yang, Yang; Shen, Fumin; Huang, Zi. Modal-adversarial Semantic Learning Network for Extendable Cross-modal Retrieval. ICMR '18: Proceedings of the 2018 ACM International Conference on Multimedia Retrieval, 2018: 46-54.
• [49] Zolfaghari, Mohammadreza; Zhu, Yi; Gehler, Peter; Brox, Thomas. CrossCLR: Cross-modal Contrastive Learning for Multi-modal Video Representations. 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021: 1430-1439.
• [50] Wang, Jie; Yang, Yan; Jiang, Yongquan; Ma, Minbo; Xie, Zhuyang; Li, Tianrui. Cross-modal incongruity aligning and collaborating for multi-modal sarcasm detection. Information Fusion, 2024, 103.