Complementarity is the king: Multi-modal and multi-grained hierarchical semantic enhancement network for cross-modal retrieval

Cited by: 6
Authors
Pei, Xinlei [1 ,2 ]
Liu, Zheng [1 ,2 ]
Gao, Shanshan [1 ,2 ]
Su, Yijun [3 ]
Affiliations
[1] Shandong Univ Finance & Econ, Sch Comp Sci & Technol, Jinan 250014, Shandong, Peoples R China
[2] Shandong Univ Finance & Econ, Shandong Prov Key Lab Digital Media Technol, Jinan 250014, Shandong, Peoples R China
[3] Minzu Univ China, Sch Informat Engn, Beijing 100081, Peoples R China
Keywords
Cross-modal retrieval; Primary similarity; Auxiliary similarity; Semantic enhancement; Multi-spring balance loss
DOI
10.1016/j.eswa.2022.119415
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Cross-modal retrieval takes a query from one modality to retrieve relevant results from another, and its key issue lies in how to learn the cross-modal similarity. Note that the complete semantic information of a specific concept is widely scattered over multi-modal and multi-grained data, and most existing methods cannot capture it thoroughly enough to learn the cross-modal similarity accurately. Therefore, we propose a Multi-modal and Multi-grained Hierarchical Semantic Enhancement network (M2HSE), which contains two stages to obtain more complete semantic information by fusing the complementarity of multi-modal and multi-grained data. In stage 1, two classes of cross-modal similarity (primary similarity and auxiliary similarity) are calculated more comprehensively in two subnetworks. In particular, the primary similarities from the two subnetworks are fused to perform the cross-modal retrieval, while the auxiliary similarity provides a valuable complement to the primary similarity. In stage 2, the multi-spring balance loss is proposed to optimize the cross-modal similarity more flexibly. Utilizing this loss, the most representative samples are selected to establish the multi-spring balance system, which adaptively optimizes the cross-modal similarities until they reach an equilibrium state. Extensive experiments conducted on public benchmark datasets demonstrate the effectiveness of the proposed method and show its competitive performance against state-of-the-art approaches.
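The abstract does not give the exact formulation of the multi-spring balance loss, but the spring analogy (forces that grow with displacement and vanish at equilibrium) suggests a Hooke's-law-style energy over similarity scores. The sketch below is a hypothetical illustration of that idea, not the paper's actual loss: the function name `multi_spring_balance_loss`, the stiffness parameters `k_pos`/`k_neg`, and the rest-position choices (positive pairs pulled toward similarity 1, negative pairs penalized only above a margin) are all assumptions made for illustration.

```python
import numpy as np

def multi_spring_balance_loss(pos_sim, neg_sims, margin=0.3, k_pos=1.0, k_neg=1.0):
    """Hypothetical spring-inspired loss over cosine-style similarities.

    pos_sim  : similarity of the matched (positive) cross-modal pair, in [-1, 1]
    neg_sims : array of similarities for mismatched (negative) pairs
    margin   : rest position for negative "springs"; no force below it
    k_pos/k_neg : spring stiffness for positive/negative pairs (assumed)
    """
    # Positive spring: quadratic energy grows as pos_sim is displaced below 1.
    pos_energy = k_pos * (1.0 - pos_sim) ** 2
    # Negative springs: stretched (and thus penalized) only when a negative
    # similarity exceeds the margin; at or below the margin the force is zero.
    neg_energy = k_neg * np.sum(np.maximum(neg_sims - margin, 0.0) ** 2)
    # At equilibrium (pos_sim == 1, all neg_sims <= margin) the loss is zero.
    return pos_energy + neg_energy
```

Under these assumptions the loss reaches its equilibrium value of zero exactly when the positive pair is fully aligned and every negative pair sits at or below the margin, mirroring the "optimize until reaching the equilibrium state" behavior the abstract describes.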
Pages: 21
Related papers (50 total)
  • [21] Multi-modal Subspace Learning with Dropout regularization for Cross-modal Recognition and Retrieval
    Cao, Guanqun
    Waris, Muhammad Adeel
    Iosifidis, Alexandros
    Gabbouj, Moncef
    2016 SIXTH INTERNATIONAL CONFERENCE ON IMAGE PROCESSING THEORY, TOOLS AND APPLICATIONS (IPTA), 2016,
  • [22] Multi-modal Subspace Learning with Joint Graph Regularization for Cross-modal Retrieval
    Wang, Kaiye
    Wang, Wei
    He, Ran
    Wang, Liang
    Tan, Tieniu
    2013 SECOND IAPR ASIAN CONFERENCE ON PATTERN RECOGNITION (ACPR 2013), 2013, : 236 - 240
  • [23] CA_DeepSC: Cross-Modal Alignment for Multi-Modal Semantic Communications
    Wang, Wenjun
    Liu, Minghao
    Chen, Mingkai
    IEEE CONFERENCE ON GLOBAL COMMUNICATIONS, GLOBECOM, 2023, : 5871 - 5876
  • [24] Multi-Modal Medical Image Matching Based on Multi-Task Learning and Semantic-Enhanced Cross-Modal Retrieval
    Zhang, Yilin
    TRAITEMENT DU SIGNAL, 2023, 40 (05) : 2041 - 2049
  • [25] Cross-modal attention for multi-modal image registration
    Song, Xinrui
    Chao, Hanqing
    Xu, Xuanang
    Guo, Hengtao
    Xu, Sheng
    Turkbey, Baris
    Wood, Bradford J.
    Sanford, Thomas
    Wang, Ge
    Yan, Pingkun
    MEDICAL IMAGE ANALYSIS, 2022, 82
  • [26] A Framework for Enabling Unpaired Multi-Modal Learning for Deep Cross-Modal Hashing Retrieval
    Williams-Lekuona, Mikel
    Cosma, Georgina
    Phillips, Iain
    JOURNAL OF IMAGING, 2022, 8 (12)
  • [27] Disambiguity and Alignment: An Effective Multi-Modal Alignment Method for Cross-Modal Recipe Retrieval
    Zou, Zhuoyang
    Zhu, Xinghui
    Zhu, Qinying
    Zhang, Hongyan
    Zhu, Lei
    FOODS, 2024, 13 (11)
  • [28] Adversarial Cross-modal Domain Adaptation for Multi-modal Semantic Segmentation in Autonomous Driving
    Shi, Mengqi
    Cao, Haozhi
    Xie, Lihua
    Yang, Jianfei
    2022 17TH INTERNATIONAL CONFERENCE ON CONTROL, AUTOMATION, ROBOTICS AND VISION (ICARCV), 2022, : 850 - 855
  • [29] Multi-Modal Sarcasm Detection via Cross-Modal Graph Convolutional Network
    Liang, Bin
    Lou, Chenwei
    Li, Xiang
    Yang, Min
    Gui, Lin
    He, Yulan
    Pei, Wenjie
    Xu, Ruifeng
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 1767 - 1777
  • [30] Cross-modal generative models for multi-modal plastic sorting
    Neo, Edward R. K.
    Low, Jonathan S. C.
    Goodship, Vannessa
    Coles, Stuart R.
    Debattista, Kurt
    JOURNAL OF CLEANER PRODUCTION, 2023, 415