Fine-Grained Correlation Learning with Stacked Co-attention Networks for Cross-Modal Information Retrieval

Cited by: 0
Authors
Lu, Yuhang [1, 2]
Yu, Jing [1]
Liu, Yanbing [1]
Tan, Jianlong [1]
Guo, Li [1]
Zhang, Weifeng [3, 4]
Affiliations
[1] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing, Peoples R China
[3] Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Hangzhou, Peoples R China
[4] Zhejiang Future Technol Inst, Jiaxing, Peoples R China
Keywords
Stacked co-attention network; Graph convolution; Fine-grained cross-modal correlation
DOI
10.1007/978-3-319-99365-2_19
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cross-modal retrieval provides a flexible way to find semantically relevant information across different modalities given a query in one modality. The main challenge is to measure the similarity between data of different modalities. Generally, different modalities contain unequal amounts of information when describing the same semantics. For example, textual descriptions often contain more background information that cannot be conveyed by images, and vice versa. Existing works mostly map global data features from different modalities into a common semantic space to measure their similarity, which ignores their imbalanced and complementary relationships. In this paper, we propose stacked co-attention networks (SCANet) to progressively learn the mutually attended features of different modalities and leverage these fine-grained correlations to enhance cross-modal retrieval performance. SCANet adopts a dual-path end-to-end framework to jointly learn the multimodal representations, stacked co-attention, and similarity metric. Experimental results on three widely used benchmark datasets verify that SCANet outperforms state-of-the-art methods, with an average MAP improvement of 19% in the best case.
Pages: 213-225
Number of pages: 13
Related papers
50 in total
  • [41] Enhancing structure modeling for relation extraction with fine-grained gating and co-attention
    Chen, Yubo
    Wu, Chuhan
    Huang, Yongfeng
    NEUROCOMPUTING, 2022, 467 : 282 - 291
  • [42] Cross-modal recipe retrieval via parallel- and cross-attention networks learning
    Cao, Da
    Chu, Jingjing
    Zhu, Ningbo
    Nie, Liqiang
    KNOWLEDGE-BASED SYSTEMS, 2020, 193
  • [43] A Fine-Grained Semantic Alignment Method Specific to Aggregate Multi-Scale Information for Cross-Modal Remote Sensing Image Retrieval
    Zheng, Fuzhong
    Wang, Xu
    Wang, Luyao
    Zhang, Xiong
    Zhu, Hongze
    Wang, Long
    Zhang, Haisu
    SENSORS, 2023, 23 (20)
  • [44] Graph Embedding Learning for Cross-Modal Information Retrieval
    Zhang, Youcai
    Gu, Xiaodong
    NEURAL INFORMATION PROCESSING (ICONIP 2017), PT III, 2017, 10636 : 594 - 601
  • [45] Cross-Modal Multistep Fusion Network With Co-Attention for Visual Question Answering
    Lao, Mingrui
    Guo, Yanming
    Wang, Hui
    Zhang, Xin
    IEEE ACCESS, 2018, 6 : 31516 - 31524
  • [46] A Jointly Guided Deep Network for Fine-Grained Cross-Modal Remote Sensing Text-Image Retrieval
    Yang, Lei
    Feng, Yong
    Zhou, Mingling
    Xiong, Xiancai
    Wang, Yongheng
    Qiang, Baohua
    JOURNAL OF CIRCUITS SYSTEMS AND COMPUTERS, 2023, 32 (13)
  • [47] Stacked cross-modal feature consolidation attention networks for image captioning
    Pourkeshavarz, Mozhgan
    Nabavi, Shahabedin
    Moghaddam, Mohsen Ebrahimi
    Shamsfard, Mehrnoush
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (04) : 12209 - 12233
  • [49] Fine-grained image retrieval by combining attention mechanism and context information
    Li, Xiaoqing
    Ma, Jinwen
    NEURAL COMPUTING & APPLICATIONS, 2023, 35 (02) : 1881 - 1897