Fine-Grained Correlation Learning with Stacked Co-attention Networks for Cross-Modal Information Retrieval

Cited: 0
Authors
Lu, Yuhang [1 ,2 ]
Yu, Jing [1 ]
Liu, Yanbing [1 ]
Tan, Jianlong [1 ]
Guo, Li [1 ]
Zhang, Weifeng [3 ,4 ]
Affiliations
[1] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing, Peoples R China
[3] Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Hangzhou, Peoples R China
[4] Zhejiang Future Technol Inst, Jiaxing, Peoples R China
Keywords
Stacked co-attention network; Graph convolution; Fine-grained cross-modal correlation;
DOI
10.1007/978-3-319-99365-2_19
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cross-modal retrieval provides a flexible way to find semantically relevant information across modalities given a query in one modality. The main challenge is measuring the similarity between data of different modalities. In general, different modalities carry unequal amounts of information when describing the same semantics; for example, textual descriptions often contain background information that images cannot convey, and vice versa. Existing works mostly map global features from different modalities into a common semantic space to measure their similarity, ignoring these imbalanced and complementary relationships. In this paper, we propose stacked co-attention networks (SCANet) to progressively learn the mutually attended features of different modalities and leverage these fine-grained correlations to enhance cross-modal retrieval performance. SCANet adopts a dual-path end-to-end framework to jointly learn the multimodal representations, stacked co-attention, and similarity metric. Experimental results on three widely used benchmark datasets verify that SCANet outperforms state-of-the-art methods, with a 19% average improvement in MAP in the best case.
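The co-attention idea summarized in the abstract, in which each modality attends over the other and the attended features are refined layer by layer, can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the residual stacking, mean pooling, and cosine similarity metric are assumptions made for the example, and the learned projection weights of the real network are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(img, txt):
    """One co-attention step: each modality attends to the other.

    img: (k, d) image region features; txt: (n, d) text word features.
    Returns mutually attended features with the same shapes.
    """
    affinity = txt @ img.T               # (n, k) word-region affinity scores
    txt_att = softmax(affinity, axis=0)  # attention over words, per region
    img_att = softmax(affinity, axis=1)  # attention over regions, per word
    img_ctx = txt_att.T @ txt            # (k, d) text-attended image features
    txt_ctx = img_att @ img              # (n, d) image-attended text features
    return img_ctx, txt_ctx

def stacked_co_attention(img, txt, num_layers=2):
    """Stack co-attention layers, progressively refining each modality
    with information attended from the other (residual update is an
    assumption of this sketch)."""
    for _ in range(num_layers):
        img_ctx, txt_ctx = co_attention(img, txt)
        img = img + img_ctx
        txt = txt + txt_ctx
    # pool to global representations for similarity scoring
    return img.mean(axis=0), txt.mean(axis=0)

rng = np.random.default_rng(0)
v, t = stacked_co_attention(rng.normal(size=(36, 64)),
                            rng.normal(size=(12, 64)))
sim = v @ t / (np.linalg.norm(v) * np.linalg.norm(t))  # cosine similarity
```

In a trained model the affinity matrix would be computed from learned projections and the pooled vectors fed to a similarity metric optimized end-to-end; here random features simply demonstrate the shapes and data flow.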
Pages: 213-225
Number of pages: 13
Related Papers
50 records
  • [31] Progressive Co-Attention Network for Fine-Grained Visual Classification
    Zhang, Tian
    Chang, Dongliang
    Ma, Zhanyu
    Guo, Jun
    2021 INTERNATIONAL CONFERENCE ON VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP), 2021,
  • [32] Cross-modal knowledge learning with scene text for fine-grained image classification
    Xiong, Li
    Mao, Yingchi
    Wang, Zicheng
    Nie, Bingbing
    Li, Chang
    IET IMAGE PROCESSING, 2024, 18 (06) : 1447 - 1459
  • [33] Deep Self-Supervised Hashing With Fine-Grained Similarity Mining for Cross-Modal Retrieval
    Han, Lijun
    Wang, Renlin
    Chen, Chunlei
    Zhang, Huihui
    Zhang, Yujie
    Zhang, Wenfeng
    IEEE ACCESS, 2024, 12 : 31756 - 31770
  • [34] Modal Invariance Feature Learning and Consistent Fine-Grained Information Mining Based Cross-Modal Person Re-identification
    Shi, Linbo
    Li, Huafeng
    Zhang, Yafei
    Xie, Minghong
    Moshi Shibie yu Rengong Zhineng/Pattern Recognition and Artificial Intelligence, 2022, 35 (12): : 1064 - 1077
  • [35] TECMH: Transformer-Based Cross-Modal Hashing For Fine-Grained Image-Text Retrieval
    Li, Qiqi
    Ma, Longfei
    Jiang, Zheng
    Li, Mingyong
    Jin, Bo
    CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 75 (02): : 3713 - 3728
  • [36] Multi-grained Representation Learning for Cross-modal Retrieval
    Zhao, Shengwei
    Xu, Linhai
    Liu, Yuying
    Du, Shaoyi
    PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023, 2023, : 2194 - 2198
  • [37] Integration of Global and Local Representations for Fine-Grained Cross-Modal Alignment
    Jin, Seungwan
    Choi, Hoyoung
    Noh, Taehyung
    Han, Kyungsik
    COMPUTER VISION - ECCV 2024, PT LXXXIII, 2025, 15141 : 53 - 70
  • [38] Cross-Modal Fine-Grained Interaction Fusion in Fake News Detection
    Che, Zhanbin
    Cui, GuangBo
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2024, 15 (05) : 945 - 956
  • [39] DCMA-Net: dual cross-modal attention for fine-grained few-shot recognition
    Zhou, Yan
    Ren, Xiao
    Li, Jianxun
    Yang, Yin
    Zhou, Haibin
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (05) : 14521 - 14537