Fast, Accurate, and Lightweight Memory-Enhanced Embedding Learning Framework for Image-Text Retrieval

Cited by: 4
Authors
Li, Zhe [1 ]
Zhang, Lei [2 ]
Zhang, Kun [2 ]
Zhang, Yongdong [2 ,3 ]
Mao, Zhendong [1 ,3 ]
Affiliations
[1] Univ Sci & Technol China, Sch Cyberspace Sci & Technol, Hefei 230027, Peoples R China
[2] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230022, Peoples R China
[3] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230022, Peoples R China
Keywords
Image-text retrieval; independent-embedding approaches; cross-modal association graph; memory network; ATTENTION; NETWORK;
DOI
10.1109/TCSVT.2024.3358411
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronics & Communication Technology];
Discipline Codes
0808; 0809;
Abstract
Image-text retrieval is a fundamental task in bridging the semantic gap between vision and language. The key challenge lies in accurately and efficiently learning the semantic alignment between two heterogeneous modalities. Existing image-text retrieval approaches can be roughly classified into two paradigms. The first, the independent-embedding paradigm, learns global embeddings of the two modalities, which enables efficient retrieval but fails to effectively capture fine-grained cross-modal interaction information between images and texts. The second, the interactive-embedding paradigm, learns fine-grained alignment between regions and words, which achieves accurate retrieval at the cost of retrieval efficiency. In this paper, we propose a novel Independent Memory-Enhanced emBedding learning framework (IMEB), which introduces a lightweight middleware, i.e., a memory network, into the independent-embedding approaches to exploit the complementarity of both paradigms. Specifically, in the training stage, we first propose a novel cross-modal association graph to learn fine-grained cross-modal interaction information. Then, we carefully design a memory-assisted embedding learning network that stores these post-interaction prototypical features as agents, and effectively update the memory network via two learning strategies. Finally, in the inference stage, we directly interact with these agent-level prototypical features from the memory bank, thus efficiently obtaining cross-modal memory-enhanced embeddings. In this way, our model not only effectively learns cross-modal interaction information but also maintains retrieval efficiency. Extensive experimental results on two benchmarks, i.e., Flickr30K and MS-COCO, demonstrate that our IMEB performs favorably against state-of-the-art methods.
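The core inference-time idea described in the abstract (enhancing an independent global embedding by attending over a bank of stored prototype features instead of running full cross-modal interaction) can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact IMEB design: the function names, the attention read-out, and the mixing weight `alpha` are assumptions introduced here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_enhanced_embedding(query, memory_bank, alpha=0.5):
    """Enhance an independently computed global embedding by attending
    over a memory bank of prototype ("agent") features, as a lightweight
    stand-in for full cross-modal interaction at inference time.

    query:       (d,)   global embedding of one image or text
    memory_bank: (m, d) prototype features stored during training
    alpha:       mixing weight between the original and memory read-out
    """
    scores = memory_bank @ query                 # (m,) similarity to each prototype
    weights = softmax(scores)                    # attention over prototypes
    read = weights @ memory_bank                 # (d,) weighted memory read-out
    enhanced = (1 - alpha) * query + alpha * read
    return enhanced / np.linalg.norm(enhanced)   # L2-normalize for cosine retrieval

# Usage: enhance a query embedding against a bank of 32 prototypes.
rng = np.random.default_rng(0)
bank = rng.standard_normal((32, 64))
q = rng.standard_normal(64)
e = memory_enhanced_embedding(q, bank)
```

Because the memory bank is small and fixed at inference, this keeps retrieval cost close to that of pure independent-embedding methods while injecting interaction information learned during training.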
Pages: 6542-6558 (17 pages)