Fast, Accurate, and Lightweight Memory-Enhanced Embedding Learning Framework for Image-Text Retrieval

Cited by: 4
Authors
Li, Zhe [1]
Zhang, Lei [2]
Zhang, Kun [2]
Zhang, Yongdong [2, 3]
Mao, Zhendong [1, 3]
Affiliations
[1] Univ Sci & Technol China, Sch Cyberspace Sci & Technol, Hefei 230027, Peoples R China
[2] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230022, Peoples R China
[3] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230022, Peoples R China
Keywords
Image-text retrieval; independent-embedding approaches; cross-modal association graph; memory network; ATTENTION; NETWORK;
DOI
10.1109/TCSVT.2024.3358411
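Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology];
Discipline classification code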
0808 ; 0809 ;
Abstract
Image-text retrieval is a fundamental task in bridging the semantics of vision and language. The key challenge lies in learning the semantic alignment between the two heterogeneous modalities both accurately and efficiently. Existing image-text retrieval approaches fall roughly into two paradigms. The first, the independent-embedding paradigm, learns global embeddings of the two modalities; it achieves efficient retrieval but fails to capture the fine-grained cross-modal interaction between images and texts. The second, the interactive-embedding paradigm, learns fine-grained alignment between regions and words; it achieves accurate retrieval at the cost of retrieval efficiency. In this paper, we propose a novel Independent Memory-Enhanced emBedding learning framework (IMEB), which introduces a lightweight middleware, i.e., a memory network, into independent-embedding approaches to exploit the complementary strengths of both paradigms. Specifically, in the training stage, we first propose a novel cross-modal association graph to learn fine-grained cross-modal interaction information. We then design a memory-assisted embedding learning network that stores the prototypical features obtained after interaction as agents, and update the memory network via two learning strategies. Finally, in the inference stage, we interact directly with these agent-level prototypical features from the memory bank, thus efficiently obtaining cross-modal memory-enhanced embeddings. In this way, our model not only learns cross-modal interaction information effectively but also maintains retrieval efficiency. Extensive experimental results on two benchmarks, Flickr30K and MS-COCO, demonstrate that IMEB performs favorably against state-of-the-art methods.
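The inference-stage idea in the abstract (enriching an independently computed global embedding by reading from a memory bank of prototypical features, then ranking by similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the attention-style readout, the function names `memory_enhance` and `rank_texts`, and the `temperature` and `alpha` parameters are all assumptions made for illustration, and the memory bank here is a hand-written toy rather than learned prototypes.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def memory_enhance(embedding, memory_bank, temperature=0.1, alpha=0.5):
    """Hypothetical readout: attend over memory slots and mix the result
    with the original embedding. `temperature` and `alpha` are assumed
    hyperparameters, not values from the paper."""
    query = l2_normalize(embedding)
    slots = [l2_normalize(s) for s in memory_bank]
    # Attention weights: similarity of the query to each memory slot.
    weights = softmax([dot(query, s) / temperature for s in slots])
    # Weighted sum of memory slots, one dimension at a time.
    readout = [sum(w * s[i] for w, s in zip(weights, slots))
               for i in range(len(query))]
    mixed = [(1 - alpha) * q + alpha * r for q, r in zip(query, readout)]
    return l2_normalize(mixed)

def rank_texts(image_emb, text_embs, memory_bank):
    """Rank text indices by cosine similarity to the image, with both
    sides enhanced independently (no image-text interaction at query time)."""
    img = memory_enhance(image_emb, memory_bank)
    scored = [(dot(img, memory_enhance(t, memory_bank)), i)
              for i, t in enumerate(text_embs)]
    return [i for _, i in sorted(scored, reverse=True)]

# Toy data: two memory slots and two candidate captions.
memory_bank = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image = [0.9, 0.1, 0.0]
texts = [[0.0, 1.0, 0.1], [1.0, 0.2, 0.0]]
ranking = rank_texts(image, texts, memory_bank)
print(ranking)  # → [1, 0]
```

Because each embedding is enhanced on its own, text embeddings could in principle be precomputed and cached, which is the efficiency property the independent-embedding paradigm relies on.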
Pages: 6542-6558
Number of pages: 17