Fast, Accurate, and Lightweight Memory-Enhanced Embedding Learning Framework for Image-Text Retrieval

Cited by: 4
Authors
Li, Zhe [1]
Zhang, Lei [2]
Zhang, Kun [2]
Zhang, Yongdong [2, 3]
Mao, Zhendong [1, 3]
Affiliations
[1] Univ Sci & Technol China, Sch Cyberspace Sci & Technol, Hefei 230027, Peoples R China
[2] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230022, Peoples R China
[3] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230022, Peoples R China
Keywords
Image-text retrieval; independent-embedding approaches; cross-modal association graph; memory network; ATTENTION; NETWORK
DOI
10.1109/TCSVT.2024.3358411
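Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification code
0808; 0809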
Abstract
Image-text retrieval is a fundamental task in bridging the semantics of vision and language. The key challenge lies in accurately and efficiently learning the semantic alignment between two heterogeneous modalities. Existing image-text retrieval approaches can be roughly classified into two paradigms. The first, the independent-embedding paradigm, learns global embeddings of the two modalities; it achieves efficient retrieval but fails to effectively capture the fine-grained cross-modal interaction information between images and texts. The second, the interactive-embedding paradigm, learns fine-grained alignment between regions and words; it achieves accurate retrieval but sacrifices retrieval efficiency. In this paper, we propose a novel Independent Memory-Enhanced emBedding learning framework (IMEB), which introduces a lightweight middleware, i.e., a memory network, into independent-embedding approaches to exploit the complementary strengths of both paradigms. Specifically, in the training stage, we first propose a novel cross-modal association graph to learn fine-grained cross-modal interaction information. We then carefully design a memory-assisted embedding learning network that stores the prototypical features obtained after interaction as agents, and we effectively update the memory network via two learning strategies. Finally, in the inference stage, we interact directly with these agent-level prototypical features from the memory bank, thus efficiently obtaining cross-modal memory-enhanced embeddings. In this way, our model not only effectively learns cross-modal interaction information but also maintains retrieval efficiency. Extensive experimental results on two benchmarks, i.e., Flickr30K and MS-COCO, demonstrate that our IMEB performs favorably against state-of-the-art methods.
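The inference-stage idea in the abstract (each modality is embedded independently, then enriched by reading from a shared memory bank of prototypes) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no architectural details, so the softmax-attention read, the residual fusion, the `temperature` parameter, and the names `memory_enhance` and `retrieve` are all assumptions made for illustration, and the memory bank here is random rather than learned.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def memory_enhance(embeddings, memory, temperature=0.1):
    """Enrich (N, D) global embeddings by attending over (K, D) prototypes.

    Assumed mechanism (hypothetical): cosine-similarity attention over the
    memory slots, followed by a residual add and re-normalization.
    """
    queries = l2_normalize(embeddings)
    slots = l2_normalize(memory)
    attention = softmax(queries @ slots.T / temperature, axis=-1)  # (N, K)
    read = attention @ slots                                       # (N, D)
    return l2_normalize(queries + read)


def retrieve(image_emb, text_emb, top_k=1):
    """Rank texts for each image by dot product (cosine for unit vectors)."""
    similarity = image_emb @ text_emb.T
    return np.argsort(-similarity, axis=1)[:, :top_k]


# Toy data: 5 image/text pairs in a 16-dimensional space, 8 memory slots.
rng = np.random.default_rng(0)
memory = rng.normal(size=(8, 16))  # stands in for a learned memory bank
images = rng.normal(size=(5, 16))
texts = images + 0.05 * rng.normal(size=(5, 16))

# Each modality is enhanced separately, so no image-text pair is ever
# processed jointly at query time.
image_emb = memory_enhance(images, memory)
text_emb = memory_enhance(texts, memory)
ranks = retrieve(image_emb, text_emb, top_k=1)
```

Because the memory read depends only on the item being embedded, the enhanced embeddings of a gallery can be computed once and cached, and retrieval reduces to a single matrix product, which is the efficiency property the independent-embedding paradigm relies on.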
Pages: 6542-6558
Number of pages: 17