Fast, Accurate, and Lightweight Memory-Enhanced Embedding Learning Framework for Image-Text Retrieval

Cited by: 4

Authors:

Li, Zhe [1]

Zhang, Lei [2]

Zhang, Kun [2]

Zhang, Yongdong [2,3]

Mao, Zhendong [1,3]

Affiliations:

[1] Univ Sci & Technol China, Sch Cyberspace Sci & Technol, Hefei 230027, Peoples R China

[2] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230022, Peoples R China

[3] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230022, Peoples R China

Source:

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY | 2024, Vol. 34, No. 07

Keywords:

Image-text retrieval; independent-embedding approaches; cross-modal association graph; memory network; ATTENTION; NETWORK

DOI:

10.1109/TCSVT.2024.3358411

Chinese Library Classification (CLC):

TM [Electrical engineering]; TN [Electronic and communication technology]

Discipline classification codes:

0808; 0809

Abstract:

Image-text retrieval is a fundamental task in bridging the semantics of vision and language. The key challenge lies in learning the semantic alignment between two heterogeneous modalities accurately and efficiently. Existing image-text retrieval approaches fall roughly into two paradigms. The first, the independent-embedding paradigm, learns global embeddings for the two modalities; it achieves efficient retrieval but fails to capture fine-grained cross-modal interaction between images and texts. The second, the interactive-embedding paradigm, learns fine-grained alignment between regions and words; it achieves accurate retrieval at the cost of retrieval efficiency. In this paper, we propose a novel Independent Memory-Enhanced emBedding learning framework (IMEB), which introduces a lightweight middleware, i.e., a memory network, into independent-embedding approaches to exploit the complementary strengths of both paradigms. Specifically, in the training stage, we first propose a novel cross-modal association graph to learn fine-grained cross-modal interaction information. We then design a memory-assisted embedding learning network that stores the prototypical features obtained after interaction as agents, and update the memory network via two learning strategies. Finally, in the inference stage, we interact directly with these agent-level prototypical features from the memory bank, thus efficiently obtaining cross-modal memory-enhanced embeddings. In this way, our model learns cross-modal interaction information effectively while maintaining retrieval efficiency. Extensive experimental results on two benchmarks, Flickr30K and MS-COCO, demonstrate that IMEB performs favorably against state-of-the-art methods.
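The inference-stage idea in the abstract (each embedding interacts with a shared memory bank, never with the other modality) can be illustrated with a minimal sketch. It is not the IMEB architecture: the softmax read-out, the residual addition, the `temperature` parameter, the function names, and the hand-written two-slot memory bank are all assumptions made for illustration. In the paper the memory is learned during training.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def memory_enhance(embedding, memory_bank, temperature=1.0):
    """Hypothetical read-out: attend over memory slots, add the result
    to the embedding, and re-normalize. The cost depends only on the
    number of slots, not on the number of items in the other modality."""
    weights = softmax([dot(embedding, slot) / temperature for slot in memory_bank])
    dim = len(embedding)
    read_out = [sum(w * slot[i] for w, slot in zip(weights, memory_bank))
                for i in range(dim)]
    return l2_normalize([e + r for e, r in zip(embedding, read_out)])

def rank_by_cosine(query, candidates):
    """Rank candidates by cosine similarity (inputs are unit-length)."""
    scores = [dot(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

# Toy data (assumed): two prototype slots, one text query, two images.
memory_bank = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_query = memory_enhance(l2_normalize([0.9, 0.1, 0.0]), memory_bank)
images = [memory_enhance(l2_normalize(v), memory_bank)
          for v in ([0.0, 1.0, 0.2], [1.0, 0.0, 0.1])]
ranking = rank_by_cosine(text_query, images)
print(ranking)
```

Because each image embedding is enhanced independently of any query, the enhanced image embeddings can be computed once and cached, which is the property that lets an independent-embedding pipeline keep its retrieval speed.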


Pages: 6542-6558

Page count: 17

