Fast, Accurate, and Lightweight Memory-Enhanced Embedding Learning Framework for Image-Text Retrieval

Cited by: 4

Authors:

Li, Zhe [1]

Zhang, Lei [2]

Zhang, Kun [2]

Zhang, Yongdong [2,3]

Mao, Zhendong [1,3]

Affiliations:

[1] Univ Sci & Technol China, Sch Cyberspace Sci & Technol, Hefei 230027, Peoples R China

[2] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230022, Peoples R China

[3] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230022, Peoples R China

Source:

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY | 2024 / Vol. 34 / Issue 07

Keywords:

Image-text retrieval; independent-embedding approaches; cross-modal association graph; memory network; ATTENTION; NETWORK;

DOI:

10.1109/TCSVT.2024.3358411

Chinese Library Classification:

TM [Electrical engineering]; TN [Electronics and communication technology]

Subject classification codes:

0808; 0809

Abstract:

Image-text retrieval is a fundamental task in bridging the semantics of vision and language. The key challenge lies in learning the semantic alignment between the two heterogeneous modalities accurately and efficiently. Existing image-text retrieval approaches can be roughly classified into two paradigms. The first, the independent-embedding paradigm, learns global embeddings of the two modalities; it achieves efficient retrieval but fails to capture the fine-grained cross-modal interaction between images and texts. The second, the interactive-embedding paradigm, learns fine-grained alignment between regions and words; it achieves accurate retrieval at the cost of retrieval efficiency. In this paper, we propose a novel Independent Memory-Enhanced emBedding learning framework (IMEB), which introduces a lightweight middleware, i.e., a memory network, into independent-embedding approaches to exploit the complementary strengths of both paradigms. Specifically, in the training stage, we first propose a novel cross-modal association graph to learn fine-grained cross-modal interaction information. We then design a memory-assisted embedding learning network that stores the prototypical features obtained after interaction as agents, and we update the memory network via two learning strategies. Finally, in the inference stage, we interact directly with these agent-level prototypical features from the memory bank, thus efficiently obtaining cross-modal memory-enhanced embeddings. In this way, our model learns cross-modal interaction information while maintaining retrieval efficiency. Extensive experimental results on two benchmarks, i.e., Flickr30K and MS-COCO, demonstrate that IMEB performs favorably against state-of-the-art methods.
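
The inference-stage idea in the abstract (enhancing an independently computed embedding by reading from a fixed memory bank, then ranking by similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how embeddings interact with the memory, so the softmax attention read, the residual addition, the cosine scoring, and all names (memory_enhance, rank_texts, memory_bank, temperature) are assumptions made here for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def memory_enhance(embedding, memory_bank, temperature=1.0):
    """Assumed interaction: attend over memory slots, add the read-out, normalize."""
    weights = softmax([dot(embedding, slot) / temperature for slot in memory_bank])
    read_out = [
        sum(w * slot[i] for w, slot in zip(weights, memory_bank))
        for i in range(len(embedding))
    ]
    return l2_normalize([e + r for e, r in zip(embedding, read_out)])

def rank_texts(image_embedding, text_embeddings, memory_bank):
    """Rank texts by cosine similarity of memory-enhanced embeddings."""
    query = memory_enhance(image_embedding, memory_bank)
    scores = [dot(query, memory_enhance(t, memory_bank)) for t in text_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy data (hypothetical): two memory slots, one image, two candidate texts.
memory_bank = [[1.0, 0.0], [0.0, 1.0]]
image = [0.9, 0.1]
texts = [[0.1, 0.9], [0.8, 0.2]]
ranking = rank_texts(image, texts, memory_bank)
print(ranking)  # text 1 ranks ahead of text 0
```

In this sketch each embedding is enhanced independently of the other modality's input, so text embeddings could be precomputed offline. That independence is what keeps retrieval cost close to a plain independent-embedding model.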

Pages: 6542-6558

Page count: 17

50 items in total

[1] Enhanced Semantic Similarity Learning Framework for Image-Text Matching
Zhang, Kun
Hu, Bo
Zhang, Huatian
Li, Zhe
Mao, Zhendong
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (04) : 2973 - 2988
[2] MKVSE: Multimodal Knowledge Enhanced Visual-semantic Embedding for Image-text Retrieval
Feng, Duoduo
He, Xiangteng
Peng, Yuxin
ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (05)
[3] Compositional Learning of Image-Text Query for Image Retrieval
Anwaar, Muhammad Umer
Labintcev, Egor
Kleinsteuber, Martin
2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2021), 2021, : 1139 - 1148
[4] Learning Multi-view Embedding in Joint Space for Bidirectional Image-Text Retrieval
Ran, Lu
Wang, Wenmin
2017 IEEE VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP), 2017,
[5] An Enhanced Feature Extraction Framework for Cross-Modal Image-Text Retrieval
Zhang, Jinzhi
Wang, Luyao
Zheng, Fuzhong
Wang, Xu
Zhang, Haisu
REMOTE SENSING, 2024, 16 (12)
[6] Action-Aware Embedding Enhancement for Image-Text Retrieval
Li, Jiangtong
Niu, Li
Zhang, Liqing
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 1323 - 1331
[7] Estimating the Semantics via Sector Embedding for Image-Text Retrieval
Wang, Zheng
Gao, Zhenwei
Han, Mengqun
Yang, Yang
Shen, Heng Tao
IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 10342 - 10353
[8] Learning hierarchical embedding space for image-text matching
Sun, Hao
Qin, Xiaolin
Liu, Xiaojing
INTELLIGENT DATA ANALYSIS, 2024, 28 (03) : 647 - 665
[9] Image-Text Embedding with Hierarchical Knowledge for Cross-Modal Retrieval
Seo, Sanghyun
Kim, Juntae
PROCEEDINGS OF 2018 THE 2ND INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE (CSAI 2018) / 2018 THE 10TH INTERNATIONAL CONFERENCE ON INFORMATION AND MULTIMEDIA TECHNOLOGY (ICIMT 2018), 2018, : 350 - 353
[10] Image-Text Cross-Modal Retrieval with Instance Contrastive Embedding
Zeng, Ruigeng
Ma, Wentao
Wu, Xiaoqian
Liu, Wei
Liu, Jie
ELECTRONICS, 2024, 13 (02)
