Inflate and Shrink: Enriching and Reducing Interactions for Fast Text-Image Retrieval

Cited by: 0

Authors:

Liu, Haoliang [1]

Yu, Tan

Li, Ping

Affiliations:

[1] Baidu Res, Cognt Comp Lab, 10 Xibeiwang East Rd, Beijing 100193, Peoples R China

Source:

2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021) | 2021

Keywords:

LANGUAGE; VISION;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

By exploiting the cross-modal attention, cross-BERT methods have achieved state-of-the-art accuracy in cross-modal retrieval. Nevertheless, the heavy text-image interactions in the cross-BERT model are prohibitively slow for large-scale retrieval. Late-interaction methods trade off retrieval accuracy and efficiency by exploiting cross-modal interaction only in the late stage, attaining a satisfactory retrieval speed. In this work, we propose an inflating and shrinking approach to further boost the efficiency and accuracy of late-interaction methods. The inflating operation plugs several codes in the input of the encoder to exploit the text-image interactions more thoroughly for higher retrieval accuracy. Then the shrinking operation gradually reduces the text-image interactions through knowledge distilling for higher efficiency. Through an inflating operation followed by a shrinking operation, both efficiency and accuracy of a late-interaction model are boosted. Systematic experiments on public benchmarks demonstrate the effectiveness of our inflating and shrinking approach.

Pages: 9796 - 9809

Page count: 14

50 records in total

[41] CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora
Long, Zijun
Ge, Xuri
McCreadie, Richard
Jose, Joemon M.
PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024, 2024, : 2188 - 2198
[42] An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval
He, Liu
Liu, Shuyan
An, Ran
Zhuo, Yudong
Tao, Jian
MATHEMATICS, 2023, 11 (10)
[43] A Cross-Attention Mechanism Based on Regional-Level Semantic Features of Images for Cross-Modal Text-Image Retrieval in Remote Sensing
Zheng, Fuzhong
Li, Weipeng
Wang, Xu
Wang, Luyao
Zhang, Xiong
Zhang, Haisu
APPLIED SCIENCES-BASEL, 2022, 12 (23)
[44] Fast, Accurate, and Lightweight Memory-Enhanced Embedding Learning Framework for Image-Text Retrieval
Li, Zhe
Zhang, Lei
Zhang, Kun
Zhang, Yongdong
Mao, Zhendong
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07) : 6542 - 6558
[45] VideoCLIP: A Cross-Attention Model for Fast Video-Text Retrieval Task with Image CLIP
Li, Yikang
Hsiao, Jenhao
Ho, Chiuman
PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022, : 29 - 33
[46] A FAST AND ACCURATE METHOD FOR REMOTE SENSING IMAGE-TEXT RETRIEVAL BASED ON LARGE MODEL KNOWLEDGE DISTILLATION
Liao, Yu
Yang, Rui
Xie, Tao
Xing, Hantong
Quan, Dou
Wang, Shuang
Hou, Biao
IGARSS 2023 - 2023 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, 2023, : 5077 - 5080
[47] Towards Fast and Accurate Image-Text Retrieval With Self-Supervised Fine-Grained Alignment
Zhuang, Jiamin
Yu, Jing
Ding, Yang
Qu, Xiangyan
Hu, Yue
IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 1361 - 1372
[48] Towards Fast and Accurate Image-Text Retrieval with Self-Supervised Fine-Grained Alignment
Zhuang, Jiamin
Yu, Jing
Ding, Yang
Qu, Xiangyan
Hu, Yue
arXiv, 2023
[49] A fast weighted multi-view Bayesian learning scheme with deep learning for text-based image retrieval from unlabeled galleries
Oussama, Aiadi
Khaldi, Belal
Kherfi, Mohammed Lamine
MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (07) : 10795 - 10812
[50] A fast weighted multi-view Bayesian learning scheme with deep learning for text-based image retrieval from unlabeled galleries
Aiadi Oussama
Belal Khaldi
Mohammed Lamine Kherfi
Multimedia Tools and Applications, 2023, 82 : 10795 - 10812
