Inflate and Shrink: Enriching and Reducing Interactions for Fast Text-Image Retrieval

Cited by: 0

Authors:

Liu, Haoliang ^{[1]}

Yu, Tan

Li, Ping

Affiliations:

[1] Baidu Res, Cognt Comp Lab, 10 Xibeiwang East Rd, Beijing 100193, Peoples R China

Source:

2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021) | 2021

Keywords:

LANGUAGE; VISION;

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

By exploiting the cross-modal attention, cross-BERT methods have achieved state-of-the-art accuracy in cross-modal retrieval. Nevertheless, the heavy text-image interactions in the cross-BERT model are prohibitively slow for large-scale retrieval. Late-interaction methods trade off retrieval accuracy and efficiency by exploiting cross-modal interaction only in the late stage, attaining a satisfactory retrieval speed. In this work, we propose an inflating and shrinking approach to further boost the efficiency and accuracy of late-interaction methods. The inflating operation plugs several codes in the input of the encoder to exploit the text-image interactions more thoroughly for higher retrieval accuracy. Then the shrinking operation gradually reduces the text-image interactions through knowledge distilling for higher efficiency. Through an inflating operation followed by a shrinking operation, both efficiency and accuracy of a late-interaction model are boosted. Systematic experiments on public benchmarks demonstrate the effectiveness of our inflating and shrinking approach.

Pages: 9796 - 9809

Number of pages: 14

Related records: 50 in total

[21] Remote Sensing Cross-Modal Text-Image Retrieval Based on Attention Correction and Filtering
Yang, Xiaoyu
Li, Chao
Wang, Zhiming
Xie, Hao
Mao, Junyi
Yin, Guangqiang
REMOTE SENSING, 2025, 17 (03)
[22] BIMCV-R: A Landmark Dataset for 3D CT Text-Image Retrieval
Chen, Yinda
Liu, Che
Liu, Xiaoyu
Arcucci, Rossella
Xiong, Zhiwei
MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2024, PT XI, 2024, 15011 : 124 - 134
[23] Invisible Relevance Bias: Text-Image Retrieval Models Prefer AI-Generated Images
Xu, Shicheng
Hou, Danyang
Pang, Liang
Deng, Jingcheng
Xu, Jun
Shen, Huawei
Cheng, Xueqi
PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024, 2024, : 208 - 217
[24] Flickr30K-CFQ: A Compact and Fragmented Query Dataset for Text-image Retrieval
Liu, Haoyu
Song, Yaoxian
Wang, Xuwu
Zhu, Xiangru
Li, Zhixu
Song, Wei
Lie, Tiefeng
DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, DASFAA 2024, PT 3, 2025, 14852 : 419 - 434
[25] Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information
Yuan, Zhiqiang
Zhang, Wenkai
Tian, Changyuan
Rong, Xuee
Zhang, Zhengyuan
Wang, Hongqi
Fu, Kun
Sun, Xian
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2022, 60
[26] Learning Text-image Joint Embedding for Efficient Cross-modal Retrieval with Deep Feature Engineering
Xie, Zhongwei
Liu, Ling
Wu, Yanzhao
Zhong, Luo
Li, Lin
ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2022, 40 (04)
[27] CLCP: Realtime Text-Image Retrieval for Retailing via Pre-trained Clustering and Priority Queue
Zhang, Shuyang
Wei, Liangwu
Wang, Qingyu
Wei, Yuntao
Song, Yanzhi
PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024, 2024, : 1089 - 1093
[28] Cross-modal semantic aligning and neighbor-aware completing for robust text-image person retrieval
Gong, Tiantian
Wang, Junsheng
Zhang, Liyan
INFORMATION FUSION, 2024, 112
[29] Hypersphere-Based Remote Sensing Cross-Modal Text-Image Retrieval via Curriculum Learning
Zhang, Weihang
Li, Jihao
Li, Shuoke
Chen, Jialiang
Zhang, Wenkai
Gao, Xin
Sun, Xian
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
[30] Efficient text-image semantic search: A multi-modal vision-language approach for fashion retrieval
Moro, Gianluca
Salvatori, Stefano
Frisoni, Giacomo
NEUROCOMPUTING, 2023, 538
