Inflate and Shrink: Enriching and Reducing Interactions for Fast Text-Image Retrieval

Cited by: 0
Authors
Liu, Haoliang [1]
Yu, Tan
Li, Ping
Affiliations
[1] Baidu Res, Cognt Comp Lab, 10 Xibeiwang East Rd, Beijing 100193, Peoples R China
Keywords
LANGUAGE; VISION;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
By exploiting cross-modal attention, cross-BERT methods have achieved state-of-the-art accuracy in cross-modal retrieval. Nevertheless, the heavy text-image interactions in the cross-BERT model are prohibitively slow for large-scale retrieval. Late-interaction methods trade off retrieval accuracy against efficiency by exploiting cross-modal interaction only in the late stage, attaining a satisfactory retrieval speed. In this work, we propose an inflating and shrinking approach to further boost the efficiency and accuracy of late-interaction methods. The inflating operation plugs several codes into the input of the encoder to exploit the text-image interactions more thoroughly for higher retrieval accuracy. Then the shrinking operation gradually reduces the text-image interactions through knowledge distillation for higher efficiency. Through an inflating operation followed by a shrinking operation, both the efficiency and the accuracy of a late-interaction model are boosted. Systematic experiments on public benchmarks demonstrate the effectiveness of our inflating and shrinking approach.
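The abstract names late interaction and knowledge distillation but does not give the scoring function or the training objective. The following is only a minimal sketch under assumed choices, not the authors' method: a MaxSim-style late-interaction score (as popularized by ColBERT), a single-vector dot product as the cheapest form of interaction, and a KL-divergence loss that pushes a cheap student's ranking toward a richer teacher's ranking. All function names and the toy vectors are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(text_vecs, image_vecs):
    """Assumed MaxSim-style score: each text vector picks its
    best-matching image vector, and the maxima are summed."""
    return sum(max(dot(t, v) for v in image_vecs) for t in text_vecs)


def single_vector_score(text_vec, image_vec):
    """Cheapest interaction: one dot product per text-image pair."""
    return dot(text_vec, image_vec)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of candidate scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) over the candidate list (an assumed objective)."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy example: one query with two token vectors, two candidate images.
text_vecs = [[1.0, 0.0], [0.0, 1.0]]
image_a = [[1.0, 0.0], [0.0, 1.0]]  # two region vectors
image_b = [[0.5, 0.5]]              # one region vector

teacher = [late_interaction_score(text_vecs, image_a),
           late_interaction_score(text_vecs, image_b)]
# Student uses one pooled vector per side (here: the first vector only).
student = [single_vector_score(text_vecs[0], image_a[0]),
           single_vector_score(text_vecs[0], image_b[0])]
loss = distillation_loss(teacher, student)
```

In this sketch the teacher pays one dot product per text-vector and image-vector pair, while the student pays one per text-image pair; the loss is zero only when both induce the same distribution over candidates.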
Pages: 9796-9809
Page count: 14