Inflate and Shrink: Enriching and Reducing Interactions for Fast Text-Image Retrieval

Cited by: 0
Authors
Liu, Haoliang [1]
Yu, Tan
Li, Ping
Affiliations
[1] Baidu Research, Cognitive Computing Lab, 10 Xibeiwang East Rd, Beijing 100193, People's Republic of China
Keywords
LANGUAGE; VISION
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
By exploiting cross-modal attention, cross-BERT methods have achieved state-of-the-art accuracy in cross-modal retrieval. Nevertheless, the heavy text-image interactions in the cross-BERT model are prohibitively slow for large-scale retrieval. Late-interaction methods trade off retrieval accuracy and efficiency by exploiting cross-modal interaction only in the late stage, attaining a satisfactory retrieval speed. In this work, we propose an inflating and shrinking approach to further boost the efficiency and accuracy of late-interaction methods. The inflating operation plugs several codes into the input of the encoder to exploit the text-image interactions more thoroughly for higher retrieval accuracy. The shrinking operation then gradually reduces the text-image interactions through knowledge distillation for higher efficiency. Through an inflating operation followed by a shrinking operation, both the efficiency and the accuracy of a late-interaction model are boosted. Systematic experiments on public benchmarks demonstrate the effectiveness of our inflating and shrinking approach.
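The abstract does not state which scoring function the late-interaction model uses. As an illustration only, the sketch below assumes a common MaxSim-style late interaction: text and image are encoded independently into small sets of vectors ("codes"), and they interact only at scoring time. The names `late_interaction_score`, `interaction_count`, and `rank_images` are hypothetical and are not taken from the paper. The sketch also shows why fewer codes per image (the idea behind "shrinking") means fewer text-image interactions to compute.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(text_codes: Sequence[Vector],
                           image_codes: Sequence[Vector]) -> float:
    """MaxSim-style score (an assumed form, not the paper's stated one).

    For every text code, take its best-matching image code and sum
    those maxima. Both inputs are assumed to be non-empty.
    """
    return sum(max(dot(t, v) for v in image_codes) for t in text_codes)


def interaction_count(text_codes: Sequence[Vector],
                      image_codes: Sequence[Vector]) -> int:
    """Number of dot products needed to score one text-image pair."""
    return len(text_codes) * len(image_codes)


def rank_images(text_codes: Sequence[Vector],
                gallery: Sequence[Sequence[Vector]]) -> List[int]:
    """Return gallery indices sorted from best to worst match."""
    scores = [late_interaction_score(text_codes, img) for img in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)
```

Because the image codes do not depend on the query, they can be computed offline once per image; only the cheap scoring step runs at query time. In this toy form, an image represented by one code costs `len(text_codes)` dot products per query, while an image with many codes costs proportionally more.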
Pages: 9796-9809
Number of pages: 14