Cross-Modal Retrieval Algorithm for Image and Text Based on Pre-Trained Models and Encoders

Cited: 0
Authors
Chen X. [1]
Peng J. [1]
Zhang P. [1]
Luo Z. [2]
Ou Z. [2]
Affiliations
[1] State Grid Hebei Information and Telecommunication Branch, Shijiazhuang
[2] College of Computer Science, Beijing University of Posts and Telecommunications, Beijing
Keywords
cross-modal retrieval algorithm; dual encoders; fusion encoders; pre-trained model
DOI
10.13190/j.jbupt.2023-146
Abstract
At present, mainstream image-text cross-modal retrieval models follow one of two architectures: dual encoders or fusion encoders. The dual-encoder architecture offers high retrieval efficiency but insufficient accuracy, while the fusion-encoder architecture achieves high accuracy at low efficiency. To address these limitations, a new image-text cross-modal retrieval algorithm is proposed. First, a recall-then-rerank strategy is introduced, in which a dual encoder performs coarse recall and a fusion encoder performs precise ranking. Second, a method is proposed for building the dual encoders and fusion encoders on a multi-channel Transformer pre-trained model, achieving high-quality semantic alignment between texts and images and improving retrieval performance. Experimental results on two public datasets, Microsoft Common Objects in Context (MS COCO) and Flickr30k, demonstrate the effectiveness of the proposed algorithm. © 2023 Beijing University of Posts and Telecommunications. All rights reserved.
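The record gives no implementation details, so the following is only a minimal PyTorch sketch of the recall-then-rerank idea under stated assumptions: a shared "multiway" Transformer block with per-modality feed-forward experts stands in for the multi-channel pre-trained backbone (in the spirit of models such as BEiT-3); the dual-encoder path embeds each modality independently for coarse cosine-similarity recall; the fusion path jointly encodes a (text, image) token pair for precise reranking. All names here (MultiwayBlock, Retriever, retrieve, match_head) are illustrative assumptions, not the authors' code.

```python
# Sketch of dual-encoder recall followed by fusion-encoder reranking.
# Weights are randomly initialized for demonstration; in practice they
# would come from the multi-channel pre-trained model.
import torch
import torch.nn.functional as F
from torch import nn

DIM, HEADS = 256, 8

class MultiwayBlock(nn.Module):
    """Transformer block with shared self-attention and per-channel
    feed-forward experts (text / image / fusion)."""
    def __init__(self, dim=DIM, heads=HEADS):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn = nn.ModuleDict({
            c: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                             nn.Linear(4 * dim, dim))
            for c in ("text", "image", "fusion")})

    def forward(self, x, channel):
        h = self.n1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn[channel](self.n2(x))

class Retriever(nn.Module):
    def __init__(self, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(MultiwayBlock() for _ in range(depth))
        self.match_head = nn.Linear(DIM, 1)  # pairwise match score

    def encode(self, tokens, channel):
        for blk in self.blocks:
            tokens = blk(tokens, channel)
        return tokens

    def embed(self, tokens, channel):
        # Dual-encoder path: single-modality encoding, mean-pooled and
        # L2-normalized so recall is a cosine-similarity matrix product.
        return F.normalize(self.encode(tokens, channel).mean(1), dim=-1)

    def match(self, text_tokens, image_tokens):
        # Fusion path: concatenate the token sequences and encode them
        # jointly so text and image tokens attend to each other.
        pair = torch.cat([text_tokens, image_tokens], dim=1)
        return self.match_head(self.encode(pair, "fusion").mean(1)).squeeze(-1)

@torch.no_grad()
def retrieve(model, text_tokens, gallery_tokens, k=8):
    """Stage 1: coarse recall of the top-k images with the dual encoder.
    Stage 2: precise reranking of those k candidates with the fusion
    encoder, whose scores override the cosine ordering."""
    q = model.embed(text_tokens, "text")                        # (1, DIM)
    g = torch.stack([model.embed(t, "image")[0] for t in gallery_tokens])
    topk = (q @ g.T).topk(min(k, len(gallery_tokens))).indices[0]
    scores = torch.stack([model.match(text_tokens, gallery_tokens[i])[0]
                          for i in topk])
    return topk[scores.argsort(descending=True)]

if __name__ == "__main__":
    model = Retriever().eval()
    query = torch.randn(1, 16, DIM)                    # tokenized caption
    gallery = [torch.randn(1, 49, DIM) for _ in range(100)]  # patch tokens
    print(retrieve(model, query, gallery)[:5])
```

This split is what recovers the efficiency of dual encoders while keeping fusion-level accuracy: gallery embeddings can be precomputed offline and recall reduces to one matrix product, so the expensive joint fusion pass runs only on the k recalled candidates.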
Pages: 112-117
Page count: 5