20 references in total
- [1] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision
- [2] CHEN Y C, LI L J, YU L C, et al. UNITER: universal image-text representation learning
- [3] ZHENG Z D, ZHENG L, GARRETT M, et al. Dual-path convolutional image-text embeddings with instance loss. ACM Transactions on Multimedia Computing, Communications and Applications, 16(2): 244-266, (2020)
- [4] JIA C, YANG Y F, XIA Y, et al. Scaling up visual and vision-language representation learning with noisy text supervision
- [5] LU J S, BATRA D, PARIKH D, et al. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks [EB/OL]
- [6] GAO D H, JIN L B, CHEN B, et al. FashionBERT: text and image matching with adaptive loss for cross-modal retrieval∥Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2251-2260, (2020)
- [7] SU W J, ZHU X Z, CAO Y, et al. VL-BERT: pre-training of generic visual-linguistic representations
- [8] TAN H, BANSAL M. LXMERT: learning cross-modality encoder representations from transformers
- [9] WANG W H, BAO H B, DONG L, et al. Image as a foreign language: BEiT pretraining for all vision and vision-language tasks
- [10] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: transformers for image recognition at scale