Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment
Cited: 0
Authors:
Schall, Konstantin [1]; Barthel, Kai Uwe [1]; Hezel, Nico [1]; Jung, Klaus [1]
Affiliations:
[1] HTW Berlin, Visual Comp Grp, D-12459 Berlin, Germany
Keywords:
Multi-modal similarity search;
Content-based image retrieval;
Representation learning for general-purpose feature extraction
DOI:
10.1007/978-3-031-75823-2_9
Chinese Library Classification:
TP [Automation & Computer Technology];
Discipline classification code:
0812
Abstract:
Contrastive Language and Image Pairing (CLIP), a transformative method in multimedia retrieval, typically trains two neural networks concurrently to generate joint embeddings for text and image pairs. However, when applied directly, these models often struggle to differentiate between visually distinct images that have similar captions, resulting in suboptimal performance for image-based similarity searches. This paper addresses the challenge of optimizing CLIP models for various image-based similarity search scenarios, while maintaining their effectiveness in text-based search tasks such as text-to-image retrieval and zero-shot classification. We propose and evaluate two novel methods aimed at refining the retrieval capabilities of CLIP without compromising the alignment between text and image embeddings. Through comprehensive experiments, we demonstrate that these methods enhance CLIP's performance on various benchmarks, including image retrieval, k-NN classification, and zero-shot text-based classification, while maintaining robustness in text-to-image retrieval using only one embedding per image.
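The contrastive objective underlying CLIP, as referenced in the abstract, can be illustrated with a minimal symmetric cross-entropy (InfoNCE-style) loss over a batch of image-text pairs. This is a NumPy sketch for illustration only, not the authors' implementation; the temperature value is an assumption.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs.

    img_emb, txt_emb: arrays of shape (N, D), row i of each forming a pair.
    """
    # L2-normalize so that dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature      # (N, N); matching pairs on the diagonal
    labels = np.arange(logits.shape[0])

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)                      # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()                  # diagonal = correct pairs

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Perfectly aligned, mutually distinct pairs drive the loss toward zero; captions that are similar across visually distinct images flatten the off-diagonal logits, which is the failure mode for image-based search that the paper targets.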
Pages: 97-110
Page count: 14