Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment

Authors
Schall, Konstantin [1]
Barthel, Kai Uwe [1]
Hezel, Nico [1]
Jung, Klaus [1]
Affiliations
[1] HTW Berlin, Visual Comp Grp, D-12459 Berlin, Germany
Keywords
Multi-modal similarity search; Content-based image retrieval; Representations learning for general-purpose feature extraction
DOI
10.1007/978-3-031-75823-2_9
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Contrastive Language-Image Pre-training (CLIP), a transformative method in multimedia retrieval, typically trains two neural networks concurrently to generate joint embeddings for text and image pairs. However, when applied directly, these models often struggle to differentiate between visually distinct images that have similar captions, resulting in suboptimal performance for image-based similarity searches. This paper addresses the challenge of optimizing CLIP models for various image-based similarity search scenarios, while maintaining their effectiveness in text-based search tasks such as text-to-image retrieval and zero-shot classification. We propose and evaluate two novel methods aimed at refining the retrieval capabilities of CLIP without compromising the alignment between text and image embeddings. Through comprehensive experiments, we demonstrate that these methods enhance CLIP's performance on various benchmarks, including image retrieval, k-NN classification, and zero-shot text-based classification, while maintaining robustness in text-to-image retrieval using only one embedding per image.
Pages: 97-110
Page count: 14