Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features

Cited by: 4

Authors:

Baldrati, Alberto [1,2]

Bertini, Marco [1]

Uricchio, Tiberio [3]

Del Bimbo, Alberto [1]

Affiliations:

[1] Univ Firenze, Viale Morgagni 65, I-50124 Florence, Italy

[2] Univ Pisa, Largo Bruno Pontecorvo 3, I-56127 Pisa, Italy

[3] Univ Macerata, Via Garibaldi 20, I-62100 Macerata, Italy

Source:

ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS | 2024 / Vol. 20 / Issue 03

Funding:

European Union Horizon 2020;

Keywords:

Multimodal retrieval; combiner networks; vision language model;

DOI:

10.1145/3617597

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images that are visually similar to the reference image while integrating the modifications expressed by the caption. Given that recent research has demonstrated the efficacy of large-scale vision and language pre-trained (VLP) models in various tasks, we rely on features from the OpenAI CLIP model to tackle the considered task. We first perform a task-oriented fine-tuning of both CLIP encoders using the element-wise sum of visual and textual features. In a second stage, we train a Combiner network that learns to fuse the image and text features, integrating the bimodal information into combined features used to perform the retrieval. We use contrastive learning in both stages of training. Starting from the bare CLIP features as a baseline, experimental results show that the task-oriented fine-tuning and the carefully crafted Combiner network are highly effective and outperform more complex state-of-the-art approaches on FashionIQ and CIRR, two popular and challenging datasets for composed image retrieval. Code and pre-trained models are available at https://github.com/ABaldrati/CLIP4Cir.
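
The first stage described in the abstract (element-wise sum of visual and textual features, trained with a contrastive objective) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the random arrays stand in for CLIP encoder outputs, and the L2 normalization, the temperature value, and the function names `sum_combine`, `batch_contrastive_loss`, and `retrieve` are assumptions made for illustration only.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def sum_combine(ref_img_feats, text_feats):
    """Element-wise sum of image and text features, then normalization
    (the normalization step is an assumption of this sketch)."""
    return l2_normalize(ref_img_feats + text_feats)


def batch_contrastive_loss(query_feats, target_feats, temperature=0.01):
    """Batch-based contrastive (InfoNCE-style) loss: row i of query_feats
    should match row i of target_feats; all other rows act as negatives.
    The temperature value is a hypothetical choice."""
    logits = query_feats @ target_feats.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def retrieve(query_feats, index_feats, k=5):
    """Indices of the k index rows with the highest dot-product similarity."""
    sims = query_feats @ index_feats.T
    return np.argsort(-sims, axis=1)[:, :k]


# Random stand-ins for encoder outputs (not real CLIP features).
rng = np.random.default_rng(0)
batch, dim = 8, 64
ref = l2_normalize(rng.normal(size=(batch, dim)))   # reference image features
txt = l2_normalize(rng.normal(size=(batch, dim)))   # relative caption features
tgt = l2_normalize(rng.normal(size=(batch, dim)))   # target image features

queries = sum_combine(ref, txt)
loss = batch_contrastive_loss(queries, tgt)
ranking = retrieve(queries, tgt, k=5)
print(f"loss={loss:.3f}, ranking shape={ranking.shape}")
```

In the second stage the abstract describes, the fixed sum in `sum_combine` would be replaced by a learned Combiner network while the same kind of contrastive objective is kept; the details of that network are not given in this record.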


Pages: 24
