Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features

Cited by: 4

Authors:

Baldrati, Alberto [1,2]

Bertini, Marco [1]

Uricchio, Tiberio [3]

Del Bimbo, Alberto [1]

Affiliations:

[1] Univ Firenze, Viale Morgagni 65, I-50124 Florence, Italy

[2] Univ Pisa, Largo Bruno Pontecorvo 3, I-56127 Pisa, Italy

[3] Univ Macerata, Via Garibaldi 20, I-62100 Macerata, Italy

Source:

ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS | 2024 / Vol. 20 / Issue 03

Funding:

European Union Horizon 2020;

Keywords:

Multimodal retrieval; combiner networks; vision language model;

DOI:

10.1145/3617597

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images that are visually similar to the reference image and that integrate the modifications expressed by the caption. Given that recent research has demonstrated the efficacy of large-scale vision and language pre-trained (VLP) models in various tasks, we rely on features from the OpenAI CLIP model to tackle the considered task. We initially perform a task-oriented fine-tuning of both CLIP encoders using the element-wise sum of visual and textual features. Then, in the second stage, we train a Combiner network that learns to combine the image-text features integrating the bimodal information and providing combined features used to perform the retrieval. We use contrastive learning in both stages of training. Starting from the bare CLIP features as a baseline, experimental results show that the task-oriented fine-tuning and the carefully crafted Combiner network are highly effective and outperform more complex state-of-the-art approaches on FashionIQ and CIRR, two popular and challenging datasets for composed image retrieval. Code and pre-trained models are available at https://github.com/ABaldrati/CLIP4Cir.
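The first training stage described in the abstract (element-wise sum of image and text features, trained with a contrastive objective) can be illustrated with a minimal sketch. This is not the authors' implementation, which is available at the repository linked above. The function names, the InfoNCE-style form of the loss, the temperature value, and the random vectors standing in for CLIP features are all assumptions made for illustration.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def combine_by_sum(image_feat, text_feat):
    """Element-wise sum of image and text features, then L2-normalize."""
    return l2_normalize([a + b for a, b in zip(image_feat, text_feat)])


def contrastive_loss(queries, targets, temperature=0.1):
    """Batch-based contrastive (InfoNCE-style) loss.

    queries[i] is expected to match targets[i]; every other target in the
    batch acts as a negative. The temperature is an assumed value.
    """
    total = 0.0
    for i, q in enumerate(queries):
        # Scaled dot products between this query and every target.
        logits = [sum(a * b for a, b in zip(q, t)) / temperature for t in targets]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[i]  # negative log-softmax of the match
    return total / len(queries)


def random_unit_vector(dim):
    """Hypothetical stand-in for a CLIP feature vector."""
    return l2_normalize([random.gauss(0.0, 1.0) for _ in range(dim)])


random.seed(0)
dim, batch = 8, 4
ref_images = [random_unit_vector(dim) for _ in range(batch)]
captions = [random_unit_vector(dim) for _ in range(batch)]
targets = [random_unit_vector(dim) for _ in range(batch)]

# Each query is the summed feature of a reference image and its caption.
queries = [combine_by_sum(i, c) for i, c in zip(ref_images, captions)]
loss = contrastive_loss(queries, targets)
print(loss)
```

In the second stage the abstract replaces the fixed sum with a learned Combiner network; in this sketch that would correspond to swapping `combine_by_sum` for a trainable function while keeping the same kind of loss.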


Pages: 24
