Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models

Cited by: 2

Authors:

Zhu, Hongyi [1]

Huang, Jia-Hong [1]

Rudinac, Stevan [1]

Kanoulas, Evangelos [1]

Affiliations:

[1] Univ Amsterdam, Amsterdam, Netherlands

Source:

PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024 | 2024

Keywords:

Interactive Image Retrieval; Query Rewriting; Vision Language Models; Large Language Models; INFORMATION;

DOI:

10.1145/3652583.3658032

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Image search stands as a pivotal task in multimedia and computer vision, finding applications across diverse domains, ranging from internet search to medical diagnostics. Conventional image search systems operate by accepting textual or visual queries and retrieving the most relevant candidate results from the database. However, prevalent methods often rely on single-turn procedures, introducing potential inaccuracies and limited recall. These methods also face challenges such as vocabulary mismatch and the semantic gap, which constrain their overall effectiveness. To address these issues, we propose an interactive image retrieval system capable of refining queries based on user relevance feedback in a multi-turn setting. This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries, resulting in more informative queries with each iteration. Moreover, we introduce a large language model (LLM) based denoiser to refine text-based query expansions, mitigating inaccuracies in image descriptions generated by captioning models. To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task, offering multiple relevant ground truth images for each query. Through comprehensive experiments, we validate the effectiveness of our proposed system against baseline methods, achieving state-of-the-art performance with a notable 10% improvement in terms of recall. Our contributions encompass the development of an innovative interactive image retrieval system, the integration of an LLM-based denoiser, the curation of a meticulously designed evaluation dataset, and thorough experimental validation.
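
The multi-turn loop described in the abstract (retrieve, collect relevance feedback, caption the relevant images, denoise the expanded query, retrieve again) can be illustrated with a toy sketch. This is not the authors' implementation: the bag-of-words `embed` function stands in for a real encoder, `caption` stands in for the VLM captioner, `denoise` stands in for the LLM denoiser (here it only deduplicates words), and user feedback is simulated from a known set of relevant images. All function names and the sample data are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a real encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, index, k, seen):
    """Return up to k image ids not shown before, best match first."""
    q = embed(query)
    unseen = [i for i in index if i not in seen]
    unseen.sort(key=lambda i: cosine(q, embed(index[i])), reverse=True)
    return unseen[:k]


def caption(image_id, index):
    """Hypothetical VLM captioner stub: returns the stored description."""
    return index[image_id]


def denoise(query, captions):
    """Hypothetical LLM denoiser stub: appends unseen caption words only."""
    words = query.split()
    known = {w.lower() for w in words}
    for cap in captions:
        for w in cap.lower().split():
            if w not in known:
                known.add(w)
                words.append(w)
    return " ".join(words)


def interactive_search(query, index, relevant, turns=3, k=2):
    """Run the multi-turn loop and report recall over all turns."""
    seen, found = set(), set()
    for _ in range(turns):
        results = retrieve(query, index, k, seen)
        seen.update(results)
        liked = [i for i in results if i in relevant]  # simulated feedback
        found.update(liked)
        if liked:
            query = denoise(query, [caption(i, index) for i in liked])
    recall = len(found) / len(relevant) if relevant else 0.0
    return found, recall, query


# Hypothetical sample data: image id -> textual description.
index = {
    "img1": "a man playing guitar on stage",
    "img2": "a band playing guitar and drums on stage",
    "img3": "a cat sleeping on a sofa",
    "img4": "drums solo at a concert stage",
}
relevant = {"img1", "img2", "img4"}

single_turn = interactive_search("guitar", index, relevant, turns=1)
multi_turn = interactive_search("guitar", index, relevant, turns=3)
print("single-turn recall:", single_turn[1])
print("multi-turn recall:", multi_turn[1])
```

In this toy setting the single-turn search misses the image whose description does not mention the original query term, while the expanded query in later turns reaches it. That mirrors the vocabulary-mismatch argument in the abstract, but says nothing about the size of the effect on real data.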


Pages: 978-987

Page count: 10
