Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models

Cited by: 2
Authors
Zhu, Hongyi [1]
Huang, Jia-Hong [1]
Rudinac, Stevan [1]
Kanoulas, Evangelos [1]
Affiliations
[1] Univ Amsterdam, Amsterdam, Netherlands
Keywords
Interactive Image Retrieval; Query Rewriting; Vision Language Models; Large Language Models; INFORMATION
DOI
10.1145/3652583.3658032
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Image search is a pivotal task in multimedia and computer vision, with applications ranging from internet search to medical diagnostics. Conventional image search systems accept textual or visual queries and retrieve the most relevant candidate results from a database. However, prevalent methods often rely on single-turn procedures, which can introduce inaccuracies and limit recall. These methods also face challenges such as vocabulary mismatch and the semantic gap, which constrain their overall effectiveness. To address these issues, we propose an interactive image retrieval system that refines queries based on user relevance feedback in a multi-turn setting. The system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries, yielding more informative queries with each iteration. Moreover, we introduce a large language model (LLM) based denoiser to refine text-based query expansions, mitigating inaccuracies in the image descriptions generated by captioning models. To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task, offering multiple relevant ground-truth images for each query. Through comprehensive experiments, we validate the effectiveness of the proposed system against baseline methods, achieving state-of-the-art performance with a notable 10% improvement in recall. Our contributions comprise the development of an innovative interactive image retrieval system, the integration of an LLM-based denoiser, the curation of a meticulously designed evaluation dataset, and thorough experimental validation.
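The multi-turn loop described in the abstract (retrieve, collect relevance feedback, caption the images marked relevant, denoise the expansion, retrieve again) can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`retrieve`, `toy_denoiser`, `interactive_search`, `INDEX`, `CAPTIONS`, `RELEVANT`) is hypothetical, word overlap stands in for a real image-text retriever, a lookup table stands in for the VLM captioner, and a stopword filter stands in for the LLM-based denoiser.

```python
from typing import Callable, Dict, List, Set, Tuple

# Assumption: a tiny stopword list stands in for the LLM-based denoiser's filtering.
STOPWORDS = {"a", "an", "the", "at", "on", "in", "of", "with"}


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenization into a set of words."""
    return set(text.lower().split())


def retrieve(query: str, index: Dict[str, str], k: int) -> List[str]:
    """Rank image ids by word overlap with the query (ties broken by id), keep top k."""
    q = tokenize(query)
    ranked = sorted(index, key=lambda img: (-len(q & tokenize(index[img])), img))
    return ranked[:k]


def recall_at_k(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of the relevant images that appear in the retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def toy_denoiser(query: str, captions: List[str]) -> str:
    """Append caption words that are neither stopwords nor already in the query."""
    words = query.lower().split()
    for caption in captions:
        for word in caption.lower().split():
            if word not in STOPWORDS and word not in words:
                words.append(word)
    return " ".join(words)


def interactive_search(
    query: str,
    index: Dict[str, str],
    captioner: Callable[[str], str],
    denoiser: Callable[[str, List[str]], str],
    is_relevant: Callable[[str], bool],
    turns: int,
    k: int,
) -> Tuple[str, List[str]]:
    """Run up to `turns` rounds of feedback-driven query rewriting."""
    results = retrieve(query, index, k)
    for _ in range(turns):
        liked = [img for img in results if is_relevant(img)]
        if not liked:  # no positive feedback, nothing to expand with
            break
        query = denoiser(query, [captioner(img) for img in liked])
        results = retrieve(query, index, k)
    return query, results


# Toy data: text descriptions stand in for indexed image content.
INDEX = {
    "a_cat": "cat sleeping on sofa",
    "b_dog": "dog running in park",
    "c_guitarist": "man playing guitar on stage",
    "d_band": "band concert stage lights",
    "e_singer": "singer concert microphone",
}
CAPTIONS = {"c_guitarist": "a man plays guitar at a concert on stage"}
RELEVANT = {"c_guitarist", "d_band", "e_singer"}

first_results = retrieve("guitar", INDEX, 3)
final_query, final_results = interactive_search(
    "guitar", INDEX, lambda img: CAPTIONS[img], toy_denoiser,
    lambda img: img in RELEVANT, turns=1, k=3,
)
print(first_results, recall_at_k(first_results, RELEVANT))
print(final_query, final_results, recall_at_k(final_results, RELEVANT))
```

On this toy data, the single-turn query retrieves one of the three relevant images, while one round of feedback expands the query to "guitar man plays concert stage" and retrieves all three. The numbers only illustrate the mechanism and say nothing about the results reported in the paper.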
Pages: 978-987
Page count: 10