LMEye: An Interactive Perception Network for Large Language Models

Cited by: 4
Authors
Li, Yunxin [1 ]
Hu, Baotian [1 ]
Chen, Xinyu [1 ]
Ma, Lin [2 ]
Xu, Yong [1 ]
Zhang, Min [1 ]
Affiliations
[1] Harbin Inst Technol, Dept Comp Sci & Technol, Shenzhen 518000, Peoples R China
[2] Meituan, Beijing 100102, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visualization; Task analysis; Data models; Tuning; Large language models; Training; Cognition; Multimodal large language models (MLLMs); visual-language learning; interactive perception network;
DOI
10.1109/TMM.2024.3428317
CLC number
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
Current efficient approaches to building Multimodal Large Language Models (MLLMs) mainly incorporate visual information into LLMs through a simple visual mapping network, such as a linear projection layer, a multilayer perceptron (MLP), or the Q-Former from BLIP-2. Such networks project the image features only once and do not model the interaction between the image and the human input. The resulting visual information, disconnected from human intention, may therefore be inadequate for LLMs to generate intention-following responses; we refer to it as static visual information. To alleviate this issue, this paper introduces LMEye, a human-like eye in the form of a plug-and-play interactive perception network designed to enable dynamic interaction between LLMs and external visual information. It allows the LLM to request the visual information aligned with a given human instruction, which we term dynamic visual information acquisition. Specifically, LMEye consists of a simple visual mapping network that provides the basic perception of an image for the LLM, together with modules responsible for acquiring requests from the LLM, performing request-based visual information seeking, and transmitting the resulting interacted visual information back to the LLM. In this way, the LLM understands the human query, delivers the corresponding request to the request-based visual information interaction module, and generates the response based on the interleaved multimodal information. We evaluate LMEye through extensive experiments on multimodal benchmarks, showing that it significantly improves zero-shot performance on various multimodal tasks compared to previous methods, with fewer parameters. We further verify its effectiveness and scalability on various language models and on video understanding, respectively.
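To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of the interactive perception idea: a static path that projects image features into the LLM space once, plus a dynamic path where learnable queries conditioned on the LLM's request cross-attend over the image features. This is not the authors' released implementation; the class name InteractivePerceptionNetwork, the single request-token conditioning, the learnable query tokens, and all dimensions (1024-d CLIP-like image features, 4096-d LLM hidden states, 8 heads, 5 queries) are illustrative assumptions.

import torch
import torch.nn as nn

class InteractivePerceptionNetwork(nn.Module):
    # Static path: project image features once into the LLM embedding space.
    # Dynamic path: condition learnable queries on the LLM's request and
    # cross-attend over the projected image features to fetch the visual
    # information the request asks for.
    def __init__(self, img_dim=1024, llm_dim=4096, n_heads=8, n_query=5):
        super().__init__()
        self.visual_mapping = nn.Linear(img_dim, llm_dim)   # static visual info
        self.query_tokens = nn.Parameter(torch.randn(n_query, llm_dim))
        self.cross_attn = nn.MultiheadAttention(llm_dim, n_heads, batch_first=True)
        self.out_proj = nn.Linear(llm_dim, llm_dim)

    def forward(self, image_feats, request_hidden):
        # image_feats: (B, N_patches, img_dim) from a frozen vision encoder
        # request_hidden: (B, llm_dim), e.g. the LLM hidden state at a
        # hypothetical request token summarizing the human instruction
        static_visual = self.visual_mapping(image_feats)          # (B, N, llm_dim)
        B = image_feats.size(0)
        queries = self.query_tokens.unsqueeze(0).expand(B, -1, -1) \
                  + request_hidden.unsqueeze(1)                   # (B, n_query, llm_dim)
        interacted, _ = self.cross_attn(queries, static_visual, static_visual)
        # Both streams are handed back to the LLM as extra input embeddings.
        return static_visual, self.out_proj(interacted)

# Usage with dummy tensors (shapes only; no pretrained models required):
ipn = InteractivePerceptionNetwork()
img = torch.randn(2, 257, 1024)   # e.g. ViT patch features
req = torch.randn(2, 4096)        # LLM request hidden state
static_v, dynamic_v = ipn(img, req)
print(static_v.shape, dynamic_v.shape)  # (2, 257, 4096) (2, 5, 4096)

Under this reading, only the mapping and interaction modules would be trained while the vision encoder and LLM stay frozen, which is consistent with the abstract's claim of improved zero-shot results with fewer parameters.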
Pages: 10952 - 10964
Page count: 13
Related papers
50 records in total
  • [41] INTERACTIVE CLINICAL GUIDELINES WITH LARGE LANGUAGE MODELS: THE GUTGPT SERIES ON AMERICAN GASTROENTEROLOGY ASSOCIATION GUIDELINES
    Giuffre, Mauro
    Shung, Dennis
    GASTROENTEROLOGY, 2024, 166 (05) : S892 - S893
  • [42] Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach
    Lee, Saehyung
    Yu, Sangwon
    Park, Junsung
    Yi, Jihun
    Yoon, Sungroh
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 791 - 809
  • [43] The HALLMARK Effect: Supporting Provenance and Transparent Use of Large Language Models in Writing with Interactive Visualization
    Hoque, Md Naimul
    Mashiat, Tasfia
    Ghai, Bhavya
    Shelton, Cecilia
    Chevalier, Fanny
    Kraus, Kari
    Elmqvist, Niklas
    PROCEEDINGS OF THE 2024 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, CHI 2024, 2024
  • [44] ChatGPT and Other Large Language Models as Evolutionary Engines for Online Interactive Collaborative Game Design
    Lanzi, Pier Luca
    Loiacono, Daniele
    PROCEEDINGS OF THE 2023 GENETIC AND EVOLUTIONARY COMPUTATION CONFERENCE, GECCO 2023, 2023, : 1383 - 1390
  • [45] Interactive display of large NURBS models
    Kumar, S
    Manocha, D
    Lastra, A
    IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 1996, 2 (04) : 323 - 336
  • [47] Large language models in science
    Kowalewski, Karl-Friedrich
    Rodler, Severin
    DIE UROLOGIE, 2024, 63 (9) : 860 - 866
  • [48] Constrained Language Models for Interactive Poem Generation
    Popescu-Belis, Andrei
    Atrio, Alex R.
    Minder, Valentin
    Xanthos, Aris
    Luthier, Gabriel
    Mattei, Simon
    Rodriguez, Antonio
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 3519 - 3529
  • [49] A speech planning network for interactive language use
    Castellucci, Gregg A.
    Kovach, Christopher K.
    Howard, Matthew A., III
    Greenlee, Jeremy D. W.
    Long, Michael A.
    NATURE, 2022, 602 (7895) : 117 - 122