LMEye: An Interactive Perception Network for Large Language Models

Cited by: 4
Authors
Li, Yunxin [1 ]
Hu, Baotian [1 ]
Chen, Xinyu [1 ]
Ma, Lin [2 ]
Xu, Yong [1 ]
Zhang, Min [1 ]
Affiliations
[1] Harbin Inst Technol, Dept Comp Sci & Technol, Shenzhen 518000, Peoples R China
[2] Meituan, Beijing 100102, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visualization; Task analysis; Data models; Tuning; Large language models; Training; Cognition; Multimodal large language models (MLLMs); visual-language learning; interactive perception network
DOI
10.1109/TMM.2024.3428317
Chinese Library Classification (CLC) Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Current efficient approaches to building Multimodal Large Language Models (MLLMs) mainly incorporate visual information into LLMs through a simple visual mapping network such as a linear projection layer, a multilayer perceptron (MLP), or the Q-Former from BLIP-2. Such networks project image features only once and do not model the interaction between the image and the human input. Hence, the resulting visual information, obtained without reference to human intention, may be inadequate for LLMs to generate intention-following responses; we refer to this as static visual information. To alleviate this issue, this paper introduces LMEye, a human-like eye with a plug-and-play interactive perception network designed to enable dynamic interaction between LLMs and external visual information. It allows the LLM to request the visual information that matches a given human instruction, which we term dynamic visual information acquisition. Specifically, LMEye consists of a simple visual mapping network that provides the LLM with a basic perception of an image, together with additional modules responsible for acquiring requests from the LLM, performing request-based visual information seeking, and transmitting the resulting interacted visual information back to the LLM. In this way, the LLM understands the human query, delivers the corresponding request to the request-based visual information interaction module, and generates the response from the interleaved multimodal information. We evaluate LMEye through extensive experiments on multimodal benchmarks, showing that it significantly improves zero-shot performance on various multimodal tasks compared with previous methods, while using fewer parameters. We further verify its effectiveness across various language models and its scalability to video understanding.
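The abstract describes a two-stage pipeline: a static visual mapping network produces basic image tokens, and a request-based interaction module re-queries the image features with a request emitted by the LLM. Below is a minimal PyTorch sketch of that idea. All module names, dimensions, and the single-request flow (a one-token request attending over the mapped image features) are assumptions for illustration, not the authors' released implementation.

```python
# Illustrative sketch of an LMEye-style interactive perception pipeline.
# Names, dimensions, and the single <request>-token flow are assumptions.
import torch
import torch.nn as nn


class VisualMapping(nn.Module):
    """Static stage: project frozen image-encoder features into LLM space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, vision_dim)
        return self.proj(image_feats)  # (batch, num_patches, llm_dim)


class RequestBasedInteraction(nn.Module):
    """Dynamic stage: re-query image features with the LLM's request vector."""

    def __init__(self, llm_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.out = nn.Linear(llm_dim, llm_dim)

    def forward(self, request: torch.Tensor,
                mapped_feats: torch.Tensor) -> torch.Tensor:
        # request: (batch, 1, llm_dim), e.g. the hidden state of a special
        # request token the LLM emits after reading query + static tokens.
        attended, _ = self.attn(request, mapped_feats, mapped_feats)
        return self.out(attended)  # interacted visual info, (batch, 1, llm_dim)


# One interaction round: static perception -> LLM request -> seeking -> response.
vision_dim, llm_dim = 1024, 4096
mapping = VisualMapping(vision_dim, llm_dim)
interaction = RequestBasedInteraction(llm_dim)

image_feats = torch.randn(1, 257, vision_dim)   # stand-in for frozen ViT output
static_tokens = mapping(image_feats)            # basic perception for the LLM
request = torch.randn(1, 1, llm_dim)            # stand-in for the LLM's request
dynamic_tokens = interaction(request, static_tokens)
# In the full model, dynamic_tokens would be interleaved with the text
# embeddings and fed back to the (frozen) LLM to generate the answer.
print(static_tokens.shape, dynamic_tokens.shape)
```

Because only the mapping and interaction modules are trained while the LLM and vision encoder stay frozen, such a design would add relatively few parameters, which is consistent with the abstract's claim of strong results "with fewer parameters."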
Pages: 10952-10964
Page count: 13
Related Papers
50 records (first 10 listed)
  • [1] Chat with the Environment: Interactive Multimodal Perception Using Large Language Models
    Zhao, Xufeng
    Li, Mengdi
    Weber, Cornelius
    Hafez, Muhammad Burhan
    Wermter, Stefan
    2023 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, IROS, 2023, : 3590 - 3596
  • [2] InteraRec: Interactive Recommendations Using Multimodal Large Language Models
    Karra, Saketh Reddy
    Tulabandhula, Theja
    TRENDS AND APPLICATIONS IN KNOWLEDGE DISCOVERY AND DATA MINING, PAKDD 2024 WORKSHOPS, RAFDA AND IWTA, 2024, 14658 : 32 - 43
  • [3] Recent Advances in Interactive Machine Translation With Large Language Models
    Wang, Yanshu
    Zhang, Jinyi
    Shi, Tianrong
    Deng, Dashuai
    Tian, Ye
    Matsumoto, Tadahiro
    IEEE ACCESS, 2024, 12 : 179353 - 179382
  • [4] Developing an Interactive OpenMP Programming Book with Large Language Models
    Yi, Xinyao
    Wang, Anjia
    Yan, Yonghong
    Liao, Chunhua
    ADVANCING OPENMP FOR FUTURE ACCELERATORS, IWOMP 2024, 2024, 15195 : 176 - 194
  • [5] Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models
    Zhu, Hongyi
    Huang, Jia-Hong
    Rudinac, Stevan
    Kanoulas, Evangelos
    PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024, 2024, : 978 - 987
  • [6] Navigation Instruction Generation with BEV Perception and Large Language Models
    Fan, Sheng
    Liu, Rui
    Wang, Wenguan
    Yang, Yi
    COMPUTER VISION-ECCV 2024, PT XXII, 2025, 15080 : 368 - 387
  • [7] Toward Interactive Next Location Prediction Driven by Large Language Models
    Chen, Yong
    Chi, Ben
    Li, Chuanjia
    Zhang, Yuliang
    Liao, Chenlei
    Chen, Xiqun
    Xie, Na
    IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, 2025,
  • [8] Leveraging Large Language Models for Goal-aware Interactive Recommendations
    Said, Alan
    Willemsen, Martijn
    Marinho, Leandro Balby
    Silva, Itallo
    PROCEEDINGS OF THE 11TH CONFERENCE ON HUMAN-AGENT INTERACTION, HAI 2023, 2023, : 464 - 466
  • [9] Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning
    Carta, Thomas
    Romac, Clement
    Wolf, Thomas
    Lamprier, Sylvain
    Sigaud, Olivier
    Oudeyer, Pierre-Yves
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 202, 2023, 202
  • [10] Driving and suppressing the human language network using large language models
    Tuckute, Greta
    Sathe, Aalok
    Srikant, Shashank
    Taliaferro, Maya
    Wang, Mingye
    Schrimpf, Martin
    Kay, Kendrick
    Fedorenko, Evelina
    NATURE HUMAN BEHAVIOUR, 2024, 8 (03) : 544 - 561