LMEye: An Interactive Perception Network for Large Language Models

Cited by: 4
Authors
Li, Yunxin [1 ]
Hu, Baotian [1 ]
Chen, Xinyu [1 ]
Ma, Lin [2 ]
Xu, Yong [1 ]
Zhang, Min [1 ]
Affiliations
[1] Harbin Inst Technol, Dept Comp Sci & Technol, Shenzhen 518000, Peoples R China
[2] Meituan, Beijing 100102, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Task analysis; Data models; Tuning; Large language models; Training; Cognition; Multimodal large language models (MLLMs); visual-language learning; interactive perception network;
DOI
10.1109/TMM.2024.3428317
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Current efficient approaches to building Multimodal Large Language Models (MLLMs) mainly incorporate visual information into LLMs through a simple visual mapping network such as a linear projection layer, a multilayer perceptron (MLP), or the Q-Former from BLIP-2. Such networks project the image feature only once and do not model the interaction between the image and the human input. The resulting visual information, disconnected from human intention, may therefore be inadequate for LLMs to generate intention-following responses; we refer to it as static visual information. To alleviate this issue, this paper introduces LMEye, a human-like eye with a play-and-plug interactive perception network, designed to enable dynamic interaction between LLMs and external visual information. It allows the LLM to request the visual information it needs for a given human instruction, which we term dynamic visual information acquisition. Specifically, LMEye consists of a simple visual mapping network that provides LLMs with a basic perception of an image, plus additional modules responsible for acquiring requests from the LLM, performing request-based visual information seeking, and transmitting the resulting interacted visual information back to the LLM. In this way, the LLM understands the human query, delivers the corresponding request to the request-based visual information interaction module, and generates a response from the interleaved multimodal information. We evaluate LMEye through extensive experiments on multimodal benchmarks, showing that it significantly improves zero-shot performance on various multimodal tasks compared to previous methods, with fewer parameters. We further verify its effectiveness and scalability across various language models and on video understanding.
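The abstract describes a three-step pipeline: a static visual mapping, an LLM-issued request, and request-based visual information seeking. The following toy sketch in plain Python illustrates only the data flow of that idea; all function names, the element-wise "seeking" stand-in for cross-attention, and the keyword-based request are illustrative assumptions, not the authors' implementation.

```python
# Toy sketch of an LMEye-style interactive perception pipeline.
# Features are plain lists of floats; real systems would use tensors.

def visual_mapping(image_feature):
    # Static step: project the image feature once (a stand-in for a
    # linear layer / MLP / Q-Former). This is the "static visual information".
    return [2.0 * x for x in image_feature]

def llm_request(query, projected):
    # The LLM reads the human query plus the basic visual tokens and
    # emits a request vector. Here faked as a query-dependent weight.
    weight = 1.0 if "color" in query else 0.5
    return [weight] * len(projected)

def visual_seeking(request, image_feature):
    # Request-based interaction: re-attend to the raw image feature,
    # modulated by the request (a crude stand-in for cross-attention).
    return [r * x for r, x in zip(request, image_feature)]

def lmeye_forward(query, image_feature):
    static_visual = visual_mapping(image_feature)            # static info
    request = llm_request(query, static_visual)              # LLM's request
    dynamic_visual = visual_seeking(request, image_feature)  # dynamic info
    # The LLM would then generate from the interleaved multimodal inputs;
    # here we just return the concatenated visual streams.
    return static_visual + dynamic_visual

print(lmeye_forward("what color is the car?", [1.0, 2.0, 3.0]))
```

The point of the sketch is that the second visual pass depends on the query, so different instructions yield different "dynamic" visual information from the same image.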
Pages: 10952-10964
Page count: 13