LMEye: An Interactive Perception Network for Large Language Models

Cited by: 4
Authors
Li, Yunxin [1 ]
Hu, Baotian [1 ]
Chen, Xinyu [1 ]
Ma, Lin [2 ]
Xu, Yong [1 ]
Zhang, Min [1 ]
Affiliations
[1] Harbin Inst Technol, Dept Comp Sci & Technol, Shenzhen 518000, Peoples R China
[2] Meituan, Beijing 100102, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Task analysis; Data models; Tuning; Large language models; Training; Cognition; Multimodal large language models (MLLMs); visual-language learning; interactive perception network;
DOI
10.1109/TMM.2024.3428317
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Current efficient approaches to building Multimodal Large Language Models (MLLMs) mainly incorporate visual information into LLMs through a simple visual mapping network such as a linear projection layer, a multilayer perceptron (MLP), or the Q-Former from BLIP-2. Such networks project the image feature only once and do not model the interaction between the image and the human input. The resulting visual information, disconnected from human intention, may therefore be inadequate for LLMs to generate intention-following responses; we refer to it as static visual information. To alleviate this issue, this paper introduces LMEye, a human-like eye with a play-and-plug interactive perception network, designed to enable dynamic interaction between LLMs and external visual information. It allows the LLM to request the visual information it needs for a given human instruction, which we term dynamic visual information acquisition. Specifically, LMEye consists of a simple visual mapping network that provides LLMs with a basic perception of an image, plus additional modules responsible for acquiring requests from the LLM, performing request-based visual information seeking, and transmitting the resulting interacted visual information back to the LLM. In this way, the LLM understands the human query, delivers the corresponding request to the request-based visual information interaction module, and generates a response from the interleaved multimodal information. We evaluate LMEye through extensive experiments on multimodal benchmarks, showing that it significantly improves zero-shot performance on various multimodal tasks compared to previous methods, with fewer parameters. We further verify its effectiveness and scalability across various language models and on video understanding.
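The abstract describes a three-step pipeline: a static visual mapping, an LLM-issued request, and request-based visual information seeking. The following toy sketch in plain Python illustrates only the data flow of that idea; all function names, the element-wise "seeking" stand-in for cross-attention, and the keyword-based request are illustrative assumptions, not the authors' implementation.

```python
# Toy sketch of an LMEye-style interactive perception pipeline.
# Features are plain lists of floats; real systems would use tensors.

def visual_mapping(image_feature):
    # Static step: project the image feature once (a stand-in for a
    # linear layer / MLP / Q-Former). This is the "static visual information".
    return [2.0 * x for x in image_feature]

def llm_request(query, projected):
    # The LLM reads the human query plus the basic visual tokens and
    # emits a request vector. Here faked as a query-dependent weight.
    weight = 1.0 if "color" in query else 0.5
    return [weight] * len(projected)

def visual_seeking(request, image_feature):
    # Request-based interaction: re-attend to the raw image feature,
    # modulated by the request (a crude stand-in for cross-attention).
    return [r * x for r, x in zip(request, image_feature)]

def lmeye_forward(query, image_feature):
    static_visual = visual_mapping(image_feature)            # static info
    request = llm_request(query, static_visual)              # LLM's request
    dynamic_visual = visual_seeking(request, image_feature)  # dynamic info
    # The LLM would then generate from the interleaved multimodal inputs;
    # here we just return the concatenated visual streams.
    return static_visual + dynamic_visual

print(lmeye_forward("what color is the car?", [1.0, 2.0, 3.0]))
```

The point of the sketch is that the second visual pass depends on the query, so different instructions yield different "dynamic" visual information from the same image.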
Pages: 10952-10964
Page count: 13