LMEye: An Interactive Perception Network for Large Language Models

Cited by: 4
Authors
Li, Yunxin [1 ]
Hu, Baotian [1 ]
Chen, Xinyu [1 ]
Ma, Lin [2 ]
Xu, Yong [1 ]
Zhang, Min [1 ]
Affiliations
[1] Harbin Inst Technol, Dept Comp Sci & Technol, Shenzhen 518000, Peoples R China
[2] Meituan, Beijing 100102, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visualization; Task analysis; Data models; Tuning; Large language models; Training; Cognition; Multimodal large language models (MLLMs); visual-language learning; interactive perception network
DOI
10.1109/TMM.2024.3428317
Chinese Library Classification (CLC) Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Current efficient approaches to building Multimodal Large Language Models (MLLMs) mainly incorporate visual information into LLMs through a simple visual mapping network such as a linear projection layer, a multilayer perceptron (MLP), or the Q-Former from BLIP-2. Such networks project image features only once and do not model the interaction between the image and the human input. Hence, the resulting visual information, obtained without reference to human intention, may be inadequate for LLMs to generate intention-following responses; we refer to this as static visual information. To alleviate this issue, this paper introduces LMEye, a human-like eye with a plug-and-play interactive perception network designed to enable dynamic interaction between LLMs and external visual information. It allows the LLM to request the visual information that matches a given human instruction, which we term dynamic visual information acquisition. Specifically, LMEye consists of a simple visual mapping network that provides the LLM with a basic perception of an image, together with additional modules responsible for acquiring requests from the LLM, performing request-based visual information seeking, and transmitting the resulting interacted visual information back to the LLM. In this way, the LLM understands the human query, delivers the corresponding request to the request-based visual information interaction module, and generates the response from the interleaved multimodal information. We evaluate LMEye through extensive experiments on multimodal benchmarks, showing that it significantly improves zero-shot performance on various multimodal tasks compared with previous methods, while using fewer parameters. We further verify its effectiveness across various language models and its scalability to video understanding.
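The abstract describes a two-stage pipeline: a static visual mapping network produces basic image tokens, and a request-based interaction module re-queries the image features with a request emitted by the LLM. Below is a minimal PyTorch sketch of that idea. All module names, dimensions, and the single-request flow (a one-token request attending over the mapped image features) are assumptions for illustration, not the authors' released implementation.

```python
# Illustrative sketch of an LMEye-style interactive perception pipeline.
# Names, dimensions, and the single <request>-token flow are assumptions.
import torch
import torch.nn as nn


class VisualMapping(nn.Module):
    """Static stage: project frozen image-encoder features into LLM space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, vision_dim)
        return self.proj(image_feats)  # (batch, num_patches, llm_dim)


class RequestBasedInteraction(nn.Module):
    """Dynamic stage: re-query image features with the LLM's request vector."""

    def __init__(self, llm_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.out = nn.Linear(llm_dim, llm_dim)

    def forward(self, request: torch.Tensor,
                mapped_feats: torch.Tensor) -> torch.Tensor:
        # request: (batch, 1, llm_dim), e.g. the hidden state of a special
        # request token the LLM emits after reading query + static tokens.
        attended, _ = self.attn(request, mapped_feats, mapped_feats)
        return self.out(attended)  # interacted visual info, (batch, 1, llm_dim)


# One interaction round: static perception -> LLM request -> seeking -> response.
vision_dim, llm_dim = 1024, 4096
mapping = VisualMapping(vision_dim, llm_dim)
interaction = RequestBasedInteraction(llm_dim)

image_feats = torch.randn(1, 257, vision_dim)   # stand-in for frozen ViT output
static_tokens = mapping(image_feats)            # basic perception for the LLM
request = torch.randn(1, 1, llm_dim)            # stand-in for the LLM's request
dynamic_tokens = interaction(request, static_tokens)
# In the full model, dynamic_tokens would be interleaved with the text
# embeddings and fed back to the (frozen) LLM to generate the answer.
print(static_tokens.shape, dynamic_tokens.shape)
```

Because only the mapping and interaction modules are trained while the LLM and vision encoder stay frozen, such a design would add relatively few parameters, which is consistent with the abstract's claim of strong results "with fewer parameters."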
Pages: 10952-10964
Page count: 13
Related Papers
50 records (first 10 listed)
  • [1] Chat with the Environment: Interactive Multimodal Perception Using Large Language Models
    Zhao, Xufeng
    Li, Mengdi
    Weber, Cornelius
    Hafez, Muhammad Burhan
    Wermter, Stefan
    2023 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, IROS, 2023, : 3590 - 3596
  • [2] InteraRec: Interactive Recommendations Using Multimodal Large Language Models
    Karra, Saketh Reddy
    Tulabandhula, Theja
    TRENDS AND APPLICATIONS IN KNOWLEDGE DISCOVERY AND DATA MINING, PAKDD 2024 WORKSHOPS, RAFDA AND IWTA, 2024, 14658 : 32 - 43
  • [3] Recent Advances in Interactive Machine Translation With Large Language Models
    Wang, Yanshu
    Zhang, Jinyi
    Shi, Tianrong
    Deng, Dashuai
    Tian, Ye
    Matsumoto, Tadahiro
    IEEE ACCESS, 2024, 12 : 179353 - 179382
  • [4] Developing an Interactive OpenMP Programming Book with Large Language Models
    Yi, Xinyao
    Wang, Anjia
    Yan, Yonghong
    Liao, Chunhua
    ADVANCING OPENMP FOR FUTURE ACCELERATORS, IWOMP 2024, 2024, 15195 : 176 - 194
  • [5] Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models
    Zhu, Hongyi
    Huang, Jia-Hong
    Rudinac, Stevan
    Kanoulas, Evangelos
    PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024, 2024, : 978 - 987
  • [6] Navigation Instruction Generation with BEV Perception and Large Language Models
    Fan, Sheng
    Liu, Rui
    Wang, Wenguan
    Yang, Yi
    COMPUTER VISION-ECCV 2024, PT XXII, 2025, 15080 : 368 - 387
  • [7] Toward Interactive Next Location Prediction Driven by Large Language Models
    Chen, Yong
    Chi, Ben
    Li, Chuanjia
    Zhang, Yuliang
    Liao, Chenlei
    Chen, Xiqun
    Xie, Na
    IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, 2025,
  • [8] Leveraging Large Language Models for Goal-aware Interactive Recommendations
    Said, Alan
    Willemsen, Martijn
    Marinho, Leandro Balby
    Silva, Itallo
    PROCEEDINGS OF THE 11TH CONFERENCE ON HUMAN-AGENT INTERACTION, HAI 2023, 2023, : 464 - 466
  • [9] Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning
    Carta, Thomas
    Romac, Clement
    Wolf, Thomas
    Lamprier, Sylvain
    Sigaud, Olivier
    Oudeyer, Pierre-Yves
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 202, 2023, 202
  • [10] Driving and suppressing the human language network using large language models
    Tuckute, Greta
    Sathe, Aalok
    Srikant, Shashank
    Taliaferro, Maya
    Wang, Mingye
    Schrimpf, Martin
    Kay, Kendrick
    Fedorenko, Evelina
    NATURE HUMAN BEHAVIOUR, 2024, 8 (03) : 544 - 561