Merlin: Empowering Multimodal LLMs with Foresight Minds

Citations: 0

Authors:

Yu, En [1]

Zhao, Liang [2]

Wei, Yana [3]

Yang, Jinrong [3]

Wu, Dongming [4]

Kong, Lingyu [5]

Wei, Haoran [2]

Wang, Tiancai [2]

Ge, Zheng [2]

Zhang, Xiangyu [2]

Tao, Wenbing [1]

Affiliations:

[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China

[2] MEGVII Technol, Beijing, Peoples R China

[3] ShanghaiTech Univ, Shanghai, Peoples R China

[4] Beijing Inst Technol, Beijing, Peoples R China

[5] Univ Chinese Acad Sci, Beijing, Peoples R China

Source:

COMPUTER VISION - ECCV 2024, PT IV | 2025 / Vol. 15062

Funding:

National Natural Science Foundation of China;

Keywords:

Multimodal Large Language Model; Future Reasoning; HUMAN BRAIN;

DOI:

10.1007/978-3-031-73235-5_24

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Humans can foresee the future based on present observations, a skill we term foresight minds. However, this capability remains under-explored within existing MLLMs, hindering their capacity to understand the intentions behind subjects. To address this, we integrate future modeling into MLLMs. By utilizing the trajectory, a highly structured representation, as a learning objective, we aim to equip the model to understand spatiotemporal dynamics. Inspired by the learning paradigm of LLMs, we first propose Foresight Pre-Training (FPT), which jointly learns various tasks centered on trajectories, enabling MLLMs to predict entire trajectories from a given initial observation. Then, we propose Foresight Instruction-Tuning (FIT), which requires MLLMs to reason about potential future events based on predicted trajectories. Aided by FPT and FIT, we build a unified MLLM named Merlin that supports complex future reasoning. Experiments show Merlin's foresight minds with impressive performance on both future reasoning and visual comprehension tasks. Project page: https://ahnsun.github.io/merlin.


Pages: 425 - 443

Page count: 19
