Merlin: Empowering Multimodal LLMs with Foresight Minds

Cited by: 0

Authors:

Yu, En [1]

Zhao, Liang [2]

Wei, Yana [3]

Yang, Jinrong [3]

Wu, Dongming [4]

Kong, Lingyu [5]

Wei, Haoran [2]

Wang, Tiancai [2]

Ge, Zheng [2]

Zhang, Xiangyu [2]

Tao, Wenbing [1]

Affiliations:

[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China

[2] MEGVII Technol, Beijing, Peoples R China

[3] ShanghaiTech Univ, Shanghai, Peoples R China

[4] Beijing Inst Technol, Beijing, Peoples R China

[5] Univ Chinese Acad Sci, Beijing, Peoples R China

Source:

COMPUTER VISION - ECCV 2024, PT IV | 2025 / Volume 15062

Funding:

National Natural Science Foundation of China;

Keywords:

Multimodal Large Language Model; Future Reasoning; HUMAN BRAIN;

DOI:

10.1007/978-3-031-73235-5_24

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Humans can foresee the future based on present observations, a skill we term foresight minds. However, this capability remains under-explored in existing MLLMs, which hinders their capacity to understand the intentions behind subjects. To address this, we integrate future modeling into MLLMs. By using the trajectory, a highly structured representation, as a learning objective, we aim to equip the model to understand spatiotemporal dynamics. Inspired by the learning paradigm of LLMs, we first propose Foresight Pre-Training (FPT), which jointly learns various tasks centered on trajectories, enabling MLLMs to predict entire trajectories from a given initial observation. Then, we propose Foresight Instruction-Tuning (FIT), which requires MLLMs to reason about potential future events based on predicted trajectories. Aided by FPT and FIT, we build a unified MLLM named Merlin that supports complex future reasoning. Experiments demonstrate Merlin's foresight minds, with impressive performance on both future reasoning and visual comprehension tasks. Project page: https://ahnsun.github.io/merlin.


Pages: 425-443

Page count: 19
