Merlin: Empowering Multimodal LLMs with Foresight Minds

Times Cited: 0
Authors
Yu, En [1 ]
Zhao, Liang [2 ]
Wei, Yana [3 ]
Yang, Jinrong [3 ]
Wu, Dongming [4 ]
Kong, Lingyu [5 ]
Wei, Haoran [2 ]
Wang, Tiancai [2 ]
Ge, Zheng [2 ]
Zhang, Xiangyu [2 ]
Tao, Wenbing [1 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[2] MEGVII Technol, Beijing, Peoples R China
[3] ShanghaiTech Univ, Shanghai, Peoples R China
[4] Beijing Inst Technol, Beijing, Peoples R China
[5] Univ Chinese Acad Sci, Beijing, Peoples R China
Source
Funding
National Natural Science Foundation of China
Keywords
Multimodal Large Language Model; Future Reasoning; HUMAN BRAIN
DOI
10.1007/978-3-031-73235-5_24
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Humans can foresee the future based on present observations, a skill we term foresight minds. However, this capability remains under-explored in existing MLLMs, hindering their capacity to understand the intentions behind subjects. To address this, we integrate future modeling into MLLMs. Using the trajectory, a highly structured representation, as a learning objective, we aim to equip the model to understand spatiotemporal dynamics. Inspired by the learning paradigm of LLMs, we first propose Foresight Pre-Training (FPT), which jointly learns various tasks centered on trajectories, enabling MLLMs to predict entire trajectories from a given initial observation. We then propose Foresight Instruction-Tuning (FIT), which requires MLLMs to reason about potential future events based on predicted trajectories. Aided by FPT and FIT, we build a unified MLLM named Merlin that supports complex future reasoning. Experiments demonstrate Merlin's foresight minds, with impressive performance on both future reasoning and visual comprehension tasks. Project page: https://ahnsun.github.io/merlin.
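To make the trajectory-as-learning-objective idea concrete, below is a minimal Python sketch of how an object trajectory might be serialized into a text target so a causal MLLM can learn it with an ordinary next-token loss, in the spirit of FPT. All names here (Trajectory, serialize, build_fpt_sample) and the <traj> tag format are illustrative assumptions, not the paper's released code.

```python
# Illustrative sketch only: serializing a trajectory into a text target
# so an autoregressive MLLM can learn it with a next-token objective.
# All names and the <traj> tag format are hypothetical; Merlin's actual
# implementation may differ.

from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized to [0, 1]

@dataclass
class Trajectory:
    subject: str      # natural-language referent, e.g. "person in red"
    boxes: List[Box]  # one bounding box per frame

def serialize(traj: Trajectory, precision: int = 3) -> str:
    """Render a trajectory as a flat, token-friendly string.

    A structured textual target lets an autoregressive model emit
    future positions frame by frame, exactly like ordinary text.
    """
    frames = ";".join(
        "[" + ",".join(f"{c:.{precision}f}" for c in box) + "]"
        for box in traj.boxes
    )
    return f"<traj> {traj.subject}: {frames} </traj>"

def build_fpt_sample(observed: Trajectory, full: Trajectory) -> dict:
    """Pair an initial observation with the whole trajectory as the label,
    mirroring FPT's goal of predicting an entire trajectory from the first
    observation. In the real model the prompt would carry visual features
    rather than serialized text.
    """
    return {
        "prompt": (
            f"Given the initial observation {serialize(observed)}, "
            f"predict the full trajectory of {observed.subject}."
        ),
        "target": serialize(full),
    }

if __name__ == "__main__":
    first = Trajectory("person in red", [(0.10, 0.20, 0.30, 0.50)])
    whole = Trajectory(
        "person in red",
        [(0.10, 0.20, 0.30, 0.50),
         (0.15, 0.21, 0.35, 0.51),
         (0.20, 0.22, 0.40, 0.52)],
    )
    sample = build_fpt_sample(first, whole)
    print(sample["prompt"])
    print(sample["target"])
```

Under this framing, FIT would then condition further question answering on the predicted trajectory string, so future reasoning stays within the same text-generation interface.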
Pages: 425-443
Page Count: 19
Related Papers
(50 entries in total)
  • [31] Empowering Next-Generation Health Professionals with Futures Thinking and Strategic Foresight Skills
    Rogayan Jr, Danilo V.
    ANNALS OF BIOMEDICAL ENGINEERING, 2024, 52 (09) : 2309 - 2310
  • [32] Refllective Minds, Brighter Futures: Empowering Critical Refllection with a Guided Instructional Model
    James, Trixie
    Griffin, Hayley
    Johnston, Katrina S.
    Armstrong, Frank
    JOURNAL OF UNIVERSITY TEACHING AND LEARNING PRACTICE, 2023, 20 (06)
  • [33] Empowering minds: The role of disciplinary literacies in English-medium internationalised universities
    Dafouz, Emma
    LANGUAGE TEACHING, 2025,
  • [34] Empowering Users with ChatGPT and Similar Large Language Models (LLMs): Everyday Information Needs, Uses, and Gratification
    Ju, Boryung
    Stewart, J. Brenton
    PROCEEDINGS OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 2024, 61 (01) : 172 - 182
  • [35] Empowering Robots with Multimodal Language Models for Task Planning with Interaction
    Chung, Tong Lee
    Pang, Jianxin
    Cheng, Jun
    2024 IEEE 14TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, ISCSLP 2024, 2024, : 358 - 362
  • [36] Empowering few-shot learning: a multimodal optimization framework
    Enamoto, Liriam
    Rocha Filho, Geraldo Pereira
    Weigang, Li
    NEURAL COMPUTING AND APPLICATIONS, 2025, 37 (05) : 3539 - 3560
  • [38] Empowering First Responders through Automated Multimodal Content Moderation
    Gupta, Divam
    Sen, Indira
    Sachdeva, Niharika
    Kumaraguru, Ponnurangam
    Buduru, Arun Balaji
    2018 IEEE INTERNATIONAL CONFERENCE ON COGNITIVE COMPUTING (ICCC), 2018, : 1 - 8
  • [39] Computing Architecture for Large-Language Models (LLMs) and Large Multimodal Models (LMMs)
    Liang, Bor-Sung
    PROCEEDINGS OF THE 2024 INTERNATIONAL SYMPOSIUM ON PHYSICAL DESIGN, ISPD 2024, 2024, : 233 - 234
  • [40] HiA: Towards Chinese Multimodal LLMs for Comparative High-Resolution Joint Diagnosis
    Ding, Xinpeng
    Chu, Yongqiang
    Pi, Renjie
    Wang, Hualiang
    Li, Xiaomeng
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2024, PT XII, 2024, 15012 : 575 - 586