Merlin: Empowering Multimodal LLMs with Foresight Minds

Cited by: 0
Authors
Yu, En [1]
Zhao, Liang [2]
Wei, Yana [3]
Yang, Jinrong [3]
Wu, Dongming [4]
Kong, Lingyu [5]
Wei, Haoran [2]
Wang, Tiancai [2]
Ge, Zheng [2]
Zhang, Xiangyu [2]
Tao, Wenbing [1]
Affiliations
[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[2] MEGVII Technol, Beijing, Peoples R China
[3] ShanghaiTech Univ, Shanghai, Peoples R China
[4] Beijing Inst Technol, Beijing, Peoples R China
[5] Univ Chinese Acad Sci, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Multimodal Large Language Model; Future Reasoning; Human Brain
DOI
10.1007/978-3-031-73235-5_24
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Humans can foresee the future based on present observations, a skill we term foresight minds. However, this capability remains under-explored in existing MLLMs, hindering their capacity to understand the intentions behind subjects. To address this, we integrate future modeling into MLLMs. By using the trajectory, a highly structured representation, as a learning objective, we aim to equip the model to understand spatiotemporal dynamics. Inspired by the learning paradigm of LLMs, we first propose Foresight Pre-Training (FPT), which jointly learns various tasks centered on trajectories, enabling MLLMs to predict entire trajectories from a given initial observation. Then, we propose Foresight Instruction-Tuning (FIT), which requires MLLMs to reason about potential future events based on predicted trajectories. Aided by FPT and FIT, we build a unified MLLM named Merlin that supports complex future reasoning. Experiments demonstrate Merlin's foresight minds, with impressive performance on both future reasoning and visual comprehension tasks. Project page: https://ahnsun.github.io/merlin.
Pages: 425-443
Page count: 19