Merlin: Empowering Multimodal LLMs with Foresight Minds

Cited by: 0
Authors
Yu, En [1]
Zhao, Liang [2]
Wei, Yana [3]
Yang, Jinrong [3]
Wu, Dongming [4]
Kong, Lingyu [5]
Wei, Haoran [2]
Wang, Tiancai [2]
Ge, Zheng [2]
Zhang, Xiangyu [2]
Tao, Wenbing [1]
Affiliations
[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[2] MEGVII Technol, Beijing, Peoples R China
[3] ShanghaiTech Univ, Shanghai, Peoples R China
[4] Beijing Inst Technol, Beijing, Peoples R China
[5] Univ Chinese Acad Sci, Beijing, Peoples R China
Source
Funding
National Natural Science Foundation of China;
Keywords
Multimodal Large Language Model; Future Reasoning; HUMAN BRAIN;
DOI
10.1007/978-3-031-73235-5_24
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Humans can foresee the future based on present observations, a skill we term foresight minds. However, this capability remains under-explored in existing MLLMs, hindering their capacity to understand the intentions behind subjects' behavior. To address this, we integrate future modeling into MLLMs. Using the trajectory, a highly structured representation, as a learning objective, we aim to equip the model to understand spatiotemporal dynamics. Inspired by the learning paradigm of LLMs, we first propose Foresight Pre-Training (FPT), which jointly learns various tasks centered on trajectories, enabling MLLMs to predict entire trajectories from a given initial observation. We then propose Foresight Instruction-Tuning (FIT), which requires MLLMs to reason about potential future events based on predicted trajectories. Aided by FPT and FIT, we build a unified MLLM named Merlin that supports complex future reasoning. Experiments demonstrate Merlin's foresight minds, with impressive performance on both future reasoning and visual comprehension tasks. Project page: https://ahnsun.github.io/merlin.
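To make the two-stage recipe concrete, here is a minimal Python sketch of how FPT and FIT training samples might be organized, inferred only from the abstract. All names (FPTSample, FITSample, serialize_trajectory, the normalized box format, and the example prompts) are hypothetical illustrations, not the authors' actual data schema.

```python
# Hypothetical sketch of the two-stage Merlin training data, as described
# in the abstract: FPT pairs an initial observation with an entire
# trajectory to predict; FIT asks the model to reason about future events
# conditioned on a predicted trajectory. Names and formats are assumptions.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)

def serialize_trajectory(traj: List[Box]) -> str:
    """Render a per-frame box sequence as structured text an LLM can emit."""
    return " ".join(
        f"[{x1:.2f},{y1:.2f},{x2:.2f},{y2:.2f}]" for x1, y1, x2, y2 in traj
    )

@dataclass
class FPTSample:
    """Stage 1 (FPT): predict the entire trajectory from an initial observation."""
    image_frames: List[str]        # paths to the observed frame(s)
    query: str                     # e.g. "Track the person in the red coat."
    target_trajectory: List[Box]   # supervision: boxes over future frames

    def target_text(self) -> str:
        return serialize_trajectory(self.target_trajectory)

@dataclass
class FITSample:
    """Stage 2 (FIT): reason about future events given a predicted trajectory."""
    image_frames: List[str]
    predicted_trajectory: List[Box]
    question: str                  # e.g. "What is this person about to do?"
    answer: str                    # free-form future-reasoning target

    def prompt(self) -> str:
        return (
            f"Trajectory: {serialize_trajectory(self.predicted_trajectory)}\n"
            f"Question: {self.question}"
        )

if __name__ == "__main__":
    fpt = FPTSample(
        image_frames=["frame_000.jpg"],
        query="Track the cyclist.",
        target_trajectory=[(0.10, 0.40, 0.22, 0.80), (0.18, 0.38, 0.30, 0.78)],
    )
    print(fpt.target_text())  # -> "[0.10,0.40,0.22,0.80] [0.18,0.38,0.30,0.78]"
```

Serializing boxes as plain text keeps the trajectory inside the LLM's token stream, which is presumably what lets a single unified model both predict trajectories (stage 1) and reason over them (stage 2).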
Pages: 425-443
Page count: 19
Related Papers
50 records in total
  • [41] Eyes Closed, Safety on: Protecting Multimodal LLMs via Image-to-Text Transformation
    Gou, Yunhao
    Chen, Kai
    Liu, Zhili
    Hong, Lanqing
    Xu, Hang
    Li, Zhenguo
    Yeung, Dit-Yan
    Kwok, James T.
    Zhang, Yu
    COMPUTER VISION - ECCV 2024, PT XVII, 2025, 15075 : 388 - 404
  • [42] Empowering Minds: A Comprehensive Study of ECT Treatment in a Reference Mental Health Center in Portugal
    Barbosa Pinto, M.
    Viseu, M. T. D.
    Frias Goncalves, P.
    Gomes Pereira, E.
    EUROPEAN PSYCHIATRY, 2024, 67 : S240 - S241
  • [43] MIT OpenCourseWare: Unlocking knowledge, empowering minds (vol 329, pg 525, 2010)
    d'Oliveira, C.
    SCIENCE, 2010, 329 (5993) : 750 - 750
  • [44] Empowering LLMs by hybrid retrieval-augmented generation for domain-centric Q&A in smart manufacturing
    Wan, Yuwei
    Chen, Zheyuan
    Liu, Ying
    Chen, Chong
    Packianather, Michael
    ADVANCED ENGINEERING INFORMATICS, 2025, 65
  • [45] Adapting Components of the Multimodal Minds in Motion Activity Program into General Practice
    Kyrouac, Greg
    Helm, Susan
    Ala, Thomas
    GERONTOLOGY AND GERIATRIC MEDICINE, 2022, 8
  • [46] MultiCTox: Empowering Accurate Cardiotoxicity Prediction through Adaptive Multimodal Learning
    Feng, Lin
    Fu, Xiangzheng
    Du, Zhenya
    Guo, Yuting
    Zhuo, Linlin
    Yang, Yan
    Cao, Dongsheng
    Yao, Xiaojun
JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2025
  • [47] LimSim++: A Closed-Loop Platform for Deploying Multimodal LLMs in Autonomous Driving
    Fu, Daocheng
    Lei, Wenjie
    Wen, Licheng
    Cai, Pinlong
    Mao, Song
    Dou, Min
    Shi, Botian
    Qiao, Yu
    2024 35TH IEEE INTELLIGENT VEHICLES SYMPOSIUM, IEEE IV 2024, 2024, : 1084 - 1090
  • [48] GENIXER: Empowering Multimodal Large Language Model as a Powerful Data Generator
    Zhao, Henry Hengyuan
    Zhou, Pan
    Shou, Mike Zheng
    COMPUTER VISION - ECCV 2024, PT XXIII, 2025, 15081 : 129 - 147
  • [49] Empowering English as an Additional Language students through digital multimodal composing
    Barnes, Melissa
    Tour, Ekaterina
    LITERACY, 2023, 57 (02) : 106 - 119
  • [50] OmniActions: Predicting Digital Actions in Response to Real-World Multimodal Sensory Inputs with LLMs
    Li, Jiahao Nick
    Xu, Yan
    Grossman, Tovi
    Santosa, Stephanie
    Li, Michelle
PROCEEDINGS OF THE 2024 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, CHI 2024, 2024