Merlin: Empowering Multimodal LLMs with Foresight Minds

Cited by: 0
Authors
Yu, En [1]
Zhao, Liang [2]
Wei, Yana [3]
Yang, Jinrong [3]
Wu, Dongming [4]
Kong, Lingyu [5]
Wei, Haoran [2]
Wang, Tiancai [2]
Ge, Zheng [2]
Zhang, Xiangyu [2]
Tao, Wenbing [1]
Affiliations
[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[2] MEGVII Technol, Beijing, Peoples R China
[3] ShanghaiTech Univ, Shanghai, Peoples R China
[4] Beijing Inst Technol, Beijing, Peoples R China
[5] Univ Chinese Acad Sci, Beijing, Peoples R China
Source
Funding
National Natural Science Foundation of China;
Keywords
Multimodal Large Language Model; Future Reasoning; Human Brain;
DOI
10.1007/978-3-031-73235-5_24
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Humans can foresee the future based on present observations, a skill we term foresight minds. However, this capability remains under-explored in existing MLLMs, which hinders their capacity to understand the intentions behind subjects. To address this, we integrate future modeling into MLLMs. By using the trajectory, a highly structured representation, as a learning objective, we aim to equip the model to understand spatiotemporal dynamics. Inspired by the learning paradigm of LLMs, we first propose Foresight Pre-Training (FPT), which jointly learns various tasks centered on trajectories, enabling MLLMs to predict entire trajectories from a given initial observation. Then, we propose Foresight Instruction-Tuning (FIT), which requires MLLMs to reason about potential future events based on predicted trajectories. Aided by FPT and FIT, we build a unified MLLM named Merlin that supports complex future reasoning. Experiments show that Merlin exhibits foresight minds, with impressive performance on both future reasoning and visual comprehension tasks. Project page: https://ahnsun.github.io/merlin.
Pages: 425 - 443
Page count: 19
Related papers
50 in total
  • [1] What is the limitation of multimodal LLMs? A deeper look into multimodal LLMs through prompt probing
    Qi, Shuhan
    Cao, Zhengying
    Rao, Jun
    Wang, Lei
    Xiao, Jing
    Wang, Xuan
    INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (06)
  • [2] Education: Empowering Minds Selectively?
    Puttaraju, Sowmya
    Shastri, Shailaja
    INDIAN JOURNAL OF PSYCHOLOGICAL SCIENCE, 2014, 4 (02): 94 - 100
  • [3] Multimodal AI & LLMs for Peacekeeping and Emergency Response
    Jaimes, Alejandro
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3 - 4
  • [4] EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs
    Zhao, Xiangyu
    Liu, Bo
    Liu, Qijiong
    Shi, Guangyuan
    Wu, Xiao-Ming
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 1351 - 1370
  • [5] ELLA: Empowering LLMs for Interpretable, Accurate and Informative Legal Advice
    Hu, Yutong
    Luo, Kangcheng
    Feng, Yansong
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 3: SYSTEM DEMONSTRATIONS, 2024, : 374 - 387
  • [6] CURRICULUM REVOLUTION - REFLECTIVE MINDS AND EMPOWERING RELATIONSHIPS
    MIDDLEMISS, MA
    VANNESTEKENNY, J
    NURSING & HEALTH CARE, 1994, 15 (07): 350 - 353
  • [7] Stage Wizard: Enhancing Tangible Storytelling with Multimodal LLMs
    Han, Kuntong
    Tang, Keyang
    Wang, Meng
    PROCEEDINGS OF THE NINETEENTH INTERNATIONAL CONFERENCE ON TANGIBLE, EMBEDDED AND EMBODIED INTERACTION, TEI 2025, 2025
  • [8] Empowering Education with LLMs - The Next-Gen Interface and Content Generation
    Moore, Steven
    Tong, Richard
    Singh, Anjali
    Liu, Zitao
    Hu, Xiangen
    Lu, Yu
    Liang, Joleen
    Cao, Chen
    Khosravi, Hassan
    Denny, Paul
    Brooks, Chris
    Stamper, John
    ARTIFICIAL INTELLIGENCE IN EDUCATION. POSTERS AND LATE BREAKING RESULTS, WORKSHOPS AND TUTORIALS, INDUSTRY AND INNOVATION TRACKS, PRACTITIONERS, DOCTORAL CONSORTIUM AND BLUE SKY, AIED 2023, 2023, 1831 : 32 - 37
  • [9] The Multimodal Evocation of Minds in Audio Drama
    Bernaerts, Lars
    COUNTERTEXT-A JOURNAL FOR THE STUDY OF THE POST-LITERARY, 2019, 5 (03): 312 - 331
  • [10] Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
    Tong, Shengbang
    Liu, Zhuang
    Zhai, Yuexiang
    Ma, Yi
    LeCun, Yann
    Xie, Saining
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 9568 - 9578