Multimodal Foundation Models: From Specialists to General-Purpose Assistants

Cited by: 12
Authors
Li, Chunyuan [1]
Gan, Zhe [1]
Yang, Zhengyuan [1]
Yang, Jianwei [1]
Li, Linjie [1]
Wang, Lijuan [1]
Gao, Jianfeng [1]
Affiliations
[1] Microsoft Corp, Redmond, WA 98052 USA
Keywords
SEGMENTATION; NETWORKS
DOI
10.1561/0600000110
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
This monograph presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants. The research landscape encompasses five core topics, categorized into two classes. (i) We start with a survey of well-established research areas: multimodal foundation models pre-trained for specific purposes, including two topics: methods of learning vision backbones for visual understanding, and text-to-image generation. (ii) Then, we present recent advances in exploratory, open research areas: multimodal foundation models that aim to play the role of general-purpose assistants, including three topics: unified vision models inspired by large language models (LLMs), end-to-end training of multimodal LLMs, and chaining multimodal tools with LLMs. The target audiences of the monograph are researchers, graduate students, and professionals in computer vision and vision-language multimodal communities who are eager to learn the basics and recent advances in multimodal foundation models.
Pages: 1 / 214
Page count: 214