MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation

Citations: 0
Authors
Bellagente, Marco [4 ]
Brack, Manuel [2 ,3 ]
Teufel, Hannah [1 ]
Friedrich, Felix [3 ,6 ]
Deiseroth, Bjoern [1 ,3 ,6 ]
Eichenberg, Constantin [1 ]
Dai, Andrew [1 ]
Baldock, Robert J. N. [1 ]
Nanda, Souradeep [5 ]
Oostermeijer, Koen [1 ]
Cruz-Salinas, Andres Felipe [1 ]
Schramowski, Patrick [2 ,3 ,6 ,8 ]
Kersting, Kristian [2 ,3 ,6 ,7 ]
Weinbach, Samuel [1 ]
Affiliations
[1] Aleph Alpha, Heidelberg, Germany
[2] German Res Ctr Artificial Intelligence DFKI, Kaiserslautern, Germany
[3] Tech Univ Darmstadt, Comp Sci Dept, Darmstadt, Germany
[4] Stability AI, London, England
[5] Univ Texas Dallas, Dallas, TX USA
[6] Hessian AI, Darmstadt, Germany
[7] Tech Univ Darmstadt, Ctr Cognit Sci, Darmstadt, Germany
[8] LAION, Hamburg, Germany
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023
Funding
EU Horizon 2020
Keywords
(none listed)
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The recent popularity of text-to-image diffusion models (DMs) can largely be attributed to the intuitive interface they provide to users. The intended generation can be expressed in natural language, with the model producing faithful interpretations of text prompts. However, expressing complex or nuanced ideas in text alone can be difficult. To ease image generation, we propose MULTIFUSION, which allows users to express complex and nuanced concepts through arbitrarily interleaved inputs in multiple modalities and languages. MULTIFUSION leverages pre-trained models and aligns them for integration into a cohesive system, thereby avoiding the need for extensive training from scratch. Our experimental results demonstrate the efficient transfer of capabilities from the individual modules to the downstream model. Specifically, the fusion of all independent components allows the image generation module to utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.
Pages: 20