MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation

Cited: 0
Authors
Bellagente, Marco [4 ]
Brack, Manuel [2 ,3 ]
Teufel, Hannah [1 ]
Friedrich, Felix [3 ,6 ]
Deiseroth, Bjoern [1 ,3 ,6 ]
Eichenberg, Constantin [1 ]
Dai, Andrew [1 ]
Baldock, Robert J. N. [1 ]
Nanda, Souradeep [5 ]
Oostermeijer, Koen [1 ]
Cruz-Salinas, Andres Felipe [1 ]
Schramowski, Patrick [2 ,3 ,6 ,8 ]
Kersting, Kristian [2 ,3 ,6 ,7 ]
Weinbach, Samuel [1 ]
Affiliations
[1] Aleph Alpha, Heidelberg, Germany
[2] German Res Ctr Artificial Intelligence DFKI, Kaiserslautern, Germany
[3] Tech Univ Darmstadt, Comp Sci Dept, Darmstadt, Germany
[4] Stabil AI, London, England
[5] Univ Texas Dallas, Dallas, TX USA
[6] Hessian AI, Darmstadt, Germany
[7] Tech Univ Darmstadt, Ctr Cognit Sci, Darmstadt, Germany
[8] LAION, Hamburg, Germany
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023
Funding
EU Horizon 2020;
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The recent popularity of text-to-image diffusion models (DMs) can largely be attributed to the intuitive interface they provide to users. The intended generation can be expressed in natural language, with the model producing faithful interpretations of text prompts. However, expressing complex or nuanced ideas in text alone can be difficult. To ease image generation, we propose MULTIFUSION, which allows users to express complex and nuanced concepts with arbitrarily interleaved inputs of multiple modalities and languages. MULTIFUSION leverages pre-trained models and aligns them for integration into a cohesive system, thereby avoiding the need for extensive training from scratch. Our experimental results demonstrate the efficient transfer of capabilities from the individual modules to the downstream model. Specifically, the fusion of all independent components allows the image generation module to utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.
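The abstract describes the core mechanism only at a high level: pre-trained modules are aligned so that the image generation module is conditioned on an embedding sequence produced from interleaved multilingual, multimodal inputs. The toy sketch below illustrates that idea under stated assumptions; it is not the authors' implementation, and all names, dimensions, and the patch-projection adapter (InterleavedEncoder, CrossAttnDenoiser, patch_dim) are hypothetical stand-ins.

# Minimal conceptual sketch (assumed structure, not the paper's code): a frozen
# multimodal, multilingual encoder produces one conditioning sequence that a
# cross-attention denoiser consumes, even though the denoiser itself never saw
# multimodal inputs during training.
import torch
import torch.nn as nn


class InterleavedEncoder(nn.Module):
    """Stand-in for a frozen multilingual LM with an image-prefix adapter.

    Text tokens and image patches are mapped into one shared embedding
    sequence, so prompts may interleave both modalities arbitrarily.
    """

    def __init__(self, vocab_size=1000, dim=64, patch_dim=3 * 16 * 16):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)  # frozen LM embeddings
        self.image_adapter = nn.Linear(patch_dim, dim)    # small trained adapter

    def forward(self, segments):
        """segments: list of ('text', LongTensor[n]) or ('image', Tensor[m, patch_dim])."""
        parts = []
        for kind, data in segments:
            if kind == "text":
                parts.append(self.token_embed(data))
            else:  # image patches projected into the LM's embedding space
                parts.append(self.image_adapter(data))
        return torch.cat(parts, dim=0)  # [seq_len, dim] conditioning sequence


class CrossAttnDenoiser(nn.Module):
    """Stand-in for the diffusion denoiser: it sees the prompt only as an
    embedding sequence via cross-attention, so a new encoder can drive it
    as long as the embedding spaces are aligned."""

    def __init__(self, dim=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_latent, cond):
        # noisy_latent: [1, latent_len, dim], cond: [1, cond_len, dim]
        attended, _ = self.attn(noisy_latent, cond, cond)
        return self.out(attended)  # toy noise prediction


if __name__ == "__main__":
    enc, denoiser = InterleavedEncoder(), CrossAttnDenoiser()
    prompt = [
        ("text", torch.randint(0, 1000, (5,))),   # text tokens in any language
        ("image", torch.randn(4, 3 * 16 * 16)),   # reference-image patches
        ("text", torch.randint(0, 1000, (3,))),   # trailing text tokens
    ]
    cond = enc(prompt).unsqueeze(0)               # [1, 12, 64]
    noise_pred = denoiser(torch.randn(1, 16, 64), cond)
    print(noise_pred.shape)                       # torch.Size([1, 16, 64])

In this toy setup only the small image adapter would require training; the point is that the denoiser is conditioned purely through an embedding sequence, so capabilities of the fused encoder (other languages, image inputs) transfer to generation without retraining the generator from scratch.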
Pages: 20
Related Papers
50 records in total
  • [41] MtArtGPT: A Multi-Task Art Generation System With Pre-Trained Transformer
    Jin, Cong
    Zhu, Ruolin
    Zhu, Zixing
    Yang, Lu
    Yang, Min
    Luo, Jiebo
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (08) : 6901 - 6912
  • [42] Fusing EO and LiDAR for SAR Image Translation with Multi-Modal Generative Adversarial Networks
    Zhu, Jiang
    Qing, Yuanyuan
    Lin, Zhiping
    Wen, Kilian
    2024 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, ISCAS 2024, 2024,
  • [43] TraVL: Transferring Pre-trained Visual-Linguistic Models for Cross-Lingual Image Captioning
    Zhang, Zhebin
    Lu, Peng
    Jiang, Dawei
    Chen, Gang
    WEB AND BIG DATA, PT II, APWEB-WAIM 2022, 2023, 13422 : 341 - 355
  • [44] Cross-Modal Retrieval Algorithm for Image and Text Based on Pre-Trained Models and Encoders
    Chen X.
    Peng J.
    Zhang P.
    Luo Z.
    Ou Z.
    Beijing Youdian Daxue Xuebao/Journal of Beijing University of Posts and Telecommunications, 2023, 46 (05): : 112 - 117
  • [45] Hybrid multi-document summarization using pre-trained language models
    Ghadimi, Alireza
    Beigy, Hamid
    EXPERT SYSTEMS WITH APPLICATIONS, 2022, 192
  • [46] TED Talk Teaser Generation with Pre-Trained Models
    Vico, Gianluca
    Niehues, Jan
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8067 - 8071
  • [47] MaxFusion: Plug&Play Multi-modal Generation in Text-to-Image Diffusion Models
    Nair, Nithin Gopalakrishnan
    Valanarasu, Jeya Maria Jose
    Patel, Vishal M.
    COMPUTER VISION-ECCV 2024, PT XXXVIII, 2025, 15096 : 93 - 110
  • [48] Pre-Trained Language Models for Text Generation: A Survey
    Li, Junyi
    Tang, Tianyi
    Zhao, Wayne Xin
    Nie, Jian-Yun
    Wen, Ji-Rong
    ACM COMPUTING SURVEYS, 2024, 56 (09)
  • [49] Leveraging pre-trained language models for code generation
    Soliman, Ahmed
    Shaheen, Samir
    Hadhoud, Mayada
    COMPLEX & INTELLIGENT SYSTEMS, 2024, 10 (03) : 3955 - 3980
  • [50] Multi-modal lung ultrasound image classification by fusing image-based features and probe information
    Okolo, Gabriel Iluebe
    Katsigiannis, Stamos
    Ramzan, Naeem
    2022 IEEE 22ND INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOENGINEERING (BIBE 2022), 2022, : 45 - 50