MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation

Cited by: 0
Authors
Bellagente, Marco [4]
Brack, Manuel [2,3]
Teufel, Hannah [1]
Friedrich, Felix [3,6]
Deiseroth, Bjoern [1,3,6]
Eichenberg, Constantin [1]
Dai, Andrew [1]
Baldock, Robert J. N. [1]
Nanda, Souradeep [5]
Oostermeijer, Koen [1]
Cruz-Salinas, Andres Felipe [1]
Schramowski, Patrick [2,3,6,8]
Kersting, Kristian [2,3,6,7]
Weinbach, Samuel [1]
Affiliations
[1] Aleph Alpha, Heidelberg, Germany
[2] German Res Ctr Artificial Intelligence DFKI, Kaiserslautern, Germany
[3] Tech Univ Darmstadt, Comp Sci Dept, Darmstadt, Germany
[4] Stabil AI, London, England
[5] Univ Texas Dallas, Dallas, TX USA
[6] Hessian AI, Darmstadt, Germany
[7] Tech Univ Darmstadt, Ctr Cognit Sci, Darmstadt, Germany
[8] LAION, Hamburg, Germany
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023
Funding
European Union Horizon 2020
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The recent popularity of text-to-image diffusion models (DM) can largely be attributed to the intuitive interface they provide to users. The intended generation can be expressed in natural language, with the model producing faithful interpretations of text prompts. However, expressing complex or nuanced ideas in text alone can be difficult. To ease image generation, we propose MULTIFUSION that allows one to express complex and nuanced concepts with arbitrarily interleaved inputs of multiple modalities and languages. MULTIFUSION leverages pre-trained models and aligns them for integration into a cohesive system, thereby avoiding the need for extensive training from scratch. Our experimental results demonstrate the efficient transfer of capabilities from individual modules to the downstream model. Specifically, the fusion of all independent components allows the image generation module to utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.
Pages: 20