MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation

Cited by: 0
Authors
Bellagente, Marco [4 ]
Brack, Manuel [2 ,3 ]
Teufel, Hannah [1 ]
Friedrich, Felix [3 ,6 ]
Deiseroth, Bjoern [1 ,3 ,6 ]
Eichenberg, Constantin [1 ]
Dai, Andrew [1 ]
Baldock, Robert J. N. [1 ]
Nanda, Souradeep [5 ]
Oostermeijer, Koen [1 ]
Cruz-Salinas, Andres Felipe [1 ]
Schramowski, Patrick [2 ,3 ,6 ,8 ]
Kersting, Kristian [2 ,3 ,6 ,7 ]
Weinbach, Samuel [1 ]
Affiliations
[1] Aleph Alpha, Heidelberg, Germany
[2] German Res Ctr Artificial Intelligence DFKI, Kaiserslautern, Germany
[3] Tech Univ Darmstadt, Comp Sci Dept, Darmstadt, Germany
[4] Stabil AI, London, England
[5] Univ Texas Dallas, Dallas, TX USA
[6] Hessian AI, Darmstadt, Germany
[7] Tech Univ Darmstadt, Ctr Cognit Sci, Darmstadt, Germany
[8] LAION, Hamburg, Germany
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023
Funding
EU Horizon 2020
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The recent popularity of text-to-image diffusion models (DMs) can largely be attributed to the intuitive interface they provide to users: the intended generation can be expressed in natural language, and the model produces faithful interpretations of the prompt. However, expressing complex or nuanced ideas in text alone can be difficult. To ease image generation, we propose MULTIFUSION, which allows users to express complex and nuanced concepts with arbitrarily interleaved inputs in multiple modalities and languages. MULTIFUSION leverages pre-trained models and aligns them for integration into a cohesive system, thereby avoiding the need for extensive training from scratch. Our experimental results demonstrate the efficient transfer of capabilities from the individual modules to the downstream model. Specifically, the fusion of all independent components allows the image generation module to utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.
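The core idea of interleaved multimodal prompting can be illustrated with a minimal sketch: segments of text and images, in any order and any language, are each mapped into a shared embedding space and concatenated into a single conditioning sequence for the image generator. The encoders below are hypothetical toy stand-ins (the actual system fuses large pre-trained language and vision modules); only the interleaving logic is the point.

```python
import numpy as np

EMBED_DIM = 8  # toy dimension; a real system uses a large LM embedding space


def embed_text(tokens):
    # Hypothetical text encoder: one deterministic toy embedding per token.
    return np.array([[(sum(map(ord, t)) * (i + 1)) % 97 / 97.0
                      for i in range(EMBED_DIM)] for t in tokens])


def embed_image(image):
    # Hypothetical image encoder: maps an (H, W) array to 4 "patch" embeddings.
    patches = image.reshape(4, -1)               # split into 4 toy patches
    return patches.mean(axis=1, keepdims=True) * np.ones((4, EMBED_DIM))


def encode_interleaved(prompt):
    # prompt: ordered list of ("text", [tokens]) or ("image", array) segments.
    parts = [embed_text(p) if kind == "text" else embed_image(p)
             for kind, p in prompt]
    # One conditioning sequence; segment order is preserved exactly.
    return np.concatenate(parts, axis=0)


prompt = [
    ("text", ["a", "cat", "wearing"]),
    ("image", np.ones((4, 4))),                  # reference image segment
    ("text", ["auf", "einem", "Stuhl"]),         # mixed-language continuation
]
cond = encode_interleaved(prompt)
print(cond.shape)  # (10, 8): 3 tokens + 4 patches + 3 tokens
```

The resulting sequence would then condition the diffusion model, e.g. via cross-attention, exactly as a purely textual prompt embedding would, which is what lets a monomodally trained generator consume multimodal input.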
Pages: 20
Related Papers
50 records in total; 10 shown
  • [1] Multi-lingual and multi-modal speech processing and applications
    Ivanecky, J
    Fischer, J
    Mast, M
    Kunzmann, S
    Ross, T
    Fischer, V
    PATTERN RECOGNITION, PROCEEDINGS, 2005, 3663 : 149 - 159
  • [2] Large-scale Multi-modal Pre-trained Models: A Comprehensive Survey
    Wang, Xiao
    Chen, Guangyao
    Qian, Guangwu
    Gao, Pengcheng
    Wei, Xiao-Yong
    Wang, Yaowei
    Tian, Yonghong
    Gao, Wen
    Machine Intelligence Research, 2023, 20 : 447 - 482
  • [3] Large-scale Multi-modal Pre-trained Models: A Comprehensive Survey
    Wang, Xiao
    Chen, Guangyao
    Qian, Guangwu
    Gao, Pengcheng
    Wei, Xiao-Yong
    Wang, Yaowei
    Tian, Yonghong
    Gao, Wen
    MACHINE INTELLIGENCE RESEARCH, 2023, 20 (04) : 447 - 482
  • [4] Finding and Editing Multi-Modal Neurons in Pre-Trained Transformers
    Pan, Haowen
    Cao, Yixin
    Wang, Xiaozhi
    Yang, Xun
    Wang, Meng
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 1012 - 1037
  • [5] Large Scale Multi-Lingual Multi-Modal Summarization Dataset
    Verma, Yash
    Jangra, Anubhav
    Kumar, Raghvendra
    Saha, Sriparna
    17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023, : 3620 - 3632
  • [6] PMMN: Pre-trained multi-Modal network for scene text recognition
    Zhang, Yu
    Fu, Zilong
    Huang, Fuyu
    Liu, Yizhi
    PATTERN RECOGNITION LETTERS, 2021, 151 : 103 - 111
  • [7] Probing Multi-modal Machine Translation with Pre-trained Language Model
    Kong, Yawei
    Fan, Kai
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 3689 - 3699
  • [8] Fast multi-modal reuse: Co-occurrence pre-trained deep learning models
    Iyer, Vasanth
    Aved, Alexander
    Howlett, Todd B.
    Carlo, Jeffrey T.
    Mehmood, Asif
    Pissinou, Niki
    Iyengar, S. S.
    PROCEEDINGS OF SPIE - THE INTERNATIONAL SOCIETY FOR OPTICAL ENGINEERING, 2019, 10996
  • [9] Difference between Multi-modal vs. Text Pre-trained Models in Embedding Text
    Sun, Y.
    Cheng, X.
    Song, R.
    Che, W.
    Lu, Z.
    Wen, J.
    Beijing Daxue Xuebao (Ziran Kexue Ban)/Acta Scientiarum Naturalium Universitatis Pekinensis, 2023, 59 (01): 48 - 56
  • [10] Fast Multi-Modal Reuse: Co-Occurrence Pre-Trained Deep Learning Models
    Iyer, Vasanth
    Aved, Alexander
    Howlett, Todd B.
    Carlo, Jeffrey T.
    Mehmood, Asif
    Pissinou, Niki
    Iyengar, S. S.
    REAL-TIME IMAGE PROCESSING AND DEEP LEARNING 2019, 2019, 10996