MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation

Cited by: 0
Authors
Bellagente, Marco [4 ]
Brack, Manuel [2 ,3 ]
Teufel, Hannah [1 ]
Friedrich, Felix [3 ,6 ]
Deiseroth, Bjoern [1 ,3 ,6 ]
Eichenberg, Constantin [1 ]
Dai, Andrew [1 ]
Baldock, Robert J. N. [1 ]
Nanda, Souradeep [5 ]
Oostermeijer, Koen [1 ]
Cruz-Salinas, Andres Felipe [1 ]
Schramowski, Patrick [2 ,3 ,6 ,8 ]
Kersting, Kristian [2 ,3 ,6 ,7 ]
Weinbach, Samuel [1 ]
Affiliations
[1] Aleph Alpha, Heidelberg, Germany
[2] German Res Ctr Artificial Intelligence DFKI, Kaiserslautern, Germany
[3] Tech Univ Darmstadt, Comp Sci Dept, Darmstadt, Germany
[4] Stabil AI, London, England
[5] Univ Texas Dallas, Dallas, TX USA
[6] Hessian AI, Darmstadt, Germany
[7] Tech Univ Darmstadt, Ctr Cognit Sci, Darmstadt, Germany
[8] LAION, Hamburg, Germany
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023
Funding
EU Horizon 2020
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The recent popularity of text-to-image diffusion models (DMs) can largely be attributed to the intuitive interface they provide to users: the intended generation can be expressed in natural language, and the model produces faithful interpretations of the prompt. However, expressing complex or nuanced ideas in text alone can be difficult. To ease image generation, we propose MULTIFUSION, which allows users to express complex and nuanced concepts with arbitrarily interleaved inputs in multiple modalities and languages. MULTIFUSION leverages pre-trained models and aligns them for integration into a cohesive system, thereby avoiding the need for extensive training from scratch. Our experimental results demonstrate the efficient transfer of capabilities from the individual modules to the downstream model. Specifically, the fusion of all independent components allows the image generation module to utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.
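The core idea of interleaved multimodal prompting can be illustrated with a minimal sketch: segments of text and images, in any order and any language, are each mapped into a shared embedding space and concatenated into a single conditioning sequence for the image generator. The encoders below are hypothetical toy stand-ins (the actual system fuses large pre-trained language and vision modules); only the interleaving logic is the point.

```python
import numpy as np

EMBED_DIM = 8  # toy dimension; a real system uses a large LM embedding space


def embed_text(tokens):
    # Hypothetical text encoder: one deterministic toy embedding per token.
    return np.array([[(sum(map(ord, t)) * (i + 1)) % 97 / 97.0
                      for i in range(EMBED_DIM)] for t in tokens])


def embed_image(image):
    # Hypothetical image encoder: maps an (H, W) array to 4 "patch" embeddings.
    patches = image.reshape(4, -1)               # split into 4 toy patches
    return patches.mean(axis=1, keepdims=True) * np.ones((4, EMBED_DIM))


def encode_interleaved(prompt):
    # prompt: ordered list of ("text", [tokens]) or ("image", array) segments.
    parts = [embed_text(p) if kind == "text" else embed_image(p)
             for kind, p in prompt]
    # One conditioning sequence; segment order is preserved exactly.
    return np.concatenate(parts, axis=0)


prompt = [
    ("text", ["a", "cat", "wearing"]),
    ("image", np.ones((4, 4))),                  # reference image segment
    ("text", ["auf", "einem", "Stuhl"]),         # mixed-language continuation
]
cond = encode_interleaved(prompt)
print(cond.shape)  # (10, 8): 3 tokens + 4 patches + 3 tokens
```

The resulting sequence would then condition the diffusion model, e.g. via cross-attention, exactly as a purely textual prompt embedding would, which is what lets a monomodally trained generator consume multimodal input.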
Pages: 20
Related Papers
50 records in total; 10 shown
  • [1] Multi-lingual and multi-modal speech processing and applications
    Ivanecky, J
    Fischer, J
    Mast, M
    Kunzmann, S
    Ross, T
    Fischer, V
    PATTERN RECOGNITION, PROCEEDINGS, 2005, 3663 : 149 - 159
  • [2] Large-scale Multi-modal Pre-trained Models: A Comprehensive Survey
    Wang, Xiao
    Chen, Guangyao
    Qian, Guangwu
    Gao, Pengcheng
    Wei, Xiao-Yong
    Wang, Yaowei
    Tian, Yonghong
    Gao, Wen
    Machine Intelligence Research, 2023, 20 : 447 - 482
  • [3] Large-scale Multi-modal Pre-trained Models: A Comprehensive Survey
    Wang, Xiao
    Chen, Guangyao
    Qian, Guangwu
    Gao, Pengcheng
    Wei, Xiao-Yong
    Wang, Yaowei
    Tian, Yonghong
    Gao, Wen
    MACHINE INTELLIGENCE RESEARCH, 2023, 20 (04) : 447 - 482
  • [4] Finding and Editing Multi-Modal Neurons in Pre-Trained Transformers
    Pan, Haowen
    Cao, Yixin
    Wang, Xiaozhi
    Yang, Xun
    Wang, Meng
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 1012 - 1037
  • [5] Large Scale Multi-Lingual Multi-Modal Summarization Dataset
    Verma, Yash
    Jangra, Anubhav
    Kumar, Raghvendra
    Saha, Sriparna
    17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023, : 3620 - 3632
  • [6] PMMN: Pre-trained multi-Modal network for scene text recognition
    Zhang, Yu
    Fu, Zilong
    Huang, Fuyu
    Liu, Yizhi
    PATTERN RECOGNITION LETTERS, 2021, 151 : 103 - 111
  • [7] Probing Multi-modal Machine Translation with Pre-trained Language Model
    Kong, Yawei
    Fan, Kai
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 3689 - 3699
  • [8] Fast multi-modal reuse: Co-occurrence pre-trained deep learning models
    Iyer, Vasanth
    Aved, Alexander
    Howlett, Todd B.
    Carlo, Jeffrey T.
    Mehmood, Asif
    Pissinou, Niki
    Iyengar, S. S.
    PROCEEDINGS OF SPIE - THE INTERNATIONAL SOCIETY FOR OPTICAL ENGINEERING, 2019, 10996
  • [9] Difference between Multi-modal vs. Text Pre-trained Models in Embedding Text
    Sun, Y.
    Cheng, X.
    Song, R.
    Che, W.
    Lu, Z.
    Wen, J.
    Beijing Daxue Xuebao (Ziran Kexue Ban)/Acta Scientiarum Naturalium Universitatis Pekinensis, 2023, 59 (01): 48 - 56
  • [10] Fast Multi-Modal Reuse: Co-Occurrence Pre-Trained Deep Learning Models
    Iyer, Vasanth
    Aved, Alexander
    Howlett, Todd B.
    Carlo, Jeffrey T.
    Mehmood, Asif
    Pissinou, Niki
    Iyengar, S. S.
    REAL-TIME IMAGE PROCESSING AND DEEP LEARNING 2019, 2019, 10996